Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

Loading...

From the course by Johns Hopkins University

Bioconductor for Genomic Data Science

100 ratings

Johns Hopkins University

100 ratings

Course 6 of 8 in the Specialization Genomic Data Science

Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

From the lesson

Week Four

In this week, we will cover Getting data in Bioconductor, Rsamtools, oligo, limma, and minfi

- Kasper Daniel Hansen, PhDAssistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

In this session, we will discuss count based analysis of RNA-sequencing data.

We imagine or we have data that has been summarized at a feature level,

whereby a feature I mean either a gene or an exon.

So for each sample, we know how many reads match your particular feature.

And we have that summarized as an entry, so

we know exactly how many reads we don't have like fractional measurements.

Let's take an example data set, the airway packets and see how that data looks like.

This is naturalistic order in a summarized experiment, which

we have here.

We can see that we have eight different samples,

and we have 64,000 different features.

We have entered accounts inside of the features, and that tells us

something about the expression level of the different data set.

This is completely unnormalized data.

The variable of interest for this particular comparison is a dex covariant,

which tells us whether the sample belongs to the treatment group or

the untreatment group.

As always in R, the reference level for factor is the first level.

and in this case it's a treatment.

So if we start fitting models to the data here,

we're going to find differences from treated to untreated.

That's a little unnatural, so we use the read level function and

we reset or we change the reference level.

So now you can see here that the reference level is now the untreated catalog.

This summarized experiment contains a rich piece

of information on what the gene model is.

For example, we can see it for the first gene,

we have the 17 exons that are part of that particular gene.

So getting this gene by a sample account matrix is not easy, and

there are no consensus way of doing it.

In order to do it you have to get an annotation, or you have

to select some features, and you have to select a way of counting the overlaps.

Here, you particularly have to pay attention to what happens for

a gene that has multiple transcripts, as we see a lot in humans.

Are we interested in exons that belong to all transcripts from the gene or

are interested in exons that belong to any one transcript of the gene?

It's well known, or

it's known that the choices you make in constructing this count matrix does

have some impact on the statistical probitors of the data and your results.

And there's no consensus on the best way to do it.

You can do this using values packages in R there's a feature

counts function from an R package called R self read.

You can use list.

You can use genomic alignments and urge to counting and

finally a very popular Python library called ht seek is also of use for this.

And then a lot of people have rolled out their own solutions for it,

including myself.

So we will take as a starting point this account matrix, and we're going to show

how we fit statistical models that are very similar to functionality that's very

similar to what we have seen in the Limmer packets to this type of data.

The two leading packages for analyzing this type of data is a package called gitR

from the same authors as lima, or TEC2 from another group.

That's obviously a continuation of the TEC package,

TEC2 allows more complicated designs.

So both packets is in some ways, do kind of the same thing.

They fit a class of models called generalized linear

models based on the negative binomial distribution, but

they differ a lot in the specifics of how they implement these models and

how the models for information across the genes

typically to estimate the variability in the data.

So what's going to follow now is not really a thorough statistical

analysis using these two packages.

It's rather an extremely shallow walk-through of the steps

you go through in order to use the packages, really are using them and

getting believable results out of this data set.

The other data sets requires more work and

understanding the statistical problems within the models, and how to.

We're going to start off by just running the edgeR package, so let's load that.

So the edgeR pack is in the same as Limmer,

it has a number of package specific containers and

unlike Lemmer, we can't just run edgeR directly on summarized experiments.

So the first thing we have to do is we have to take our data from the summarized

experiment and converted into a limma class.

The limma class converts into an edgeR data class.

The data class is called DGEList and we construct it like this.

Basically here we have given it a counts matrix,

which is pretty straight forward and

we have given it a group variable which is a main group of interest in this set up.

Now there are two pieces of information that would be nice to

take into this PGE list.

One piece of information is the phenodata and another piece of information

is information about which genes you actually profiled in the experiment.

The phenol data comes into the g-list through something called the samples log.

And weirdly enough, we can not give

the samples information directly to the PG list constructor.

So we see that a sample thing been constructed, but

we couldn't give it additional columns when we constructed our DGE thing here.

So if you want to take out additional data on these additional co-variance and

these sample and put them in here, we have to create a new data frame and

put in these samples right here.

So that's what happens in the next couple of commands,

I merge the existing

data frame from the dg-samples with the call data from the airway data set.

I have to convert my call data into

classic data frame because that's what XR uses.

And the bi argument here, means that I'm merging based on the role names.

The two data frames have the same role names.

So now I have a slightly more rich set of phenotypes here.

I also would like to put information about the genes in here.

And here we are a little big constrained by the fact that the DG list

has, the way it installs information and

genes is in the data frame, and what we had from the airways data set was

the representation of exactly what we accounted on.

So we can't really put that into the DG list, but we can

do something that is at least a minimum, which says we can give it the G names.

If we look at [INAUDIBLE],

we don't have any information.

At least we would like to have some names on the genes in there.

So I do that here.

I basically take the names of the row ranges, and that's, yeah,

that's basically the names of the different genes profile, and

put it into a data frame and put it into the gene slot.

So let's what came out of this.

Basically a name column with the name of the gene.

So now we've sent it off and the first thing we do in edge size we calculate

something called normalization factors or effects of library sizes.

So a lot of people or at least when RNA sequences started out,

you use the total number map reads to calculate something called RPKNs.

It has later been shown that you have to spend a bias for

different library sizes and different sequencing dips and the different samples,

but the best way to do this is not just to count how many reeds were met.

Why that is, is a slightly longer discussion ,but limit has a port for

estimating the effects of library size and in the factors function and,

let's see here.

You can see that inside this samples data frame there's now a factual norm.factors,

which tells you about how much the different data sets have to

be scaled up and down in order to have the same effective library size.

We can see that there is actually stored into the field data here, and

we can see that the different samples have actually been sequenced to a very similar

depth because the known factors are very close to each other.

Now the next step we do when we our model is

that we estimate the dispersion of basically the variability of the data.

And we do that in two steps,

there are various ways you can estimate the dispersion.

We're going to estimate something called a checkwise dispersion.

And in order to do that, we first have to estimate a common dispersion, that's kind

of estimating basically a situation where a dispersion parameter assuming

that the dispersion for all genes are the same, which is probably unrealistic.

And then once that's estimated, we can estimate a checkwise dispersion.

It looks a little weird that we have to call

two very similar commands after each other, but this is intentional.

So now we have estimated the effective library size, we have estimated

the dispersion parameters, and we are ready to do some model fitting.

As usual we store our information about the design and

our personal interest into a design matrix.

This should be familiar to you by now, at least for

the simple case where we have a two group comparison.

The design matrix looks as we usually see and now we fit the model.

We fit the GLM for each gene which we do use in the GLM fit function.

The next step is to figure out for our coefficients, our contrast of interest,

which genes are most significant there.

To do that, we can do this in various ways, good example here,

I do it using a test, using the GLM, test.

The coefficient equal 2 means that I'm testing the second coefficient

in the sign matrix, which is equivalent to the coefficient representing,

which is equal to the coefficient representing differences between

the treatment and the un-treatment group.

And I can get a list of my top tags, the ones that are most

differentially expressed between cases and

controls, treatment and non-treatment.

So this is how we do it when it's SR, let's take a look at how we do it with

Seq2, and let's start by loading a library.

So, like SR, in order to use DESeq2,

you have to put your data into DESeq2 container.

Unlike SR, it's actually quite easy to do because we can convert it

directly from a summarized experiment.

We set it up like this and now here we see an important thing to pay attention to

that we need to store the design matrix inside the data object.

Furthermore when we do things later on, so

if you have multiple variables in the design,

which you might need when you have more complicated designs,

what PEseek focuses on is the last variable you have listed.

Now once we have severed off it's actually quite easy to fit the model because all

you run is you run DESeq on your data container, and it estimates some things.

You can see that it's actually very similar to what happens with SR right you

can see it estimates size factors or library sizes or norm factors.

Then it estimates dispersions and then it fits the model and test.

So this is really a lot of different SR commands wrapped into one.

Now we get the results out by calling results and the resulting object.

And this data frame, or this object here is not going to be solid.

We'll have a look at it.

It's perhaps a little hard to see but this is not sorted according to for

example, the adjusted p-value which we have over here to the right.

So I order it according to the adjusted p-value.

And I print it again and now we have kind of

the top differentially expressed achieved according to DESeq2.

And if you scroll up and you do a little bit of comparison,

you'll notice that there's actually no overlap in the top

five most differently expressed genes between the two methods.

So this is a quick run-down on how you kind of utilize these packages.

There's a lot of information living in it and

in the papers lying behind the packages.

And don't take this as a full,

don't take this session as the only thing you need to know about

using ExamPC are these are great statistical tools, and

they require some understanding before they can be successfully used.

But now you have a little bit of an idea of how to do a very simple analysis.

Coursera provides universal access to the world’s best education,
partnering with top universities and organizations to offer courses online.