An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the critical steps in a genomic analysis is normalizing the samples so

that they have common distributions.

You particularly want to do this when the distributions are likely being driven

by some technical variables.

So, I'm going to show one example of one of the most popular ways to normalize data

across samples.

So the first thing that I'm going to do,

is I'm going to set up the plotting parameters like I typically use them, and

then I'm going to load the libraries that I need, in this case preprocessCore,

the main library that we're going to be using for this example. And

so again I'm going to load in this data set, a combination of the Montgomery and

Pickrell data sets, for doing these examples, and I'm going to basically

extract out the phenotype data, the expression data and the feature data so

that they're easier for us to work with when doing these examples.

And so, once I've done that I'm going to transform the data and

remove the low values.

So now I'm left with this expression data that has about 5862 genes in it,

129 samples.
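
The transform-and-filter step can be sketched as follows. This is a minimal illustration in plain Python, not the R code used in the lecture: the transcript does not say which transform or cutoff is applied, so the log2(x + 1) transform, the row-mean cutoff of 3, and the toy count table are all assumptions.

```python
import math

def log2_transform(counts):
    """Apply log2(x + 1) to every entry of a genes-by-samples table."""
    return [[math.log2(x + 1) for x in row] for row in counts]

def drop_low_rows(table, min_mean):
    """Keep only the rows (genes) whose mean is above min_mean."""
    return [row for row in table if sum(row) / len(row) > min_mean]

# Hypothetical counts: 3 genes (rows) by 4 samples (columns).
counts = [
    [0, 1, 0, 2],          # almost no signal
    [120, 95, 210, 160],
    [15, 31, 7, 63],
]

logged = log2_transform(counts)
filtered = drop_low_rows(logged, min_mean=3)  # cutoff of 3 is an assumption

print(len(counts), "genes before filtering,", len(filtered), "after")
```

The first gene has a mean log value well below the cutoff, so it is dropped and the other two are kept.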

So the first thing I want to do,

is I'm going to show the distribution of each of these different samples.

So as we talked about in exploratory analysis, one way to do that is to

basically make a plot of the density of each of these samples.

So here, I'm going to make a plot of the density of the first,

the values from the first sample.

I'm going to use the first color from this color ramp, and

I'm going to plot it here on this.

So here is the distribution for the first sample, and then what I'm going to do is

actually write a loop that loops over each of the other samples,

so it's going to go from 2 to 20, because I already did sample one. For each of

these samples I'm going to make a density plot, and in each one I'm

going to use lines to overlay another line from the color ramp on top of that.

When I do that I can see that some of those samples have nearly identical distributions

and some of them have big distributional differences between the samples.

That's likely due to technology and not due to biology.

So one thing that we can do is do quantile normalization like we talked about.

That's basically going to force the distributions to be exactly the same.

And so the way I'm going to do that is using that preprocessCore package,

I'm going to use the normalize.quantiles function.

And then I'm going to convert the expression data to a matrix and apply the function to it.

So now I have a new data set,

what this returns is a new data set of the same size.

So if I look at the dimensions of edata and the dimensions of norm_edata,

we see that it's exactly the same size, but

where things have been quantile normalized.
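
To make the idea concrete, here is a minimal sketch of quantile normalization in plain Python. It is not the preprocessCore implementation: the function name `quantile_normalize` and the toy table are hypothetical, and ties are broken by position here, which may differ from how normalize.quantiles treats tied values.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes-by-samples table.

    Each column is sorted, the sorted columns are averaged rank by rank to
    build a reference distribution, and every value is replaced by the
    reference value at its rank within its own column.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[i] for sc in sorted_cols) / n_cols
                 for i in range(n_rows)]

    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices ordered from the smallest to the largest value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result

# Hypothetical table: 4 genes (rows) by 3 samples (columns), no ties.
edata = [
    [5, 4, 3],
    [2, 1, 4],
    [3, 6, 6],
    [4, 2, 8],
]
norm_edata = quantile_normalize(edata)

# Same dimensions as the input.
print(len(edata), len(edata[0]), "->", len(norm_edata), len(norm_edata[0]))
# After normalization every column contains the same set of values.
for j in range(len(norm_edata[0])):
    print(sorted(row[j] for row in norm_edata))
```

Every column of the output holds exactly the same values, only assigned to different genes, which is why the density curves overlap after normalization.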

So now, again, I can make a density plot for the normalized data

of the distribution and it looks like this after normalization.

And then I can again loop over the first 20 samples and

add lines and overlay them on top of that plot.

And so

you see when I do that, they basically all land right on top of each other.

Now there's a little bit of variability down here on the low end; that's because

the quantiles for the very low values are difficult to match up,

so often you'll see a little bit of variation here in the low values or

the really high values in quantile normalization.

But for the most part, the distributions lie exactly on top of each other now.

And so the cool thing here is that

this is basically removing bulk differences in the distribution, but

it hasn't removed the gene by gene variability.

So here, what I'm going to do,

is I'm going to plot the normalized first gene and I'm going to color it by study.

And you can still see that there's a difference between the two studies.

Even though overall the bulk distributions are about the same, you can actually

see that for any individual gene you might have differences between the studies.
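
A small numeric sketch of this point, in plain Python. The table below is hypothetical: each column is a permutation of the same four values, which is what a quantile-normalized table looks like, and the study labels are made up for illustration.

```python
# Hypothetical quantile-normalized table: 4 genes (rows) by 6 samples (columns).
norm_edata = [
    [4.0, 4.0, 3.0, 1.0, 2.0, 1.0],  # gene 0: high in study A, low in study B
    [3.0, 3.0, 4.0, 2.0, 1.0, 2.0],
    [1.0, 2.0, 1.0, 4.0, 4.0, 3.0],
    [2.0, 1.0, 2.0, 3.0, 3.0, 4.0],
]
study = ["A", "A", "A", "B", "B", "B"]  # hypothetical study label per sample

# Every sample has the same overall distribution of values...
sorted_columns = [sorted(row[j] for row in norm_edata)
                  for j in range(len(study))]
same_distribution = all(col == sorted_columns[0] for col in sorted_columns)

# ...but the first gene still differs between the two studies.
gene0 = norm_edata[0]
values_a = [x for x, s in zip(gene0, study) if s == "A"]
values_b = [x for x, s in zip(gene0, study) if s == "B"]
mean_a = sum(values_a) / len(values_a)
mean_b = sum(values_b) / len(values_b)

print("identical sample distributions:", same_distribution)
print("gene 0 mean in study A:", mean_a)
print("gene 0 mean in study B:", mean_b)
```

Matching the sample-level distributions only constrains each column as a whole; it says nothing about where an individual gene sits within each column.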

So to see that more clearly we can actually do a decomposition. So

if we do the svd on the normalized data,

we're going to subtract the rowMeans again so that

the first pattern of variation isn't just the mean level of each gene.

And then I'm going to plot the first

singular vector versus the second singular vector, that usual plot that people make.

I can color that by study and you can see that they're still separated by study.
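
The same check can be sketched without plotting. The example below is an illustration in plain Python, not the lecture's code: it uses a hypothetical already-normalized table, subtracts the row means, and approximates only the first right singular vector with power iteration on the samples-by-samples matrix A^T A, a stand-in for a full svd call. The function names are hypothetical.

```python
import math

def center_rows(table):
    """Subtract each row's (gene's) mean from that row."""
    return [[x - sum(row) / len(row) for x in row] for row in table]

def first_right_singular_vector(a, n_iter=200):
    """Approximate the top right singular vector of a by power iteration."""
    n_cols = len(a[0])
    # Gram matrix A^T A, one row and column per sample.
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n_cols)]
            for i in range(n_cols)]
    v = [float(i + 1) for i in range(n_cols)]  # fixed, non-random start
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n_cols))
             for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical normalized table: 4 genes by 6 samples, each column a
# permutation of the same values; samples 0-2 are study A, 3-5 are study B.
norm_edata = [
    [4.0, 4.0, 3.0, 1.0, 2.0, 1.0],
    [3.0, 3.0, 4.0, 2.0, 1.0, 2.0],
    [1.0, 2.0, 1.0, 4.0, 4.0, 3.0],
    [2.0, 1.0, 2.0, 3.0, 3.0, 4.0],
]
study = ["A", "A", "A", "B", "B", "B"]

centered = center_rows(norm_edata)
v1 = first_right_singular_vector(centered)

for label, score in zip(study, v1):
    print(label, round(score, 3))
```

In this toy table the study A samples all get one sign and the study B samples the other. The overall sign of a singular vector is arbitrary, so only the separation between the groups is meaningful.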

So even though we've done quantile normalization, and

the samples all have sort of the same distribution,

we haven't removed the sort of gene to gene variability in expression patterns.

So that's an important thing to keep in mind,

that even though you've normalized out the total distribution, you can still have

artifacts like batch effects or other types of artifacts in the data set.
