An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

94 ratings


Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

By its nature, genomics data is usually very high dimensional, and so you want to reduce that dimension when visualizing or modeling the data. So here I'm going to do the typical setup steps to get my plotting parameters like I like them and to load the libraries that we'll need. In this case, it's mostly the base packages that we're going to be using, and then I'm going to load in a data set here from this URL. It's actually a combination of two data sets, from the Montgomery and the Pickrell papers. They are two different populations measured in two different labs, and that'll be useful for this lecture. So I'm going to load the data in and, again, extract out the phenotype data, the expression data, and the feature data for this data set, so that we can use that data to make some plots and to do some dimension reduction.

So again, just to make it a little bit easier to visualize, I'm going to filter out all the rows where the row mean is less than 100, so I reduce the size of the data set. And then I'm going to apply the log transform so that it will be on a scale that's a little easier to work with. So the next thing that I'm going to do is actually center the data,

because when we're doing the singular value decomposition, if you don't center the data, if you don't remove the row means or the column means of the data set, then the first singular vector will always reflect the mean level, since that will always explain the most variation in a genomic experiment. And we actually want to see variation between samples or between genes, so we're going to remove that mean variation and look at the variation that differs between genes. And so once I've got that centered data set, I can apply the svd function to calculate the singular value decomposition. So this singular value decomposition has three parts to it, the three matrices d, u, and v. So d is the diagonal matrix, and the function just returns the diagonal elements of that matrix for you. In this case, the data set that we're dealing with has 129 columns, so there's 129 singular values.
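The steps so far — filtering low-expression genes, log-transforming, row-centering, and taking the SVD — can be sketched in numpy. This is a stand-in with random data, not the Montgomery/Pickrell set, and all the array names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in expression matrix: 500 genes (rows) x 129 samples (columns)
edata = rng.poisson(lam=200, size=(500, 129)).astype(float)

# Keep only genes whose mean expression is at least 100
edata = edata[edata.mean(axis=1) >= 100, :]

# log2 transform (with a pseudocount) to compress the scale
edata = np.log2(edata + 1)

# Remove each gene's (row) mean so the first singular vector
# captures variation rather than overall expression level
edata_centered = edata - edata.mean(axis=1, keepdims=True)

# SVD: edata_centered = U @ diag(d) @ Vt
U, d, Vt = np.linalg.svd(edata_centered, full_matrices=False)
# d holds one singular value per column (129 of them);
# U (genes x 129) describes variation across genes,
# and the rows of Vt (129 x samples) describe variation across samples
```

In R the analogous call is `svd(edata_centered)`, which returns the same `d`, `u`, and `v` components.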

And then the other components, the u and v components: the v component has 129 values per column, so that's basically telling me something about the variation across samples, and the variation across genes is captured by u. And so the first thing that we might want to do is plot the singular values. So we're going to plot the d values, and these would be the singular values.

And I'm going to make those in blue. So here I can see those singular values plotted versus their index, ordered from biggest to smallest. So then the next thing I want to do is plot the variance explained, and to do that, remember that I have to calculate each singular value squared divided by the sum of the singular values squared. And once I've done that, I've calculated the variance explained. So I can plot that, again in blue, on the same kind of plot, and I can see that the first singular value explains more than 50% of the variance. So it's a highly explanatory component. So then I can make a plot of that and see, what could that variable be? Again, I'm going to make a plot that's two panels, so I use par(mfrow = c(1, 2)). And then I'm going to plot the first two eigengenes, or right singular vectors, or principal components. You'll see in a minute that they're not exactly the principal components, but people use the terms sort of interchangeably. So I plot that first principal component, and then I plot the second one.
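The variance-explained calculation described above — each singular value squared over the sum of squares — looks like this in a numpy sketch (random stand-in data again):

```python
import numpy as np

rng = np.random.default_rng(1)
edata_centered = rng.normal(size=(500, 129))
edata_centered -= edata_centered.mean(axis=1, keepdims=True)

# compute_uv=False returns only the singular values, largest first
d = np.linalg.svd(edata_centered, compute_uv=False)

# Proportion of variance explained by each component
var_explained = d**2 / np.sum(d**2)
# The proportions sum to 1 and come out sorted largest-first,
# so plotting var_explained against its index gives a scree plot
```

The R equivalent is `sv$d^2 / sum(sv$d^2)` for `sv <- svd(edata_centered)`.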

And so the first thing that people often do is they might want to color these by different variables to see if there's something going on.

To do that, it's very common to plot the first right singular vector versus the second right singular vector. So here I'm going to set it up so that there's a one-by-one plot again. And if I make that plot, I can see there's this pattern here, and the thing that people often do is make this same plot, only colored by a particular variable. So in this case, I'm going to color the PCs by what study they come from, so here I'm setting the color to be the numeric version of the study variable.

And so now I've added color to the previous plot, and you can see here, if you look along the PC1 axis, the two studies have very different values of that PC. So it seems that one of the big sources of signal in the data set is which study the samples come from. Another way that people often look at this is to make a box plot of that first principal component, since that's the one that separates the two studies, versus the study variable.
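To see why the first PC can separate the studies, here is a numpy sketch with a simulated batch effect — a made-up `study` label and a mean shift standing in for the Montgomery/Pickrell split:

```python
import numpy as np

rng = np.random.default_rng(2)
study = np.array([0] * 60 + [1] * 69)   # fake study labels for 129 samples
edata = rng.normal(size=(500, 129))
edata[:, study == 1] += 2.0             # simulated study/batch shift

centered = edata - edata.mean(axis=1, keepdims=True)
U, d, Vt = np.linalg.svd(centered, full_matrices=False)

pc1 = Vt[0]                             # first right singular vector
# With a batch effect this large, PC1 lines up with the study label,
# which is exactly the pattern the colored scatter and box plots reveal
corr = np.corrcoef(pc1, study)[0, 1]
```

The split sizes (60 and 69) and the shift of 2.0 are arbitrary choices for the illustration; any sufficiently large study effect produces the same behavior.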

And then it's always a good idea to show as many of the data points as possible, so you can overlay the data points on top of the box plot by plotting the same singular vector versus a jittered version of the study variable. And coloring it by the study variable, you can see that there's a big difference in that first principal component between the Montgomery and the Pickrell studies. So that's how you do the singular value decomposition. So, to do the principal components, you can use the prcomp function and apply it to the same data set. And even though I've been sort of using the two terms interchangeably, they are not quite the same thing. So if I plot the first principal component versus the first singular vector, they're not the same thing, and that's because I haven't actually scaled them in the same way.

So it turns out that if you scale the data differently, doing a second round of centering, but this time subtracting the column means rather than the row means, then I have a data set that's centered by column instead of centered by row, and I can calculate the singular value decomposition on that. And when I do that and then plot the first principal component versus the first singular vector from the column-centered data, I get that they're identical to each other. So basically what's going on is that if you column-center the data and then do the SVD, you get exactly the principal components, because the principal components are calculating something about the variability between the columns. And so PCs and SVDs can compute the exact same thing if you do the centering right. One thing to keep in mind is that outliers can really drive these decompositions. To illustrate that, I'm going to take our centered data, edata_centered, and assign it to the new variable edata_outlier. And then I'm going to make one of those values a real outlier, so I'm going to take the sixth gene and multiply it by 10,000. So this is now a very outlying gene with very high values. Now I'm going to apply the SVD to the outlying data set, and if I plot the first singular vector from the decomposition without the outlier versus the one from the decomposition with the outlier, I can see that they don't match each other anymore. But you can definitely see that the first singular vector from the decomposition with the outlier reflects that outlier quite accurately. And so if you plot the first singular vector from this new decomposition versus the outlying gene's values themselves, you can see that they're very highly correlated with each other. So what's happening is the decomposition is looking for patterns of variation, so if one gene is way more highly expressed than all the other ones, then it's going to drive most of the variation in the data set, and so the singular vector will be very correlated with it. So you have to be careful when using these decompositions to make sure that you pick the centering and scaling so that all of the measurements for all of the different features are on a common scale.
