An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One thing that's very common across all types of genomic and genetic measurements is that you often make a lot of measurements per sample. When you have a huge number of measurements, say 20,000 or 100,000 or a million measurements per sample, you want to be able to visualize and communicate patterns and identify relationships. And the best way to do that is to reduce the dimension, or reduce the number of measurements that you're looking at at any given time.

So here I'm going to illustrate this with a really simple example. I've generated a simulated data set with 40 different measurements for ten different samples. You can see that there appear to be two clusters, driven by measurements in these rows that are very different from each other.

One way that you could reduce the dimension of this is to just take the averages of the rows and the columns. Here I've taken the row averages: I take every single row, calculate its average, and plot it in this graph. So there's the average for row one, the average for row two, and so forth. You can see that, because there's this block pattern here, there's a difference in the overall average for those rows compared to rows that don't have that pattern. Similarly, if I take the average for this column and plot it here, and the average for this column and plot it here, and so forth, I can again see the difference between the two groups of columns by looking at the difference between these two groups of points.
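The row and column averages described here can be sketched in a few lines of Python with NumPy. This is only an illustration under assumptions: the lecture's simulated data are not given, so the matrix below, its block location, the shift of 3.0, and the random seed are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the lecture's data: 40 measurements (rows)
# for 10 samples (columns), starting from pure noise.
X = rng.normal(size=(40, 10))

# Assumed block pattern: the last 20 rows are shifted up in the last 5 samples.
X[20:, 5:] += 3.0

row_means = X.mean(axis=1)  # one average per row, length 40
col_means = X.mean(axis=0)  # one average per column, length 10

# Rows and columns that touch the block have visibly higher averages.
print(row_means[:20].mean(), row_means[20:].mean())
print(col_means[:5].mean(), col_means[5:].mean())
```

Each set of averages collapses the 40 by 10 matrix into a single vector, which is the simplest form of dimension reduction discussed in the lecture.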

Now this works really well when there's only one pattern and all the effects go in the same direction. But in general it's not that easy, so you might want to think about different ways of doing this.

There are two related problems here. Imagine you have a data matrix X like the one we saw in the previous example. You want to find a new set of multivariate variables that are uncorrelated with each other and that explain as much of the variability across the rows or across the columns as possible. A related mathematical idea is to find the best approximation to the original matrix that has lower rank, or fewer variables, and that still explains the original data. These two goals are a little bit different. One is statistical and the other is data compression, or mathematical, but it turns out that they can have very similar solutions.

So here is the solution: the singular value decomposition. Imagine that you have a data matrix with the genes, features, or SNPs in the rows and the samples in the columns. You can decompose it into three matrices: U, D, and V transpose. Those three matrices contain the left singular vectors, the singular values, and the right singular vectors of the matrix.
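As a minimal sketch of this decomposition, NumPy's `np.linalg.svd` returns the three pieces directly. The 40 by 10 random matrix is an assumption standing in for real data; the course itself does not use this code.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(40, 10))   # rows: genes/features, columns: samples (made-up data)

# full_matrices=False gives the compact form: U is 40x10, d has 10 values, Vt is 10x10.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, d.shape, Vt.shape)  # (40, 10) (10,) (10, 10)

# NumPy returns the singular values as a vector; D is the matching diagonal matrix.
D = np.diag(d)

# Multiplying the three matrices back together recovers the original data.
X_rebuilt = U @ D @ Vt
print(np.allclose(X, X_rebuilt))  # True
```

The singular values come back sorted from largest to smallest, so the first columns of U and the first rows of Vt correspond to the strongest patterns.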

In the left singular vectors, the columns of U, we see patterns that exist across the different rows of the data set. This is analogous to taking the row means that we talked about on the previous slide, in the sense that it's trying to identify patterns across the rows. The D matrix tells you how much of the variation each of the patterns explains. It's a diagonal matrix, so there are only elements along the diagonal, and those elements quantify how much of the variance is explained by the different patterns. The right singular vectors in V transpose tell you about the column patterns, like the column means that we saw a minute ago. This is analogous to looking at the column means, in the sense that it's looking for patterns across the columns that are shared by multiple rows.

There are a couple of mathematical properties of these. The singular vectors are calculated one at a time, and they are orthogonal to each other. That applies to the columns of U and to the columns of V, which are the rows of V transpose. Loosely speaking, that means they are uncorrelated with each other. The columns of V describe patterns shared across genes, with one value per array, and the columns of U describe patterns shared across arrays, with one value per gene.
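The orthogonality property can be checked numerically. The sketch below assumes a made-up 40 by 10 matrix and uses NumPy's compact SVD; it only verifies that the columns of U, and the columns of V, have unit length and zero dot product with each other.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(40, 10))   # made-up data: 40 genes by 10 arrays

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T  # columns of V are the right singular vectors

# For orthonormal columns, the matrix of all pairwise dot products is the identity.
utu = U.T @ U
vtv = V.T @ V
print(np.allclose(utu, np.eye(10)))  # True
print(np.allclose(vtv, np.eye(10)))  # True

# Each column of U has one entry per gene; each column of V has one entry per array.
print(U[:, 0].shape, V[:, 0].shape)  # (40,) (10,)
```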

The other thing to look at, and it's a little hard to read here, is d sub i, the ith diagonal element of D. So d sub one is this element, d sub two is this element, and so forth. If you take d sub i squared and divide it by the sum of all the squared d values, you get the percentage of variance explained by the ith column of V.
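Written as a formula, the percentage of variance explained by the ith pattern is the ith squared singular value as a share of all the squared singular values. The symbol n for the number of singular values and the label PVE are introduced here only for illustration.

```latex
\mathrm{PVE}_i = \frac{d_i^{2}}{\sum_{j=1}^{n} d_j^{2}}, \qquad i = 1, \dots, n
```

The values sum to one across all n patterns, so they can be read as fractions of the total variation.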

So let's illustrate how that looks. Remember our example, where we have, say, 40 genes in the rows and two groups of samples, and what we're looking for is patterns. When we took the row means and the column means, we saw those patterns come out. If you look at the singular value decomposition of this matrix, you see something similar emerge. The first left singular vector, the first column of U, turns out to look much the same. You see that there's one pattern for these rows and another for these rows, which corresponds to the pattern in the matrix. Similarly, if you look across the columns of the matrix, you see that there's a difference between these columns and these columns, and that appears in the first right singular vector, also called the first eigengene or the first principal component. Again there's a big difference between the two groups.

So it's not exactly the same as taking the means. That's important, and it will become more important later in the lecture. But right now it does show that you can pull out low-dimensional patterns from high-dimensional data using this decomposition.
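This also connects back to the goal of finding a lower-rank approximation of the matrix. Keeping only the first singular value and its two singular vectors gives the best rank-one approximation in the least-squares sense (the Eckart-Young result), and the part that is lost is exactly what the remaining singular values measure. The sketch below uses a made-up matrix with one block pattern; the data, block size, and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(40, 10))   # made-up data
X[20:, 5:] += 3.0               # assumed block pattern

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-one approximation: first singular value times the outer product
# of the first left and first right singular vectors.
X1 = d[0] * np.outer(U[:, 0], Vt[0])

# Frobenius norm of what the approximation misses ...
err = np.linalg.norm(X - X1)
# ... equals the root of the sum of the remaining squared singular values.
leftover = np.sqrt(np.sum(d[1:] ** 2))
print(np.isclose(err, leftover))  # True
```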

Now if I plot the d values, the values along the diagonal of the D matrix, I can see that there's one large value and then the rest of the values tail off. This is even clearer if you take each d value, square it, and divide by the sum of all the squared d values. Then you get the percentage of variance explained. Nearly 40% of the variance in that matrix is explained by the first pattern, which isn't surprising because, if you look at the data, about 40 or 50% of the rows have a strong pattern in them.

This becomes even clearer if you make the matrix really simple. For example, suppose that every single row is exactly the same: a constant high value, then a constant low value. In this case there's only one pattern in the data set. There's no random variation at all, so the first singular value is very high and the rest of the singular values are essentially zero. If you calculate the percentage of variance explained, the first pattern explains all of the variance in the matrix, which makes sense because there's only one pattern in the matrix.
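A minimal sketch of this constant-row case, assuming 40 identical rows that are high (1) for five samples and low (0) for the next five; the specific values are made up for illustration.

```python
import numpy as np

# Every row is the same: a constant high value, then a constant low value.
pattern = np.array([1.0] * 5 + [0.0] * 5)
X = np.tile(pattern, (40, 1))  # 40 identical rows, 10 columns

# compute_uv=False returns only the singular values.
d = np.linalg.svd(X, compute_uv=False)

# Percentage of variance explained: squared singular values over their sum.
pve = d ** 2 / np.sum(d ** 2)

print(np.round(d, 6))
print(np.round(pve, 6))
```

Because the matrix has rank one, only the first singular value is nonzero up to floating-point error, and the first entry of `pve` is essentially 1.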

The other thing you can look at is what happens if you have multiple patterns. For example, here we're going to generate a new matrix. This matrix again has a small number of rows, say 40 or so, and ten or so columns. The idea here is that you're looking for two different patterns. First there's a pattern that's high for the last five samples and low for the first five samples. Then there's a pattern that oscillates from low to high, low to high. You can see both in the data set here. Here you can see a block of samples that have high values, and here you can see the oscillating pattern: low, high, low, high, low, high.

In a matrix like this, when you do the singular value decomposition, you'd like to identify more than one pattern. It turns out that if you take the first two right singular vectors, you do get two different patterns, but they're not quite what you would hope they would be. We generated the data set with one pattern that was low for the first five samples and high for the next five, and another that oscillated low, high, low, high. But the first right singular vector is a combination of those two things. You can see that the first five samples are lower than the next five samples, but there's still oscillation within each of the two groups. The second right singular vector shows similar behavior. There's a difference between the first five samples and the next five samples, but there's also an oscillation within the groups.

So what does this mean? It means the singular value decomposition is finding the patterns that explain the most variation, but it doesn't necessarily separate out the patterns due to the variables that you might care about. It's not a perfect recapitulation of the variables that generated the data set, but it does still give you some idea of the patterns that you might see in the data. Again, if you calculate the percentage of variance explained, here are the d values plotted from one to ten, because D is a diagonal matrix. You can see that the percentage of variance explained is still very high for the first pattern and the second pattern, and then it drops off. So again we're getting some idea of the dimension of the true underlying variables that are contributing to the data set, as well as what they look like. But they're not exactly the same, because of this requirement of orthogonality.
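The mixing of the two patterns can be reproduced in a small simulation. This is a sketch under assumptions, not the lecture's data: which rows carry each pattern, the effect size of 5.0, the row centering, and the seed are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

block = np.array([0.0] * 5 + [1.0] * 5)  # low for first five samples, high for last five
oscillate = np.array([0.0, 1.0] * 5)     # low, high, low, high, ...

X = rng.normal(size=(40, 10))  # noise
X[:20] += 5.0 * block          # assumed: rows 0-19 carry the block pattern
X[10:30] += 5.0 * oscillate    # assumed: rows 10-29 carry the oscillating pattern

# Center each row so the decomposition describes variation around the row mean.
Xc = X - X.mean(axis=1, keepdims=True)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
pve = d ** 2 / np.sum(d ** 2)

# Two singular values stand out, matching the two generating patterns.
print(np.round(pve, 3))

# The first right singular vector is related to both patterns, not just one.
corr_block = abs(np.corrcoef(Vt[0], block)[0, 1])
corr_osc = abs(np.corrcoef(Vt[0], oscillate)[0, 1])
print(corr_block, corr_osc)
```

In this setup the first two patterns account for most of the variation, yet the first right singular vector is only partially correlated with each generating pattern, which is the behavior described in the lecture.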

So how is this applied? I'm going to show you one example from genetics. In this example, they took a genetic data matrix with many, many SNPs (single nucleotide polymorphisms) in the rows and samples from people from different places throughout Europe in the columns. They calculated the first two singular vectors, which are equivalent to the first two principal components, PC1 and PC2 here. When you plot each sample according to these two principal components, you see that the samples cluster by geography. For example, you see the Spanish and Portuguese samples down here, the Italian samples over here, and so forth. So you basically get an identification of the structure in the genetic data that corresponds to the geographic structure. That makes sense, because genetic patterns tend to be associated with population structure, which is in turn associated with geography: people tend to have relationships and children with people who live close to them. So there's a relationship between geography and population structure.
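The equivalence between singular vectors and principal components depends on how the data are centered. The sketch below assumes one common convention: each row (feature) is centered across samples, and the principal components are taken from the feature covariance matrix. The data, group shift, and seed are made up; this is not the analysis from the genetics study.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_features, n_samples = 50, 10
X = rng.normal(size=(n_features, n_samples))  # made-up data
X[:25, 5:] += 2.0                             # assumed group structure

# Center each feature (row) across the samples.
Xc = X - X.mean(axis=1, keepdims=True)

# Route 1: SVD. Sample coordinates on the first pattern are d[0] * Vt[0].
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_scores = d[0] * Vt[0]

# Route 2: classical PCA via the feature-by-feature covariance matrix.
cov = Xc @ Xc.T / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
w1 = eigvecs[:, -1]                     # direction with the largest variance
pc1_scores = Xc.T @ w1                  # project each sample onto it

# The two routes agree up to an arbitrary sign flip.
print(np.allclose(np.abs(pc1_scores), np.abs(svd_scores)))  # True
```

The largest eigenvalue of the covariance matrix also equals the first squared singular value divided by the number of samples minus one.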

Another way this can be used is to identify patterns in a data set. Again, here I'm plotting PC1, or singular vector one, versus singular vector two. What I'm trying to do is find distances between samples, so I'm looking at the right singular vectors, which capture patterns across the samples. Each dot represents one sample. The dots are colored by whether they're a human or a mouse sample from this specific study, and the symbol shows which tissue the sample came from. In this plot, the distance between any two points is supposed to be an estimate of the distance between those two samples. If these PCs, or singular vectors, explain a large percentage of the variation, then that's a really close approximation of the distance between the two samples. If they don't explain a large percentage of the variation, then it's not a very good approximation.
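The link between distances in the plot and distances between samples can be checked directly. In this sketch the coordinates of each sample are the singular values times the right singular vectors; the data, the helper `sample_distances`, and the seed are hypothetical and introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(40, 10))   # made-up data: 40 features by 10 samples
X[:20, 5:] += 3.0               # assumed group difference

Xc = X - X.mean(axis=1, keepdims=True)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

def sample_distances(coords):
    """Euclidean distances between the columns of a (k, n_samples) array."""
    diff = coords[:, :, None] - coords[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

true_dist = sample_distances(Xc)  # distances using all 40 features

pc_coords = d[:, None] * Vt       # row i holds every sample's coordinate on pattern i
dist_all = sample_distances(pc_coords)      # using all patterns
dist_two = sample_distances(pc_coords[:2])  # using only the first two (the 2-D plot)

print(np.allclose(dist_all, true_dist))     # True: nothing is lost with all patterns
print(np.max(true_dist - dist_two))         # gap left by dropping the other patterns
```

With all patterns the distances are reproduced exactly. With only two, the plotted distances can never exceed the true ones, and the gap shrinks as the first two patterns explain more of the variation.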

Here you can see, for example, that the testis samples from human and mouse are close to each other, and the liver samples from human and mouse are also close to each other. If you actually do a clustering, you see that that's true: the testis samples cluster close to each other, as do the liver samples. So what does this plot suggest? It suggests that there's a closer relationship between samples from the same tissue than between samples from the same species.

Another way that you can use this is to identify effects that differ between groups. This is an actual example that comes from this book. In this example, they've taken a real data set and made a subset of it. The subset comes from two different batches, and within each batch they've taken some samples from men and some samples from women and looked at genes on the Y chromosome. Here you can see the women and the men from batch one, and here are the women and the men from batch two. You can see that there are some genes that are very different between the two batches, but there are also some genes that are different between the two sexes. If you compute the first singular vector of this data set, or the first principal component, you see that the biggest effect is the batch variable: batch one and batch two are very different from each other. So you can use this to detect different variables in the data set, whether they're batch effects or group differences, by decomposing the data into a smaller number of variables.
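A small simulation shows how the first singular vector can pick up a batch variable. Everything here is an assumption made for illustration: the number of genes and samples, which genes respond to batch or sex, the effect sizes, and the seed. It is not the data set from the book.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_genes, n_samples = 100, 12

# Hypothetical design: two batches of six samples, three of each sex per batch.
batch = np.array([0] * 6 + [1] * 6)
sex = np.array([0, 0, 0, 1, 1, 1] * 2)

X = rng.normal(size=(n_genes, n_samples))
X[:60] += 3.0 * batch  # assumed: 60 genes differ strongly between batches
X[90:] += 2.0 * sex    # assumed: 10 genes differ between sexes

# Center each gene, then take the first right singular vector.
Xc = X - X.mean(axis=1, keepdims=True)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# Compare the first pattern with each known variable (sign is arbitrary).
corr_batch = abs(np.corrcoef(pc1, batch)[0, 1])
corr_sex = abs(np.corrcoef(pc1, sex)[0, 1])
print(corr_batch, corr_sex)
```

Because the simulated batch effect touches many more genes than the sex effect, the first pattern tracks batch closely and sex only weakly, which mirrors the situation described in the lecture.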

As I said, this is widely used for batch effects, and it often comes up in technical artifact correction, which we'll talk about later. There are also many other decompositions people use, such as multidimensional scaling, independent component analysis, and non-negative matrix factorization. We're not going to cover those in this class, because they're not as widely used as PCA and SVD, but they are other matrix decompositions, or ways to reduce the dimension of data, that you might see. If you want a lot more discussion of this, you can find it in this Advanced Statistics for Life Sciences course, where they go into pretty deep detail about these different matrix decompositions.
