An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

94 ratings


Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Often the outcome isn't a continuous variable, so fitting a standard linear regression model isn't the right way to go. I'm going to talk about two specific examples here. But the first thing I'm going to do, as always, is set up my plotting parameters to be the way I like them, and then load some libraries that we're going to need. Specifically, here we're going to be looking at the snpStats and DESeq2 packages, and the broom package will be helpful for cleaning things up.
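A sketch of that setup, assuming the Bioconductor packages named in the lecture (snpStats, DESeq2) are already installed:

```r
## Libraries used in this lecture.
## snpStats and DESeq2 come from Bioconductor; broom is on CRAN.
library(snpStats)
library(DESeq2)
library(broom)
library(MASS)   # provides glm.nb, used later for negative binomial fits
```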

And so, the first thing I'm going to do is load this snpdata that we have from the snpStats package. And again, I'm going to subset to a smaller set of samples so that I can manage it a little bit more easily for this example; I take that subset to reduce the data set size. Okay, so the first thing I'm going to do is compute the principal components, just like I did previously. This takes a couple of intermediate steps. First I do the right transformation of the data to get the XX-transpose matrix, and then I calculate the eigenvalues and eigenvectors of that matrix. Once I've done that, I can calculate the PCs. So now I've got the PC matrix, and this is what I'm going to be using for adjustment; we learned about that in a previous lecture. Next I can subset out the snpdata itself. It's actually stored in a raw format that's a little bit easier to compute on, but for the purposes of this example you might want to just pull the data matrix out so we can deal with it on a scale that's a little bit more manageable for us. And so I can look at that snpdata: I have data on the following SNPs for the following people. Then I can get the status. The case-control status for the people is in the subject.support data set, so that gives me the status for the thousand people there. And then I can take the first SNP and extract the data for that SNP.
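The steps just described might look like the following, assuming the for.exercise example data that ships with snpStats (the object names snps.10 and subject.support come from that data set):

```r
data(for.exercise)                       # loads snps.10 and subject.support

## Subset to every 10th SNP to keep the example manageable
use    <- seq(1, ncol(snps.10), 10)
sub.10 <- snps.10[, use]

## Principal components: form XX', then take its eigendecomposition
xxmat <- xxt(sub.10, correct.for.missing = FALSE)
evv   <- eigen(xxmat, symmetric = TRUE)
pcs   <- evv$vectors[, 1:5]              # first five PCs, for adjustment

## Pull the raw genotype matrix out of the SnpMatrix object
snpdata <- sub.10@.Data
status  <- subject.support$cc            # case-control status
snp1    <- as.numeric(snpdata[, 1])      # genotypes for the first SNP
```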

So now I have the first SNP and the status for each person, and I want to fit a generalized linear model relating the two things. The first thing I need to be aware of is that in the way the snpdata is coded, there are some 0 values, and 0 values are actually meant to be missing. So I'm going to recode those values as missing values, so R will know what to do with them. Then I can fit a GLM using the glm command, just like I used the lm command for a linear model. I'm going to relate the status, remember that's either a 0 or a 1 for control or case, to the snp1 variable, basically relating the status to the SNP on an additive scale. And to make it a logistic regression model, I need to tell it family equals binomial. So now that model has been fit, and I can tidy it up. Now I get the estimate: this is basically the per-allele change in the log odds.
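Continuing the sketch, with the variables defined above: recode the 0s as missing, then fit the additive logistic model.

```r
## In the raw coding, 0 means missing; observed genotypes are 1, 2, 3
snp1[snp1 == 0] <- NA

## Additive logistic regression of case-control status on the SNP
glm1 <- glm(status ~ snp1, family = "binomial")
tidy(glm1)   # the snp1 coefficient is the per-allele change in log odds
```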

And then what I can do is fit different models. So that was the additive model. If I want to fit the dominant model, I can just change how I code that variable, so I basically require that you only have risk if there's either one or two copies of the minor allele. Those are the cases where snp1 is equal to 2 or 3. Then I can fit the dominant model with the next glm call, where instead of using the SNP variable as the covariate, I fit this dominant dummy variable. And so I get a slightly different estimate for that value, because I'm fitting a different model, so it's estimating something a little bit different.

The other thing I can do is directly adjust for things like the principal components, which is what people often need to do, because population structure is a typical confounder in these sorts of models. In this case I've just added a term for the principal components to my model. And so then I can look at that model, and it's got estimates for all the principal components, which I'm not as interested in, and then the estimate for the SNP variable, now adjusted for the principal component values.

If I want to fit many of these, you can actually do that quickly in R as well, by using the snp.rhs.tests function. On the left-hand side it takes the case-control status, or whatever your outcome is. On the right-hand side go the adjustment variables; for an unadjusted model, you just tell it to regress on 1 on that side. And then you tell it where the snpdata is, and it fits many GLMs. So if we look at the names of that object (it's an S4 class, so it has a slightly different way of accessing its components), we can see that one of the things you get out is the chi-squared statistics. So this fits a version of the generalized linear model, but calculates a chi-squared statistic.

And so we can plot those chi-squared statistics here versus what we would expect. This is again a QQ plot: the quantiles of the expected chi-squared distribution versus the observed chi-squared distribution. You see it's not the same. You would expect there to be very few real signals in a SNP data set, so you'd expect those statistics to not fall very far from the expected quantiles. We can fix that by adjusting. Here we're going to basically fit the same model, only now we're going to adjust for the PCs. So here we've got status tilde the PCs, that's the adjustment variables, with the same data set, and we fit that model. And then we can look at the chi-squared values for this one. And let's see here.
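The dominant recoding, the PC-adjusted fit, and the many-SNP tests described above might be sketched like this (variable names continue from the earlier snippets):

```r
## Dominant model: risk only if carrying one or two copies of the minor allele
snp1_dom <- as.numeric(snp1 == 2 | snp1 == 3)
glm1_dom <- glm(status ~ snp1_dom, family = "binomial")
tidy(glm1_dom)

## Adjust directly for population structure via the principal components
glm2 <- glm(status ~ snp1 + pcs, family = "binomial")
tidy(glm2)

## One score test per SNP; "~ 1" means no adjustment variables
glm_all <- snp.rhs.tests(status ~ 1, snp.data = sub.10)
slotNames(glm_all)                       # S4 object, so access via slots/accessors

## QQ plot of observed vs expected chi-squared statistics
qq.chisq(chi.squared(glm_all), df = 1)
```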

You can see that they fall much more closely along the quantiles you would expect once you adjust for the principal components, basically the population structure. So that's how you can fit generalized linear models for logistic regression. And then for Poisson regression models, we can use a different data set. Here we're going to use an expression data set, the Bottomly expression data set, and then what we can do is filter out those genes where there aren't very many counts.
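The PC-adjusted version of the per-SNP tests, plus loading and filtering the Bottomly data, could be sketched as follows (the download URL is the one used in the course materials and may change; treat it as an assumption):

```r
## Same per-SNP tests, now adjusted for the principal components
glm_all_adj <- snp.rhs.tests(status ~ pcs, snp.data = sub.10)
qq.chisq(chi.squared(glm_all_adj), df = 1)

## Bottomly expression data, from the ReCount project
con <- url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(con); close(con)
pdata <- pData(bottomly.eset)            # phenotype data, including strain
edata <- as.matrix(exprs(bottomly.eset)) # count matrix, genes x samples

## Drop genes with very low counts
edata <- edata[rowMeans(edata) > 10, ]
```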

And so once we've filtered out those genes, we can just directly fit a Poisson model, say gene by gene. And the way we do that is with the same glm command that we used for our logistic regression model, but now instead of telling it family equals binomial, we're going to say family="poisson", okay? So now this will fit a Poisson regression model. The coefficient is on the log scale, so e raised to that power tells us something about the multiplicative difference between the two strains; on the log scale itself, it's basically the additive difference between the two strains.
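A single-gene Poisson fit, continuing with the edata and pdata objects from the sketch above:

```r
## Poisson regression for the first gene: counts ~ strain
glm3 <- glm(edata[1, ] ~ pdata$strain, family = "poisson")
tidy(glm3)
## exp(coefficient) is the multiplicative (fold) difference between strains
```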

And so you can actually fit a negative binomial model instead, because the Poisson model is a little bit more constrained in its mean-variance relationship. You can fit a single negative binomial model with the glm.nb function, which relates the expression data for the first gene to strain using a negative binomial model. That model is a little bit more flexible in how it models the variance. And so you see that you get relatively similar coefficient estimates, but the standard errors are actually a little bit bigger in the negative binomial model, because it's accounting for that additional variability. You can actually do many negative binomial regressions at once, using the DESeq2 package. And so one way that we do that is, we have to build a DESeqDataSet, and you can do that with this rather long-named function, DESeqDataSetFromMatrix; that's a mouthful. You pass it the expression data, the phenotype data, and then the model you want it to fit. In this case, we just want to fit an association with strain.
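The single-gene negative binomial fit and the DESeqDataSet construction might look like this, with the same variable names as before:

```r
## Single-gene negative binomial fit; compare with the Poisson fit above
glm.nb1 <- MASS::glm.nb(edata[1, ] ~ pdata$strain)
tidy(glm.nb1)   # similar estimate, but a larger standard error than Poisson

## Build a DESeqDataSet to fit all genes at once
de <- DESeqDataSetFromMatrix(countData = edata,
                             colData   = pdata,
                             design    = ~ strain)
```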

And then you can fit all the negative binomial models at once by running DESeq on that data set. So it's going to estimate those models, and it's going to estimate a smooth relationship between the mean and the variance, which it then accounts for when it's doing the analysis. And then if you want to get the results out, you can actually just use the results function.
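Fitting everything and pulling the per-gene statistics out, continuing the sketch:

```r
## Fit every negative binomial model, with a shared mean-variance trend
glm_all_nb <- DESeq(de)

## Per-gene results; the "stat" column is the statistic for the strain term
result_nb <- results(glm_all_nb)
hist(result_nb$stat, breaks = 100)
```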

And then we can look at the statistics resulting from all of those fits. So these are the statistics now. The stat column is for the variable that you tested; in this case strain was in the model matrix, so this is the statistic on the strain coefficient from all of those negative binomial regressions. So that's how you can do GLM model fitting, for both Poisson regression and logistic regression, in R.
