An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

93 ratings

Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Another type of outcome that's commonly observed in genomic data, particularly data from next-generation sequencing, is count outcome data. So the very common scenario is that you have a number of reads that overlap a particular region or variant, and you want to make a regression model for those counts. So here's an example. It's from gene expression data. So suppose, for example, that you wanted to count how many reads covered each gene. There are a number of different ways that you could count that. You could say that a read only counts if it falls entirely within that gene. There's a number of choices that you could actually make for that, but for now let's just assume that you made one of these choices and you've got a count for each gene, which you make by calculating the total number of reads that cover that gene. Now you have a count for each sample and for each gene of the total number of reads that cover that gene. So what you want to do is model that distribution, and so you might want to build a regression model, again based on the relationship between the phenotype that you care about and the counts for a particular gene.
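The counting choice described above can be sketched in a few lines of Python. This is only an illustration: the gene names, coordinates, and reads are made up, everything is assumed to sit on one chromosome with closed intervals, and `count_reads` is a hypothetical helper, not a function from any sequencing toolkit.

```python
def count_reads(genes, reads, require_containment=True):
    """Count reads per gene under one of two counting rules.

    genes: dict mapping gene name -> (start, end), closed interval
    reads: list of (start, end) closed intervals
    require_containment: if True, a read counts only when it falls
        entirely within the gene; if False, any overlap counts.
    """
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            if require_containment:
                hit = g_start <= r_start and r_end <= g_end
            else:
                hit = r_start <= g_end and g_start <= r_end
            if hit:
                counts[name] += 1
    return counts


# Hypothetical annotation and alignments for a single sample.
genes = {"geneA": (100, 500), "geneB": (800, 1200)}
reads = [(120, 170), (480, 530), (790, 840), (900, 950), (1000, 1050)]

contained = count_reads(genes, reads)
overlapping = count_reads(genes, reads, require_containment=False)
print(contained)
print(overlapping)
```

The two rules give different counts for the same reads, which is why the choice has to be made once and then held fixed across samples.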

The most commonly used distribution to model counts in statistics is the Poisson distribution. So one thing to keep in mind about the Poisson distribution is that the mean and the variance are the same. So for example, if you look at a Poisson distribution with a low mean, it also has a low variance. When you increase the mean you also increase the variability, and if you increase the mean even further you increase the variability even more. So this is a distribution that's very good for count data because it only takes non-negative values, and it has other properties that make it model idealized count data very well. But it's very restrictive in modeling the variance. So, again, you could fit a regression model, but here the regression model is going to be a little bit more complicated. So this is an example of a generalized linear model. Logistic regression is another example, where we take a function of the expected value of the thing that we care about. So here, we have the counts that we care about, conditional on, say, the group indicator. And so what we're going to do is model the expected value of the counts, given our indicator, as some function of the data. So here, usually what we do is use a link function when you're modeling counts, and the link function is often the log function. So you say the log of the expected count is going to be modeled as a function of some adjustment variable, which here is usually a normalization constant that models the total sequencing depth, plus the parameter that we care about. So again, this is another parameter that we're going to be looking at, that, on the log scale of the counts, models the relationship with the outcome variable that we care about. So this is another way of fitting a regression model, now to a set of count data.
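As a rough illustration of the log-link model described above, the sketch below fits log E[count] = log(depth) + b0 + b1 * group for a single gene using Newton-Raphson on the Poisson log-likelihood. The counts, group labels, and sequencing depths are invented, and `fit_poisson_glm` is a hypothetical helper written for this example rather than any library's API. In practice you would use an established GLM routine.

```python
import math


def fit_poisson_glm(counts, group, depth, n_iter=25):
    """Fit log E[count] = log(depth) + b0 + b1 * group by Newton-Raphson.

    log(depth) enters as a fixed offset (the normalization constant);
    b1 is the log-scale group effect we care about.
    """
    # Start from the overall rate so the first step is well behaved.
    b0 = math.log(sum(counts) / sum(depth))
    b1 = 0.0
    for _ in range(n_iter):
        mu = [d * math.exp(b0 + b1 * g) for d, g in zip(depth, group)]
        # Score: gradient of the Poisson log-likelihood.
        s0 = sum(y - m for y, m in zip(counts, mu))
        s1 = sum(g * (y - m) for y, m, g in zip(counts, mu, group))
        # Fisher information, a symmetric 2x2 matrix.
        i00 = sum(mu)
        i01 = sum(g * m for m, g in zip(mu, group))
        i11 = sum(g * g * m for m, g in zip(mu, group))
        det = i00 * i11 - i01 * i01
        # Newton step: solve (information) * delta = score.
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1


# Hypothetical data: one gene, six samples, two groups.
counts = [12, 18, 9, 40, 55, 35]
group = [0, 0, 0, 1, 1, 1]
depth = [1.0, 1.5, 0.8, 1.2, 1.6, 1.0]  # e.g. millions of reads

b0, b1 = fit_poisson_glm(counts, group, depth)
print("baseline rate per unit depth:", math.exp(b0))
print("fold change between groups:", math.exp(b1))
```

With a single binary covariate the fit has a simple check: exp(b0) should match total counts divided by total depth in group 0, and exp(b0 + b1) should match the same ratio in group 1.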

And so if you fit a model like this, you get a slightly better fit than you do using standard linear regression models on count data. So here is a set of data where you have the average on the x axis and the variance on the y axis. And so here you see the fit from the Poisson model in purple, and it turns out that you can do even a little bit better than that if you model the relationship directly between the mean and the variance. So remember that the Poisson model requires that the mean and the variance be the same. So you have a straight line in the relationship between mean and variance. But sometimes that's not exactly true for counts, and so you actually model the relationship between the mean and variance. So the two most popular techniques for modeling count data in Bioconductor are edgeR and DESeq, and both of those use a type of local or smoothed regression to estimate the relationship between mean and variance. You can then plug that into a more flexible model, the negative binomial model. So the negative binomial model allows you to model the mean and variance using separate parameters, that is, using a pair of parameters rather than just one parameter. And so while the Poisson distribution that I've shown here in black fixes the variability for a specific mean value, you could have that same mean value but also a large number of different variances using the negative binomial distribution. So that's a little bit more flexible.
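To make the extra flexibility concrete, the sketch below compares a Poisson and a negative binomial distribution with the same mean by summing over their probability mass functions. It assumes the common mean-dispersion parameterization, in which the negative binomial variance is mu + phi * mu^2. The values mu = 50 and phi = 0.2 are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math


def poisson_logpmf(k, mu):
    """Log-probability of count k under a Poisson with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def nb_logpmf(k, mu, phi):
    """Log-probability of count k under a negative binomial with
    mean mu and dispersion phi (size parameter r = 1 / phi)."""
    r = 1.0 / phi
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def mean_and_variance(logpmf, upper=2000):
    """Numerically sum the pmf over 0..upper-1 to get mean and variance.
    The truncation point is assumed to be far into the tail."""
    probs = [math.exp(logpmf(k)) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mu, phi = 50.0, 0.2  # illustrative values only

p_mean, p_var = mean_and_variance(lambda k: poisson_logpmf(k, mu))
nb_mean, nb_var = mean_and_variance(lambda k: nb_logpmf(k, mu, phi))

print("Poisson:           mean", p_mean, "variance", p_var)
print("Negative binomial: mean", nb_mean, "variance", nb_var)
```

Both distributions share the mean of 50, but under this parameterization the negative binomial variance is 50 + 0.2 * 50^2 = 550, while the Poisson variance is tied to 50. Changing phi moves the variance without touching the mean.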

So now you can model the counts as a negative binomial distribution, where the mean of that negative binomial distribution is equal to a sample-specific size factor, which relates to the number of reads that you've got, multiplied by a parameter proportional to the number of fragments, which you then model using that same sort of log link. You model the relationship between the thing that you care about, how many counts you get or how many fragments you have for that particular gene and that particular sample, as a linear function of the covariates that you care about. So again, it's writing it down as a linear model, but where the scale is slightly different. So this is an example of a generalized linear model, which you can learn a lot more about. For example, in this set of lecture notes, they go into a lot of detail about generalized linear models, in particular for Poisson regression. This is again a huge topic and we've only scratched the surface, but I wanted to show you an example of how you can relate count data to covariates that you care about using a regression model, even if it's not the standard linear regression model, by using this link function.
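The way the mean is assembled in this kind of model can be sketched as follows. The lecture does not say how the size factor is computed, so the median-of-ratios estimate used here is an assumption. The count matrix, covariates, and coefficients are invented, and both function names are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (an assumed choice of method).

    counts[g][j] is the read count for gene g in sample j. Each sample
    is compared with a per-gene geometric mean across samples, and the
    median ratio is taken as that sample's size factor.
    """
    n_samples = len(counts[0])
    # Keep only genes with no zero counts so every log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples
                     for row in usable]
    return [
        math.exp(median(math.log(row[j]) - lg
                        for row, lg in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]


def expected_count(size_factor, covariates, coefficients):
    """Mean of the count model: size factor times exp(linear predictor),
    i.e. a log link with the size factor acting as an offset."""
    log_q = sum(x * b for x, b in zip(covariates, coefficients))
    return size_factor * math.exp(log_q)


# Hypothetical counts: 4 genes x 2 samples; sample 2 was sequenced
# about twice as deeply as sample 1.
counts = [[10, 20], [30, 60], [5, 10], [0, 3]]
s = size_factors(counts)

# Hypothetical gene: intercept log(10), group effect log(3),
# evaluated for a sample in the second group.
mean_second = expected_count(s[1], [1.0, 1.0], [math.log(10), math.log(3)])
print("size factors:", s)
print("expected count:", mean_second)
```

The covariates enter only through the linear predictor on the log scale, so a coefficient of log(3) reads as a threefold change in expected count at a fixed size factor.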
