An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science

94 ratings


Course 7 of 8 in the Specialization Genomic Data Science


From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Linear regression and its variants are probably the most widely used modeling tools in genomics, so I'm going to do a quick lecture on how to do this in R. Again, I'm going to set up my parameters the way I like them for my plots when doing this analysis, and then I'm going to load the packages for this lecture; in particular, the broom package is the one we're going to take advantage of to make things pretty. Here I'm going to load the body map data set, which can be loaded from the recount URL I gave there, and then I'm going to define the data set

variables to be a little bit easier to see here. I'm going to extract the expression data, the phenotype data, and the feature data. The first thing I'm going to do is convert edata into a matrix, so that it's easier to deal with the values on a numeric scale. Then, to fit a linear model, you can just use the lm command. I can fit that gene by gene by taking the first gene of the expression data (remember the genes are in the rows) and relating it, using the tilde operator, to any variables I want. In this case I'm asking: what's the relationship between expression and age? That fits a linear model object; if you type lm1 you can see that it's been fit. If you do tidy(lm1), you get a tidied version of that fit, with the estimates, the standard errors, the statistics, and the p-values. The intercept estimate is the average count at age zero, about 2,000, and then for each additional year we go down by a count of 0.23; the remaining columns give the standard error, the statistic, and the p-value.
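The gene-by-gene fit described above can be sketched as follows. Since the body map ExpressionSet isn't reproducible here, this uses simulated counts; the object names (edata, age, lm1) follow the lecture, but the data are illustrative. Base R's summary() reports the same columns that broom's tidy() does.

```r
# Minimal sketch of fitting a linear model gene by gene (simulated data).
set.seed(1)
n <- 16                                                   # samples
edata <- matrix(rpois(10 * n, lambda = 2000), nrow = 10)  # genes in rows
age <- sample(30:70, n, replace = TRUE)

lm1 <- lm(edata[1, ] ~ age)       # relate the first gene's expression to age

# summary() gives the same information tidy() reports:
# estimate, standard error, statistic, p-value
coefs <- summary(lm1)$coefficients
coefs["(Intercept)", "Estimate"]  # expected count at age zero
coefs["age", "Estimate"]          # change in count per extra year of age
```

With the real data set, `edata[1, ]` would be the first row of the expression matrix extracted from the ExpressionSet, and `age` would come from the phenotype data.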

age versus the first gene, so you can see the data there. Then I might want to add the linear model fit to this plot. One way to do that is just abline(lm1), telling it what color you want the line to be and how wide; now the fitted line is drawn through the data.

So then the tricky thing is that's pretty easy to do when you have a quantitative covariate that you're modeling, but if you look at the gender covariate, that's a factor variable. Here there are about eight males and eight females, and if I want to look at the data set I can do a box plot of that data versus the gender of each sample, and then overlay the points on top of it. One thing you might want to do is fit a linear model relating this categorical variable to the outcome of expression. You can see that there's a relationship between gender and this gene's expression. It isn't particularly strong, given the groups aren't that far apart and there's quite a bit of variability, but how do we quantify it? One way is by defining dummy variables, which are basically variables that equal one if something is true. In this case, if the gender is equal to male, you have a dummy variable that's true when you're equal to male and false otherwise. If you multiply that by one in R, it'll give you ones and zeros. You can see there are some NA values here for the gender variable. You could similarly define it the opposite way: a dummy variable that's only true if the gender is equal to female. R actually does this for you directly. You can fit a linear model relating the first gene's expression to gender, and it automatically turns the factor into a dummy variable. If we tidy that up, we get a coefficient that's the change in counts if you're a male: the average change in count is minus 105 for a male compared to a female for that gene.
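The dummy-variable coding described above can be sketched like this, with made-up expression values and gender labels (including an NA, as in the lecture); the variable names are illustrative.

```r
# Sketch of dummy-variable coding for a factor (illustrative data).
gender <- factor(c("male", "male", "female", "female", NA, "male"))
dummy_m <- 1 * (gender == "male")    # 1 for male, 0 for female, NA stays NA
dummy_f <- 1 * (gender == "female")  # the opposite coding

y <- c(1800, 1900, 2000, 2050, 1700, 1850)   # made-up counts for one gene
lm2 <- lm(y ~ gender)                # R builds the dummy variable for you
                                     # (and drops the NA row by default)
coef(lm2)["gendermale"]              # change in counts, male vs. female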

So what's going on under the hood here? If you look at the model matrix function, that's actually what's happening: R takes that gender variable, which was just a factor, and turns it into ones and zeros, one when you're male and zero when you're female. Then it uses that quantitative variable to fit the actual regression model just like before. You can do this with even more categories. If you look at the tissue type variable, it has a large number of categories, so you can define multiple dummy variables, one for each tissue type. If you say, is the tissue type equal to adipose, that's true only for the first sample. You could also say, is the tissue type equal to adrenal; that's true only for the second sample, and so forth. So you could define a whole series of dummy variables, and R will actually do this for you. If I relate the first gene's expression to this factor variable with multiple levels, R will fit coefficients for every one of them. The baseline is the level that didn't get fit in the model, so you have to be a little bit clever here and figure out which tissue type doesn't appear in the list of coefficients. It's usually the first one, so adipose is missing, and the intercept is the average count in adipose. You can then interpret the adrenal estimate as the difference in average count between adipose and adrenal; if you add the two together, that's the average count for the adrenal samples. So you can fit this more complicated example. The other thing you can do is adjust for variables, as we talked about, and the way you do that is basically by expanding the model formula. So here we're going to add a model that has both an age term and a gender term.
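The multi-level factor coding can be sketched as follows; the tissue names match the lecture, but the data are simulated and the numbers are illustrative.

```r
# Sketch of model.matrix with a multi-level factor (illustrative data).
tissue <- factor(c("adipose", "adrenal", "brain",
                   "adipose", "adrenal", "brain"))
y <- c(10, 20, 30, 12, 22, 28)   # made-up counts for one gene

mm <- model.matrix(~ tissue)     # one dummy column per non-baseline level
colnames(mm)                     # "(Intercept)" "tissueadrenal" "tissuebrain"

lm3 <- lm(y ~ tissue)
coef(lm3)["(Intercept)"]     # average count in the baseline level (adipose)
coef(lm3)["tissueadrenal"]   # difference in average count, adrenal vs. adipose
# baseline mean + that difference = the average count for adrenal samples
```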

And so then, if I tidy up that data set, I can see that there are estimates here for age and for gender. For example, this estimate for age is the difference in counts for one extra year of age when you hold gender fixed, so either within the males or within the females.
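An adjusted model like the one described above can be sketched with simulated data; the coefficient names are what R produces, but the data and effect sizes are made up.

```r
# Sketch of adjusting for a second covariate (simulated data).
set.seed(2)
n <- 20
age <- sample(30:70, n, replace = TRUE)
gender <- factor(sample(c("male", "female"), n, replace = TRUE))
y <- 2000 - 0.5 * age + 50 * (gender == "male") + rnorm(n, sd = 10)

lm_adj <- lm(y ~ age + gender)
# The age coefficient is the change in counts per extra year of age,
# holding gender fixed (i.e., within males or within females).
summary(lm_adj)$coefficients
```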

You can then fit interaction models if you want to get a little bit more complicated, because the model above fits the same relationship for the males and the females. To do that in R, you basically use the multiplication operator in the formula: if you want to fit an interaction term, you write pdata age times pdata gender. If you tidy that model up, you can see there's now a term for age, a term for gender, and a term for age interacting with gender. What does this mean? It fits one line if you're a male: the age effect is basically minus 13 minus 18. But if you're a female, the interaction term goes to zero because the gender male term is zero, so the slope of the line with respect to age is minus 13 for females. You get two different lines with two different slopes by fitting an interaction term. These get a little tricky, so you've got to be careful when interpreting them. The next thing you can do is overlay these points again on top of the graph. Here I'm going to fit a linear model relating the sixth gene's data to age, and when I plot that gene I can see that there's an outlier there. So if I add the line, you can see that it's
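The interaction fit can be sketched like this, again with simulated data; the `*` in the formula expands to main effects plus the interaction, exactly as described.

```r
# Sketch of an interaction model: y ~ age * gender expands to
# age + gender + age:gender (simulated data).
set.seed(3)
n <- 30
age <- runif(n, 30, 70)
gender <- factor(rep(c("female", "male"), length.out = n))
y <- 2000 - 10 * age - 5 * age * (gender == "male") + rnorm(n)

lm_int <- lm(y ~ age * gender)
coef(lm_int)
# Slope for females: the age coefficient alone.
# Slope for males: age coefficient + age:gendermale coefficient,
# so the two groups get different lines with different slopes.
```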

flat, but you can also look at what happens when outliers sit in different places. In this case the outlier doesn't necessarily affect the line fit; it doesn't look like it's pulling it one direction or another. But we can check out another case where it might. To do that, instead of plotting that variable versus age, I'm going to change it so that I plot it versus the index of where it appears. When I plot the data on that scale, index versus value, I can see the outlier is right over there at the end. So now if I fit a relationship between index and the sixth gene's expression, I

can see that. So I've got this plot with the outlier at the end, and when I overlay the model fit,

I can see that it's being pulled up a little bit by this outlier. So instead of a flat line, there's a line that's sort of being pulled up by the outlier over there. If I remove that outlier, suppose I just take that one point out and refit the model: here I'm doing the same thing, fitting the same model, but I'm subtracting the 19th data point.

From both the predictor and the outcome. Then if I draw that same line, but now for the model fit without the outlier, you can see that it's flat. So basically that one outlier is what's driving the whole curve, and you have to be very careful when you see outliers in these data sets. I'm adding a legend here to show which curve is with the outlier and which is without. In general, outliers, especially near the ends of curves, can really change the values that you get. If I look at the residuals from these two, by making a side-by-side plot with par(mfrow = c(1, 2)), I can make histograms of the residuals for each of the two models, the one without the outlier and the one with it. You can see that the one with the outlier has a huge outlier residual. That's one way you can diagnose these problems from the residuals.
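The outlier-at-the-end effect can be sketched with simulated data; 19 points and index 19 mirror the lecture, but the values are made up.

```r
# Sketch of how a single outlier near the end can tilt the fit (simulated).
set.seed(4)
index <- 1:19
y <- rnorm(19)            # flat relationship with index
y[19] <- 50               # one big outlier at the end

lm_with <- lm(y ~ index)

y2 <- y[-19]              # drop the 19th point from both the outcome...
index2 <- index[-19]      # ...and the predictor, then refit
lm_without <- lm(y2 ~ index2)

coef(lm_with)[2]          # slope pulled upward by the outlier
coef(lm_without)[2]       # roughly flat without it

# Side-by-side residual histograms, as in the lecture:
par(mfrow = c(1, 2))
hist(residuals(lm_with), main = "With outlier")
hist(residuals(lm_without), main = "Without outlier")
```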

The other thing you can do is look at these residual distributions again: because these are counts, they aren't necessarily centered at zero, and they don't necessarily have the distribution we'd like. One thing we can do is apply the transformation we've been using: log2-transform the data and fit a linear regression model to the transformed data. If you then look at the residuals, compared to the residuals from one of the earlier examples, they actually look a little bit better.
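Fitting on the log scale can be sketched as below; the `+ 1` offset is a common convention to avoid taking the log of zero counts, and the data are simulated.

```r
# Sketch of fitting a linear model on log2-transformed counts (simulated).
set.seed(5)
counts <- rpois(20, lambda = 100)
age <- sample(30:70, 20, replace = TRUE)

lm_log <- lm(log2(counts + 1) ~ age)
hist(residuals(lm_log))   # residuals are centered at zero and
                          # tend to look more symmetric on this scale
```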

The other thing you can do is fit these models with as many coefficients as you want, and R will fail gracefully, but sometimes that's not necessarily a good thing. So here's an example where I can take that same

And then if I tidy up that model, you can see many of the coefficients are not defined, and the standard errors and statistics are not defined either, and that's because you have too many covariates here. Look at the dimension of the model matrix that we're fitting.

It's 16 by 18. So you're basically fitting too many parameters: there are only 16 data points, but you have 18 parameters you're trying to fit. R will fail gracefully, it will actually come up with estimates, but they're not very reliable given that we basically soaked up all the data just to estimate this linear regression model. So you have to be very careful when including covariates to make sure you haven't included a covariate with too many different levels.
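The over-parameterization described above can be sketched with simulated data; here the model matrix is 16 by 17 rather than the lecture's 16 by 18, but the failure mode is the same, and R returns NA for the coefficients it cannot estimate.

```r
# Sketch of over-parameterization: more coefficients than samples (simulated).
set.seed(6)
n <- 16
y <- rnorm(n)
tissue <- factor(paste0("t", 1:n))   # a factor with 16 levels
age <- runif(n, 20, 60)

lm_big <- lm(y ~ age + tissue)       # intercept + age + 15 dummies = 17 > 16
dim(model.matrix(~ age + tissue))    # 16 rows, 17 columns
coef(lm_big)                         # some coefficients come back as NA:
any(is.na(coef(lm_big)))             # R "fails gracefully" instead of erroring
```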

The last thing you can do here: if you fit a linear model relating a variable to age, you can then plot the residuals, and you can see those residuals colored by different variables. So here I'm going to do this.

I'll plot on a scale where you can see it. One diagnostic that people often use when doing these things is to fit a linear regression model, plot the residuals, and color the residuals, here by the tissue type. You can see, for example, that this tissue type has a residual that is very high. Plotting the residuals colored by different variables can be a very useful diagnostic: you can identify variables that seem to be associated with the residuals, and those might be variables you want to include in your model after all.
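This residual diagnostic can be sketched as follows, with simulated data in which a tissue effect is deliberately left out of the fitted model so that it shows up in the residuals.

```r
# Sketch of coloring residuals by a factor to spot missing covariates.
set.seed(7)
n <- 24
tissue <- factor(rep(c("adipose", "adrenal", "brain"), each = 8))
age <- runif(n, 20, 60)
# Build in a tissue effect that the age-only model ignores:
y <- 100 + 0.5 * age + 30 * (tissue == "brain") + rnorm(n)

res <- residuals(lm(y ~ age))        # fit age only, then inspect residuals
plot(res, col = as.numeric(tissue), pch = 19)
# The brain residuals sit systematically high, suggesting tissue
# is associated with the residuals and belongs in the model after all.
```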
