An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

The most common statistical modeling technique used across almost every area of

statistical genomics and genetics is linear models.

So, I'm going to talk a little bit about what linear models are, but

the basic idea is to fit a best line relating two variables.

So the most common thing that you would want to do when you're doing genomics is

to take some genomic measurement and associate it with some outcome or

some phenotype or some technical artifact.

So the idea is that you're trying to relate these two variables, and

suppose the data are written as Y and X,

then you can imagine writing a line as b0 + b1 times X and

you're trying to minimize the distance between the observed data and

the model line as a function of X.

In general, you can always fit a line through a set of data.

The question is whether that line is a good fit or not,

whether it was a good idea or not to do that linear regression.
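
To make the idea of a best line concrete, here is a minimal sketch that assumes "distance" means the sum of squared vertical distances (ordinary least squares). The function name `fit_line` and the toy numbers are illustrative choices, not something from the lecture.

```python
def fit_line(x, y):
    """Return (b0, b1) minimizing the sum of squared distances y - (b0 + b1*x)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: how x and y vary together, relative to how much x varies
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point of averages
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical toy data (not from the lecture)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

The function returns a line for any input with at least two distinct x values, which is exactly why the fit still has to be checked afterwards.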

So here's an example of a linear regression that's a really old idea.

So basically, what they did was they took a bunch of measurements on

the heights of parents and the heights of their children, and

they plotted those two things.

And then you can draw a line that describes the relationship

between the average height of the parents and the children and

see that there's a relationship between those two variables.

So, this is actually a really old idea but it still works pretty well.

It turns out that with these Victorian methods, by plotting the average height of

the parents versus the average height of the children,

you actually explain more variability than all of the genomic

variants that we've collected can explain.

So basically, linear regression is one of the most powerful techniques.

And so I'm going to show you a little example of this,

using that same data that was collected from that Galton example.

So here, I've plotted the distribution of children's heights.

And here, I've plotted the distribution of parents' height.

So again, for each pair of parents, we take the average of the mother's and

the father's heights to get the parent height.

So if I take that information and I want to relate those two things to each

other, there's a couple of steps that you could do.

First, you could just look at the children's height.

So suppose you wanted to get an estimate for the average children's height.

Well, you could do that by just minimizing the distance between each measurement and

some number.

Whatever that number is, if you minimize the squared distance,

it turns out that the best minimizer is just the average children's height.
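
As a small check of that claim, here is a sketch that searches a grid of candidate numbers for the one with the smallest total squared distance to the measurements. The heights are hypothetical values chosen for illustration, not data from the lecture, and "distance" is assumed to mean squared distance.

```python
# Hypothetical children's heights in inches (illustrative only)
heights = [61.5, 64.0, 66.5, 67.0, 68.5, 70.0, 72.5]

def sum_sq(c, values):
    """Total squared distance between a candidate number c and every value."""
    return sum((v - c) ** 2 for v in values)

mean_height = sum(heights) / len(heights)

# Try candidates from 60.00 to 75.00 in steps of 0.01
candidates = [60 + 0.01 * k for k in range(1501)]
best = min(candidates, key=lambda c: sum_sq(c, heights))

print(f"grid minimizer: {best:.2f}, average height: {mean_height:.2f}")
```

The grid minimizer lands on the grid point closest to the average, which is what the statement above predicts.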

So the other thing that you can do is you could plot the parents' height

versus the children's height, okay?

And so here, I've used a jitter, because some of the parents'

heights and children's heights are measured as exactly the same values, and

you want to be able to see all of the points here.
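
A minimal sketch of what jittering does, assuming a uniform random offset of up to 0.2 inches (that amount is an arbitrary choice) and a handful of hypothetical rounded heights. Only the plotted coordinates are shifted; the underlying data are left alone.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical heights in inches, rounded so that several points coincide
parent = [66.0, 66.0, 66.0, 68.0, 68.0, 70.0]
child = [65.0, 65.0, 67.0, 67.0, 67.0, 70.0]

def jitter(values, amount=0.2):
    """Shift each value by a small uniform random offset in [-amount, amount]."""
    return [v + rng.uniform(-amount, amount) for v in values]

# Jittered copies are for display only; any model is fit on the originals
parent_plot = jitter(parent)
child_plot = jitter(child)

distinct_before = len(set(zip(parent, child)))
distinct_after = len(set(zip(parent_plot, child_plot)))
print(distinct_before, "distinct points before jitter,", distinct_after, "after")
```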

So now if I want to find the relationship between these two things,

one way that I could do that is I could look at different subsets of the data.

So I could look at just this part of the parents' heights, so

parents' heights between 64 and 66, and I could say,

what's the average children's height for that range?

Then I could do the same for other values.

I could say, what's the average children's height for a parent height between 70 and

72 and so forth.
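
The subset approach can be sketched as follows. The data here are simulated stand-ins, not Galton's measurements: the intercept of 24, the slope of 0.65, and the noise levels are arbitrary assumptions used only to produce something height-like.

```python
import random

rng = random.Random(0)

# Synthetic parent and child heights in inches (assumed parameters, not real data)
parent = [rng.gauss(68.0, 1.8) for _ in range(2000)]
child = [24.0 + 0.65 * p + rng.gauss(0.0, 2.2) for p in parent]

def mean_child_in_range(lo, hi):
    """Average child height among families with lo <= parent height < hi."""
    selected = [c for p, c in zip(parent, child) if lo <= p < hi]
    return sum(selected) / len(selected)

mean_64_66 = mean_child_in_range(64, 66)
mean_70_72 = mean_child_in_range(70, 72)
print(f"parents 64-66: average child height {mean_64_66:.2f}")
print(f"parents 70-72: average child height {mean_70_72:.2f}")
```

Each range gives one local average; a fitted line summarizes all of those local averages with just two numbers.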

Another way that I could do that is I could try to fit a line through these

relationships.

So I fit the best fitting line.

Now by the best fitting line, again, I mean the line that

minimizes the distance between the observed data and the line that we fit.

So, here's an equation for a line.

If you take the children's height,

there's an intercept term plus a term that's related to the parents' height, okay?

And so this is the intercept and this is the slope.

We're back to your algebra days.

So if you make that plot here, again, you see that the line actually isn't perfect.

Not all of these dots fall exactly on the line, so

the line doesn't perfectly describe the data.

Another way to do this is you can just expand the equation a little bit.

We'll say that the children's height is equal to that equation from the line that

we had before plus some random noise.

That random noise is everything we didn't measure.

Some people think of it as sampling variability, that's one component of it.

But it's also any other sort of bias or

variation that you didn't measure in your data set.

So the line that we're going to fit is the one that minimizes the distance between C,

the children's height, and the equation from the line that we care about.

Since we don't know the random variation in general,

we're just going to minimize the distance between C and the line fit.

So if we do that, we get a line that fits like this, okay.
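
Here is a sketch of that idea on simulated data: children's height is generated as a line in parent height plus random noise, and the line is then estimated by least squares without looking at the noise. The "true" intercept and slope are assumptions made up for the simulation, not estimates from Galton's data.

```python
import random

rng = random.Random(0)
true_b0, true_b1 = 24.0, 0.65  # assumed values, used only to simulate data

parent = [rng.gauss(68.0, 1.8) for _ in range(2000)]
noise = [rng.gauss(0.0, 2.2) for _ in parent]  # everything we did not measure
child = [true_b0 + true_b1 * p + e for p, e in zip(parent, noise)]

# Least squares: choose b0 and b1 to minimize sum((C - (b0 + b1 * P)) ** 2)
n = len(parent)
p_mean = sum(parent) / n
c_mean = sum(child) / n
b1_hat = (sum((p - p_mean) * (c - c_mean) for p, c in zip(parent, child))
          / sum((p - p_mean) ** 2 for p in parent))
b0_hat = c_mean - b1_hat * p_mean

print(f"estimated intercept {b0_hat:.2f}, estimated slope {b1_hat:.2f}")
```

The estimates will not match the assumed values exactly, because the noise is never observed, but with this many points the slope should come out close.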

And so the first thing that you need to do is you need to ask yourself, okay,

we fit a line in this data but does it make any sense?

And one way that you can do that is by taking the residuals.

So what's that mean?

That means take this line and calculate the distance between every data point and

the actual line itself and make a plot of those.

So on average, those residuals are going to come out to be zero.

But there's going to be a distribution of them for

every different parent height.

So for parent heights between 64 and 66, this right here is the distribution of

the residuals; between 70 and 72, here's the distribution of residuals.

Ideally, you would like to see that there's a similar variability at

every different parent height.

And you would like to see no big outliers and you would like to see them centered,

kind of nicely around zero.

That means that the line is fitting pretty well.
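
A rough version of this residual check, again on simulated heights (the generating parameters are assumptions, not Galton's values): fit the line, compute the residuals, and compare their center and spread in two parent-height ranges.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic parent and child heights in inches (assumed parameters)
parent = [rng.gauss(68.0, 1.8) for _ in range(2000)]
child = [24.0 + 0.65 * p + rng.gauss(0.0, 2.2) for p in parent]

# Least squares fit of child height on parent height
p_mean = statistics.mean(parent)
c_mean = statistics.mean(child)
b1 = (sum((p - p_mean) * (c - c_mean) for p, c in zip(parent, child))
      / sum((p - p_mean) ** 2 for p in parent))
b0 = c_mean - b1 * p_mean

# Residual: observed child height minus the height the line predicts
residuals = [c - (b0 + b1 * p) for p, c in zip(parent, child)]

def residual_summary(lo, hi):
    """Mean and standard deviation of residuals with lo <= parent height < hi."""
    in_range = [r for p, r in zip(parent, residuals) if lo <= p < hi]
    return statistics.mean(in_range), statistics.stdev(in_range)

overall_mean = statistics.mean(residuals)
mean_lo, sd_lo = residual_summary(64, 66)
mean_hi, sd_hi = residual_summary(70, 72)
print(f"overall residual mean: {overall_mean:.6f}")
print(f"parents 64-66: mean {mean_lo:.2f}, sd {sd_lo:.2f}")
print(f"parents 70-72: mean {mean_hi:.2f}, sd {sd_hi:.2f}")
```

Because the simulated noise has the same spread at every parent height, the two ranges should show similar standard deviations and means near zero. On real data, a clear difference between ranges would be a warning sign.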

There's actually a whole set of residual diagnostics that you can do to

check to make sure that the line is fitting well.

But the things that you're definitely looking for are outliers,

distributions that are skewed, and any clusters of points that

sit, say, away from the line when you're looking at the residuals.

You can color these dots by a whole bunch of other

variables and see if that helps diagnose

why maybe the linear regression isn't working very well.

Keep in mind again that you can always fit a line but

the line doesn't always make sense.

Here again is Anscombe's quartet. All of these fitted lines are essentially the same

line, with the same parameters and significance and everything else.

So you get the same intercept and slope estimates, to about two decimal places.

But here for example, you see a curvilinear relationship.

Here you see a crazy outlier and again a crazy outlier right here.

So what you're looking for when you're fitting a line,

what you're expecting to see, is a scatter plot of points

that's a cloud of points like that.

If you see more specific relationships, you know that you have to do a more

specific model, a model that either accounts for quadratic variation or

a model that accounts for outliers,

to account for the fact that it's not really just a linear regression line that

you're supposed to be fitting there.
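
The point about Anscombe's quartet can be checked numerically. This sketch fits a least squares line to each of the four data sets, using the values as they are commonly reproduced from Anscombe's 1973 paper (they are not taken from the lecture slides). The helper `fit_line` is a name introduced here for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    return y_mean - slope * x_mean, slope

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by data sets I to III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(x, y) for name, (x, y) in quartet.items()}
for name, (b0, b1) in fits.items():
    print(f"{name}: intercept {b0:.2f}, slope {b1:.2f}")
```

All four fits agree to two decimal places, even though only the first data set looks like the cloud of points a straight line is meant for. The numbers alone cannot tell you that; the plot can.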

So this is actually a whole class.

I've done a lecture on it.

I'll do a couple more lectures, but

this is only a very quick overview of regression models.

If you take the Regression Models course in the Johns Hopkins Data Science

Specialization, you'll cover a whole bunch of diagnostics and ideas.

We've covered the basics here so you'll know what to fit,

but the diagnostics require a lot more intuition and thinking.

The basic thing to keep in mind, though, is does the line fit?

Does it make sense?

Not just does it fit statistically but does it make sense to fit a line?

And then there are great additional notes in this book here and

in the corresponding class on linear models and

the class on statistics for the life sciences on edX.
