An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 2

This week we will cover preprocessing, linear modeling, and batch effects.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

In a previous lecture, we talked about linear regression models

with continuous outcomes and continuous covariates, but in genomics,

you often have either non-continuous outcomes or non-continuous covariates.

In this lecture, we're going to be talking about the case where you have a continuous

outcome, but a non-continuous covariate, that is, a categorical covariate or

a factor-level covariate.

So here's a simple example.

Suppose you're trying to relate the levels of cholesterol as measured continuously,

to the genotype of a particular SNP, where you have either the homozygous major allele,

the heterozygote, or the homozygous minor allele.

So one way that you can model this is to try to model it as a continuous variable.

You could say the number of copies of the minor allele is zero for

the homozygous major allele, one for the heterozygote, and two for

the homozygous minor allele.

In that case, you can still fit a linear regression line, just like you did in

the previous lecture, but now you have a specific interpretation.

That is, the difference between the homozygous major allele and

the heterozygote is Beta, and the difference between the heterozygote and

the homozygous minor allele is also Beta, so

you force that difference to be exactly the same for those two cases.
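
To make the additive coding concrete, here is a minimal sketch in Python. The genotype labels (AA, Aa, aa), the cholesterol values, and the helper `fit_line` are all made up for illustration and are not from the lecture. It fits the line by ordinary least squares and checks that the fitted step between adjacent genotypes is the same Beta in both cases.

```python
# Hypothetical toy data (values invented for illustration only).
genotypes = ["AA", "AA", "Aa", "Aa", "aa", "aa"]
cholesterol = [180.0, 184.0, 195.0, 199.0, 212.0, 214.0]

# Additive coding: the covariate is the number of copies of the minor allele a.
minor_allele_count = {"AA": 0, "Aa": 1, "aa": 2}
x = [minor_allele_count[g] for g in genotypes]


def fit_line(x, y):
    """Ordinary least squares for y = beta0 + beta * x (closed form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    beta0 = y_bar - beta * x_bar
    return beta0, beta


beta0, beta = fit_line(x, cholesterol)

# All fitted values lie on one line, so each extra copy of a adds exactly beta.
fitted = {g: beta0 + beta * k for g, k in minor_allele_count.items()}
print("Aa minus AA:", fitted["Aa"] - fitted["AA"])
print("aa minus Aa:", fitted["aa"] - fitted["Aa"])
```

Note that the raw group means in this toy data do not step by equal amounts; the line imposes the equal step, which is the constraint described above.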

With a categorical variable, you could just code it differently, and

get a different linear regression model.

And this corresponds to basically fitting different means,

so here's a way that you can do that.

Suppose that you define a variable that's equal to

one if G is not equal to the homozygous major allele, and

equal to zero if it is equal to the homozygous major allele.

In that case, you have zero for homozygous major, and one if not.

And what you get is a model that looks like this, you're still fitting a linear

regression model, but now that's equivalent to fitting two different means.

The first mean is the mean for the homozygous major allele,

that's equal to Beta0 because the covariate is equal to zero.

If the covariate is equal to one, then you have a value of one here,

and the mean is equal to Beta0 plus Beta.

So Beta0 plus Beta is equal to the average value for

the heterozygote and the homozygous minor allele.
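
As a rough check of that interpretation, the sketch below codes the indicator covariate just described and compares the fitted coefficients with plain group means. The data, the genotype labels, and the helper names `fit_line` and `mean` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy data (values invented for illustration only).
genotypes = ["AA", "AA", "Aa", "Aa", "aa", "aa"]
cholesterol = [180.0, 184.0, 195.0, 199.0, 212.0, 214.0]

# Indicator coding: 0 for homozygous major (AA), 1 for any other genotype.
x = [0 if g == "AA" else 1 for g in genotypes]


def fit_line(x, y):
    """Ordinary least squares for y = beta0 + beta * x (closed form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta


def mean(values):
    return sum(values) / len(values)


beta0, beta = fit_line(x, cholesterol)

# Group means computed directly, without any regression.
mean_major = mean([c for c, xi in zip(cholesterol, x) if xi == 0])
mean_carriers = mean([c for c, xi in zip(cholesterol, x) if xi == 1])

print("Beta0        vs mean of AA:       ", beta0, mean_major)
print("Beta0 + Beta vs mean of Aa and aa:", beta0 + beta, mean_carriers)
```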

Similarly, you could fit a recessive model where you set the covariate here,

this should be G equal to aa (little a, little a), sorry for that typo.

And so what you can see is in that case, it's equal to one

when you have the homozygous minor allele, and it's equal to zero otherwise.

That's, again,

equivalent to just fitting two mean levels by fitting the regression model.

And so, you're fitting a linear regression, and you get a value of Beta0

when this covariate is zero, so that's the average value for the homozygous major allele

and the heterozygote, and you get a value of Beta0 plus Beta

when the covariate is equal to one, so that's the case for

the homozygous minor allele.
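
Written out, the recessive model just described can be sketched as follows. The slide itself is not part of this transcript, so the notation is assumed here: C_i is the cholesterol level for person i, G_i the genotype, X_i the covariate, e_i the error term, and AA the label for the homozygous major allele.

```latex
\[
X_i =
\begin{cases}
1 & \text{if } G_i = aa \\
0 & \text{if } G_i \in \{AA, Aa\}
\end{cases}
\qquad
C_i = \beta_0 + \beta X_i + e_i
\]
\[
\mathrm{E}[C_i \mid G_i \in \{AA, Aa\}] = \beta_0,
\qquad
\mathrm{E}[C_i \mid G_i = aa] = \beta_0 + \beta
\]
```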

So by just changing the covariate definition,

we actually have changed the regression model, and

changed it from one where you're fitting a line to one where you're fitting means.

So the other thing that you can do is go fully flexible

with the two-degrees-of-freedom model, where you fit two different covariates.

The first covariate is only equal to one when you have a heterozygote.

The other covariate is only equal to one when you have a homozygous minor allele.

And so what does that mean?

That means Beta0 now corresponds to the value when both of these covariates

are zero.

That's homozygous major allele.

So Beta0 is the mean for the homozygous major allele.

Beta0 + Beta Aa is equal to the average value for just the heterozygote.

And similarly, Beta0 + Beta aa

is equal to the average value for the homozygous minor allele.

So now you've fit a fully flexible model where you've actually fit three different

means by including two covariates.
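
The three-means claim can be checked with a small sketch. Everything here is an assumption made for illustration: the toy data, the genotype labels, and the helpers `solve_linear_system` and `fit_ols`, which solve the least squares normal equations in plain Python rather than calling a statistics library.

```python
# Hypothetical toy data (values invented for illustration only).
genotypes = ["AA", "AA", "Aa", "Aa", "aa", "aa"]
cholesterol = [180.0, 184.0, 195.0, 199.0, 212.0, 214.0]

# Design matrix rows: [intercept, 1 if heterozygote, 1 if homozygous minor].
design = [[1.0, float(g == "Aa"), float(g == "aa")] for g in genotypes]


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(design, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


def group_mean(label):
    values = [c for c, g in zip(cholesterol, genotypes) if g == label]
    return sum(values) / len(values)


beta0, beta_Aa, beta_aa = fit_ols(design, cholesterol)

print("AA:", beta0, group_mean("AA"))
print("Aa:", beta0 + beta_Aa, group_mean("Aa"))
print("aa:", beta0 + beta_aa, group_mean("aa"))
```

Each printed pair should agree up to floating-point rounding: with two indicator covariates the regression reproduces the three genotype means, with no constraint tying them together.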

So this is a way that you can use a linear regression model

with categorical covariates to just fit mean levels to the different values.

You just have to be careful about how you define your covariates, and

what that means for the interpretation of the Beta coefficients.

Again, there is a whole class on linear regression models, and

I encourage you to go check those things out.

The basic thing to keep in mind when dealing with categorical covariates

is how many levels you want to fit,

and how to fit those in such a way that it makes sense biologically,

based on the assumptions that you have.

And there are great additional notes, again, in the edX course on statistics for

the life sciences.
