Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Boot Camp,
lecture six, on likelihood. In this lecture, we're going to define
what a likelihood is, which is a mathematical construct that is used to
relate data to a population. We are going to talk about how we
interpret likelihoods, talk about plotting them.
And then talk about maximum likelihood, which is a way of using likelihoods to
create estimates. And then we'll talk about likelihood ratios and how to
interpret them. Likelihoods arise from a probability distribution, and a
probability distribution is what
we're going to use to connect our data to a population.
So the idea behind this, and a lot, though not all, of statistics follows this
rubric, is to assume that the data come from a family of distributions, and
that those distributions are indexed by an unknown parameter that represents a useful
summary of the distribution. To give you an example, imagine if you
assume that your data comes from a normal distribution, a so called Gaussian
distribution, so a bell shaped curve. To completely characterize a bell shaped
curve, all you need is its mean and its variance.
So the probability distribution, the Gaussian distribution or the bell shaped
curve has two unknown parameters, the mean and the variance.
And then the goal is to use the data to infer the mean and the variance.
And the idea is that the mean and variance from the Gaussian distribution are
unknown population parameters, because the Gaussian distribution is our model
for the population. And the data, the sample, is what we are going to use to
estimate those unknown parameters. The nice part about this approach, as
opposed to quite a few other directions in statistics, is that the sample mean
and the sample variance are not just summaries, they are estimators: with a
population model, you actually have estimands.
The sample mean is actually estimating something.
It's not just a statement about the data; it's an estimate of the population.
And that's what we're going to be talking about today: a particular way of
approaching estimation and summarizing the evidence in the data, when you
assume a probability distribution, using likelihood.
Likelihood is a mathematical function with a particular definition: it's just
the joint density of the data, evaluated as a function of the parameters with
the data fixed, and we'll go through an example.
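In symbols, and just as a notational sketch of that definition (the notation is mine, not necessarily the slide's): for data \(x\) with joint density or mass function \(f(x;\theta)\), the likelihood is
\[
\mathcal{L}(\theta ; x) = f(x ; \theta),
\]
read as a function of \(\theta\) with the observed \(x\) held fixed.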
Before we go through our example, I want to talk about what it is that
likelihoods are attempting to accomplish, and how we might interpret them.
So I'm going to put forward a particular theory of how likelihoods can be
interpreted and how they can be used, and I should stipulate that maybe not
everyone agrees with this theory. But the theory I'm going to put forward is
that ratios of likelihood values measure the relative evidence for one value of
an unknown parameter relative to another. So if you evaluate the likelihood
with the parameter set to a specific value, you get a number; evaluate it at a
different value, and you get a different number; and then you take the ratio of
the two. If that ratio is bigger than one, it's supporting the hypothesized
value of the parameter in the numerator. If it's less than one, it's supporting
the hypothesized value of the parameter in the denominator.
So this is a somewhat controversial interpretation of likelihoods, but it's
the one I'm going to put forward. The second point is similarly controversial,
though there is a mathematically correct proof that at least motivates it; it
doesn't actually prove it. The statement I'm making in point two is that, given
a statistical model, so given a probability model and observed data, there is a
theorem called the likelihood principle that says all of the relevant
information contained in the data regarding the unknown parameter is contained
in the likelihood. Now, the likelihood principle has a mathematically correct
proof, but not everyone agrees on its applicability and its interpretation.
Nonetheless, I'm going to put this forward as the way we're going to interpret
likelihoods in this class: once you collect the data, if you assume a
statistical model, then the likelihood is going to contain all of the relevant
information.
It's interesting that this point two has very far-reaching consequences for the
field of statistics if you believe it. Things like P values, much of hypothesis
testing, and other staples of statistics become questionable if you take point
two as being true. So, for today's lecture, we're going to take it as being
true, and we'll talk a little bit about some of the controversy associated with
it.
Probably of a much more practical nature is point three, which says something
we already know, but let's state it in terms of likelihoods. When we have a
bunch of independent data points Xi, the joint density is the product of the
individual densities. Equivalently, since the likelihood is nothing other than
the density evaluated as a function of the parameter, it's also true that the
likelihoods multiply: independence makes things multiply. It makes the joint
density multiply, and it makes the likelihood multiply. I summarize that here
in the statement that the likelihood of the parameter, given all of the Xs, is
simply the product of the individual likelihoods.
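To make that multiplication concrete, here's a minimal Python sketch; it's my illustration rather than anything from the slides, and it uses the Gaussian model mentioned earlier with some made-up observations.

```python
import numpy as np
from scipy.stats import norm

def likelihood(mu, sigma, data):
    """Likelihood of (mu, sigma) given independent Gaussian observations:
    the joint density, and hence the likelihood, is the product of the
    individual densities."""
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

data = [1.2, 0.7, 1.9]              # made-up data, purely for illustration
print(likelihood(1.0, 1.0, data))   # likelihood at mu = 1, sigma = 1
print(likelihood(5.0, 1.0, data))   # far smaller: mu = 5 is poorly supported
```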
The last point I'd like to make on this slide is that these interpretations of
likelihoods, especially points one and two, assume a statistical model. One
negative aspect of them is that you have to actually have the statistical model
specified correctly, and of course we don't ever really know the statistical
model. If we assume that our data is Gaussian, that's an assumption; it's not
generally something we know.
Maybe in some rare cases, like radioactive decay, there is some physical theory
that suggests the data are Poisson, for example. But in most cases, we don't
actually know that the statistical family is a correct representation of the
mechanism that would generate the data, if we were to draw from the population.
So I think the way in which people still rationalize using likelihood-based
inference in these cases is to say: well, given that we assume this statistical
model, then we will adhere to the use of the likelihood to summarize the
evidence in the data. Let's go through a specific example.
It's one of the more important examples, and it's very illustrative, so let's
do it. Consider flipping a coin, but let's say it's an oddly shaped coin, maybe
a little bent or something like that, so you don't actually know the
probability of a head. Let's label that probability of a head as theta. And
then recall that the mass function for an individual coin flip is theta to the
x, times one minus theta to the one minus x. Here theta has to be between zero
and one, and if X is zero it's a tail, and if X is one, it's a head.
So if we flip the coin and the result is a head, then the likelihood is simply the
mass function with the one plugged in, right?
So in this case we get theta to the one, times one minus theta to the one minus
one, which works out to be theta. So the likelihood function is the line theta,
where theta takes values between zero and one.
And if you accept the law of likelihood, the likelihood principle, and the
interpretation of likelihoods that I outlined on the previous page, then
consider two hypotheses: the hypothesis that the coin's true success
probability is 50%, 0.5, versus the hypothesis that the coin's true success
probability is 0.25. In light of the data, the one head that we flipped and
obtained, what is the relative evidence supporting the hypothesis that the coin
is fair, 0.5, over the hypothesis that the coin is unfair with the specific
success probability 0.25? We would take the likelihood ratio, which is 0.5
divided by 0.25, which works out to be two. So if you accept our interpretation
of likelihoods, this says there is twice as much evidence supporting the
hypothesis that theta equals 0.5 as the hypothesis that theta equals 0.25. That
is the idea behind using likelihoods for the analysis of data.
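As a quick numeric check of that ratio (my sketch, not part of the lecture):

```python
# Likelihood from one observed head is L(theta) = theta, so just evaluate it
# at the two hypothesized values and take the ratio.
likelihood_fair = 0.5     # L(0.5)
likelihood_unfair = 0.25  # L(0.25)

print(likelihood_fair / likelihood_unfair)  # 2.0: twice the evidence for theta = 0.5
```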
Now let's just extend this example.
So suppose we flip our coin from the previous example, but instead of flipping
it just once, we flip it four times and get the sequence one, zero, one, one.
I have kind of a funny notation here: I'm going to write script L for the
likelihood, and L is a function of theta, but it depends on the data that we
actually observe, one, zero, one, one. And we're assuming our coin flips are
independent, so what happens in a case like this? Well, you take the product.
So here I have the first coin flip, theta to the one, times one minus theta to
the one minus one. Here I have the second coin flip, theta to the zero, times
one minus theta to the one minus zero, and so on. So I take the product of all
of those, and you get theta cubed, times one minus theta raised to the first
power.
And that's the likelihood for this particular configuration of ones and zeroes
from four coin flips. Notice, however, the order of the 1s and the 0s: does it
matter? Regardless of the order, as long as we got three heads and one tail,
the likelihood was going to be equivalent. It was going to give you theta to
the three, times one minus theta to the one.
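Here's a small sketch of that calculation (mine, not the slide's), which also shows that reordering the flips leaves the likelihood unchanged:

```python
import numpy as np

def likelihood(theta, flips):
    # Product of theta**x * (1 - theta)**(1 - x) over independent coin flips.
    return np.prod([theta**x * (1.0 - theta)**(1 - x) for x in flips])

theta = 0.4                                # any value in (0, 1) works here
print(likelihood(theta, [1, 0, 1, 1]))     # the observed sequence
print(likelihood(theta, [1, 1, 1, 0]))     # same three heads, different order
print(theta**3 * (1.0 - theta))            # the closed form theta^3 * (1 - theta)
```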
So that is a property of likelihoods. It's illustrating that, if you have a
coin, the particular configuration of zeros and ones doesn't matter: all of the
relevant information about the parameter is contained in the fact that we got a
specific number of heads and a specific number of tails. It doesn't depend on
the order whatsoever. And in this case, because we know how many coin flips we
have, all we need to know is the specific number of heads. So instead of
writing the likelihood of theta depending on 1, 0, 1, 1, we might write it as
the likelihood of theta depending on getting one tail and three heads, because
it's the same thing; the order is irrelevant.
This, by the way, raises the idea of so-called sufficiency. In this case, the
number of heads in the total coin flips is sufficient for making inferences
about theta. You don't actually need to know the data; all you need to know is
the total number of heads and the number of coin flips. So that total number of
heads, conditioning on the fact that we know the total number of coin flips, is
called a sufficient statistic. It's saying that there's a reduction of the
data: to make inferences about the parameter, you only need to know a summary
of it, a function of it. And in this case the function we need to know is the
sum, the total number of heads.
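A tiny sketch of that reduction (again my illustration, not the lecture's code): the likelihood computed from the raw sequence agrees with the one computed from just the number of heads and the number of flips.

```python
def likelihood_from_flips(theta, flips):
    out = 1.0
    for x in flips:
        out *= theta**x * (1.0 - theta)**(1 - x)
    return out

def likelihood_from_summary(theta, heads, n):
    # Only the sufficient statistic (number of heads, number of flips) is needed.
    return theta**heads * (1.0 - theta)**(n - heads)

print(likelihood_from_flips(0.3, [1, 0, 1, 1]))    # 0.0189
print(likelihood_from_summary(0.3, heads=3, n=4))  # same value
```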
So let's do a likelihood calculation again: take the likelihood supposing the
coin is fair, that theta is 0.5, and divide it by the likelihood assuming the
coin is unfair, specifically with a 25 percent chance of heads, and we get a
ratio of 5.33. So, in other words, there's over five times as much evidence
supporting the hypothesis that theta is 0.5 over the hypothesis that theta is
0.25.
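A quick check of that number (my sketch):

```python
def likelihood(theta, heads=3, tails=1):
    return theta**heads * (1.0 - theta)**tails

print(round(likelihood(0.5) / likelihood(0.25), 2))  # 5.33
```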
Now, relative values of likelihoods measure evidence, and that's useful, but
we're not particularly interested in 0.25 specifically. I mean, 0.5 is kind of
interesting because it means the coin is fair, but for most of the other
points, we're not interested in 0.25 any more than we're interested in 0.24,
and so on. So we'd like a way to consider likelihood ratios for all values of
the parameter theta.
And this is just a likelihood plot, which simply plots theta against the
likelihood value. And remember that likelihoods are really interpreted in terms
of relative evidence. It's the fact that the ratio of the likelihood at 0.5 to
the likelihood at 0.25 is over five that says we have over five times as much
evidence. So constants actually don't matter: constants that don't depend on
theta don't matter in the likelihood, right? Because when you take the ratio,
if there's a constant that doesn't depend on theta in the numerator and the
denominator, it'll just cancel out. The likelihood and its interpretation
should be invariant to constants that are not a function of the parameter.
So because of that, the raw absolute value of the likelihood isn't altogether
that informative, so we need to pick a rule for normalizing it. Why don't we
just divide it by its maximum value, so that its height is one?
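Here's a minimal Python sketch of that kind of plot for the coin example (my illustration, not the lecture's code):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)
likelihood = theta**3 * (1 - theta)          # likelihood for 3 heads and 1 tail
normalized = likelihood / likelihood.max()   # divide by the maximum so the height is one

plt.plot(theta, normalized)
plt.xlabel("theta")
plt.ylabel("likelihood / maximum")
plt.show()
```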
And that seems to be a pretty reasonable rule, and it helps with
interpretation, I think. And again, I just want to reiterate this last point:
if you're going to buy into this likelihood paradigm of interpreting
likelihoods, everyone agrees that likelihoods measure relative evidence rather
than absolute evidence. So dividing the curve by its maximum value, or any
value, doesn't change its interpretation. It's actually an interesting
question, I might add, whether someone could create an absolute measure of
evidence in statistics; I'm not aware of any. So we'll have to stick to
relative measures for right now.