0:00

We begin by discussing this simplest learning problem, which is that of

estimating a set of parameters. And the key building block, for parameter

estimation is what's called maximum likelihood estimation, so we're first

going to define that and then build on that in the context of more interesting

problems. So, let's go back to the simplest

possible learning problem where we are generating data from a biased coin which

is a Bernoulli Distribution and the probability that x takes the value one is

some, defined by some parameter data, and the probability that it takes the value

zero is therefore one minus data. And let's imagine that we're given access

to a data set D, which is a bunch of instances x1 up to xM that are sampled

IID from the distribution P. Let's remind ourselves what it means for,

what the IID means. So first, the first I means that the tosses are independent of

each other. And the ID means that they're identically

distributed which means that they're all sampled from P.

And our goal is to take D and reconstruct data.

So, let's think about this as a probablistic graphical model, in fact,

it's probablistic graphical model that we've seen before when we first talked

about Crate models. So here, we have a parameter, theta, and

a bunch of, replicates of the same experiment, which are these x's over

here. Which our sample from theta

independently. And we can see that if you think about

this as a graphical model as written over here, we can see that each xi depends on

theta. And that they're conditionally

independent of each other once theta is known.

So, they're independent and identically distributed, given theta.

And if you think about this as a graphical model, well, it better have

CPDs, and so the CPD for the M toss, given theta is the part, is that the x

takes the value x1 with probability, theta,

and the probability, the value x0, with probability one minus theta,

and that's a completely legitimate CPD. It's a CPD whose parent is the value

theta, which is the parent of the x variable.

It's a continuous variable, but that's, we've done those before.

2:43

So, now that we've specified this as a probablistic graphical model, we can go

back and think about how maximum likelihood estimation would work. And so,

the goal is to find the, parameter of state L, which is in interval zero, one

that predicts the D well. So, the quality of our, of a particular

parameter theta is how well it predicts D,

which can also be viewed as how plausible is D given a particular value of theta.

So, let's try and understand what that means.

We can ask, what is the probability on the D that we saw, given a particular

value of theta? And since the value, the, the tosses, or

the instantiations xi are conditionally independent given given theta, we can

write this as the product over all of the m instances of the probability of the nth

instance given theta. And so now, let's think about what that

is in the context of a concrete example. Let's assume that we have tossed this

coin five times and we have three heads and two tails or three ones and two

zeros. So, if we actually write down what this

probability function looks like, we can see that that's going to be

4:13

the probability of getting head given theta times the probability of getting

tails, given theta times the probability of the second tail and the probability of

heads times another probability of heads. And the probability of the first head

given theta is theta one minus theta for the tail, one minus theta, theta, theta.

Or theta to the power of three, one minus theta to the power of two and that is

exactly the function that is drawn over here.

Now, if we're looking for the theta that predicts D, well, we just defined that as

the theta that maximizes this function, if you draw a line from this maximum down

to the bottom, you can see that this function is maximized at the point near

point six, which, not surprisingly, is the same as the three heads that we saw

over the five total pauses. But generalizing that, let's assume that

we have observed in this context, MH heads and MT tails, and we want to find

the theta that maximizes the likelihood and just as, in the simple analysis, in

the simple example that we had, this likelihood function is going to have

theta appearing MH times and one minus theta appearing MT times.

And so, that's going give us a likelihood function that looks just like the

likelihood function that we saw on the previous line.

And if we think about how we can go about maximizing such a function, then usually

we take the following steps. First, it's convenient to think about not

the likelihood but rather what's called the log likelihood, denoted by a small l,

which is simply the log of this expectant over here and that has the benefits of

turning a product into a summation. And now that we, so that gives us a simpler

optimization objective but it's the one that has the exact same maximum.

And we can now go ahead and do the standard way of maximizing a function

like this, which is differentiating the log-likelihood and solving for theta.

And that's going to give us an optimum which is exactly as we would expect, that

is, it's the fraction of heads among the total number of coin tosses and that's

the maximum of this log-likelihood function and therefore, the likelihood is

well. Now, an important notion in the context

of maximum likelihood estimation is also important in, in when we develop it

further is the notion of a sufficient statistic.

So, when we computed theta in the coin toss example, we defined the likelihood

function as an expression that has this form.

And notice this expression didn't care about the order in which the heads and

the tails came up. It only cared about the number of heads

and the number of tails. And that was sufficient in order to

defined a likelihood function, therefore, sufficient for maximizing the likelihood

function. And so, in this case, MH and MT are what

known, are what's known as sufficient statistics for this particular estimation

problem because they suffice in order to understand the likelihood function and

therefore to optimize it. So, why generally a function of the data

set is the sufficient statistic? if it's a function from instances to a

8:02

vector in Rk if, it has the following property, if this function s of D, has

the following property. For any two data sets, D and D prime, and any possible

parameters theta, we have that the following is true.

If s of D is equal to s of D prime, then the likelihood function for those

two data sets is the same. So, what's the s of D?

s of D is the sum of the sufficient statistics for all of the instances in D.

So, let's make this a little bit more crisp.

What we're trying to do is we're trying to look at a bunch of data sets,

for example, all sequences of coin tosses.

And we're trying to summarize. data sets using a smaller, more compact

notion, which is statistics that throw out aspects of the data that are not

necessary for for reconstructing its likelihood function.

So, let's look at the multinomial example, which is the generalization of

the Bernoulli example that we had before. So, assume that what we have is a set of

measurements of a variable x, that takes on k possible values.

Then, in this case, the sufficient statisticsit, just like before, whereas,

we had the same number of heads and the same number of tails.

Here, we have the number of values, the number of times each of the k values came

up. So, we have M1, M2, up to Mk.

So, for example, if you're looking for a sufficient, for sufficient statistics for

a six-sided die · , you're going to have M1 up to M6 representing the number of

times that the die came up one up to the, and number of times it came up two,

three, four, five, and six. And so, what is the sufficient statistic

function in this case? It's remember, it's a two-fold of

dimension, of dimension k, which are the different numbers of values, which are

the number of different values of the variable.

And the sufficient statistics for the value xi is one where we have a one only

10:34

in the ith position, and zero everywhere else.

So, if we sum up s of xM over all instances in their data set, we're going

to get a vector, where, you only get a one contribution when the instance, when

the M, when the Mth data instance comes up that particular value, and so that's

going to be M1, M2, Mk. And this is sufficient statistic because

the likelihood function then can be reconstructed as a product of theta i,

Mi, where this theta i here is the parameter for x equals little xi.

11:30

Let's look at a different example. Let's look at the sufficient statistic

for a Gaussian distribution. So, as a reminder, this is a

one-dimensional Gaussian distribution that has two parameters, mu, which is the

mean, and sigma squared, which is the variance.

And that can be written as, in the following form which is one that you've

seen before. And we can rewrite the exponent in in the

following way, we can basically blow out the quadratic term in the exponent and

you end up with the likelihood function that has -x squared times a term plus x

times the term minus a constant term. And the sufficient statistics for

Gaussian can now be seen to be x squared, x and one.

Because when we multiply P of X for multiple occurrences of X, we're going to

end up adding up the x squared for the different, for the different data cases,

adding up the x's for the different data cases and then this is just going to be

the number of data cases. So, the s of data set P is going to be

the sum over M, X of M squared, the sum over M with M and this [UNKNOWN].

And from that, we can reconstruct the likelihood function.

13:13

How do we now perform maximum likelihood estimation? So, as we talked about, we

want to choose theta so as to maximize the likelihood function and if we just go

ahead and optimize the functions that we've seen on previous slide for

multinomial, that maximum likelihood estimation turns out to be simply the

fraction. So, for the value xi, it's going to be

the fraction of xi in theta, which again, is a perfectly, very natural

estimation to use. for a Gaussian, we end up with the

following as the maximum likelihood estimate.

The mean is the empirical mean. in the distribution.

That is the average over all of the data cases and the standard deviation is the

empirical standard deviation. So to summarize, maximum likelihood

estimation is a very simple principle for selecting among a set of parameters given

data set D. We can compute that maximum likely

destination by summarizing a data set in terms of sufficient statistics, which are

typically considerably more concise than the original data set D.

And so, that provides us with a computationally efficient way of

summarizing a data set so as to the estimation.

And it turns out that for many parametric distributions that we care about, the

maximum likehood estimation has an easy to compute closed form solution given the

sufficient statistics.