0:00

We begin by discussing the simplest learning problem, which is that of estimating a set of parameters. The key building block for parameter estimation is what's called maximum likelihood estimation, so we're first going to define that and then build on it in the context of more interesting problems. So, let's go back to the simplest possible learning problem, where we are generating data from a biased coin, which is a Bernoulli distribution: the probability that X takes the value one is defined by some parameter theta, and the probability that it takes the value zero is therefore one minus theta. And let's imagine that we're given access to a data set D, which is a bunch of instances x[1] up to x[M] that are sampled IID from the distribution P. Let's remind ourselves what IID means: the first I means that the tosses are independent of each other, and the ID means that they're identically distributed, which means that they're all sampled from P. And our goal is to take D and reconstruct theta.

So, let's think about this as a probabilistic graphical model; in fact, it's a probabilistic graphical model that we've seen before, when we first talked about plate models. Here, we have a parameter theta and a bunch of replicates of the same experiment, which are these X's over here, which are sampled from theta independently. And if we think about this as a graphical model, as written over here, we can see that each x[i] depends on theta, and that they're conditionally independent of each other once theta is known. So, they're independent and identically distributed given theta. And if you think about this as a graphical model, well, it had better have CPDs, and so the CPD for the m-th toss, given theta, is that x takes the value x1 with probability theta and the value x0 with probability one minus theta. And that's a completely legitimate CPD: it's a CPD whose parent is the variable theta, which is the parent of the x variable. Theta is a continuous variable, but we've done those before.
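As a small illustration of this generative setup, here is a sketch (not from the lecture; the function name and seed are illustrative choices) of sampling an IID data set D from a Bernoulli distribution with a hidden parameter theta:

```python
import random

def sample_dataset(theta, M, seed=0):
    """Draw M IID Bernoulli(theta) tosses: 1 with probability theta, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(M)]

D = sample_dataset(theta=0.6, M=1000)
print(sum(D) / len(D))  # empirical frequency of ones, close to the true theta 0.6
```

The learning problem is exactly the reverse of this sketch: given only D, recover theta.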

ย 2:43

So, now that we've specified this as a probabilistic graphical model, we can go back and think about how maximum likelihood estimation would work. The goal is to find the parameter theta, which is in the interval [0, 1], that predicts D well. So, the quality of a particular parameter theta is how well it predicts D, which can also be viewed as how plausible D is given a particular value of theta. So, let's try and understand what that means. We can ask: what is the probability of the D that we saw, given a particular value of theta? And since the tosses, or the instantiations x[i], are conditionally independent given theta, we can write this as the product over all of the M instances of the probability of the m-th instance given theta. And so now, let's think about what that is in the context of a concrete example. Let's assume that we have tossed this coin five times and we got three heads and two tails, or three ones and two zeros. So, if we actually write down what this probability function looks like, we can see that that's going to be

ย 4:13

the probability of getting heads given theta, times the probability of getting tails given theta, times the probability of the second tail, times the probability of heads, times another probability of heads. And the probability of the first head given theta is theta; one minus theta for the tail; then one minus theta, theta, theta. Or, theta to the power of three times one minus theta to the power of two, and that is exactly the function that is drawn over here. Now, if we're looking for the theta that predicts D well, we just defined that as the theta that maximizes this function. If you draw a line from this maximum down to the bottom, you can see that this function is maximized at the point near 0.6, which, not surprisingly, is the same as the three heads that we saw over the five total tosses. But generalizing that, let's assume that we have observed, in this context, MH heads and MT tails, and we want to find the theta that maximizes the likelihood. Just as in the simple example that we had, this likelihood function is going to have theta appearing MH times and one minus theta appearing MT times. And so, that's going to give us a likelihood function that looks just like the likelihood function that we saw on the previous slide.
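To make this concrete, here is a small sketch (not from the lecture; names are illustrative) that evaluates the likelihood theta^MH * (1 - theta)^MT on a grid and locates its maximum numerically; a grid search stands in for the calculus:

```python
def likelihood(theta, m_h, m_t):
    """L(theta : D) = theta^{M_H} * (1 - theta)^{M_T}."""
    return theta ** m_h * (1.0 - theta) ** m_t

# Three heads and two tails, as in the example above.
m_h, m_t = 3, 2
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda t: likelihood(t, m_h, m_t))
print(best)  # 0.6, i.e. M_H / (M_H + M_T)
```

The numerical maximizer lands exactly on the fraction of heads, which is the closed-form answer derived next.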

And if we think about how we can go about maximizing such a function, we usually take the following steps. First, it's convenient to work not with the likelihood but rather with what's called the log-likelihood, denoted by a lowercase l, which is simply the log of this expression over here; that has the benefit of turning a product into a summation. So that gives us a simpler optimization objective, but one that has the exact same maximum. And we can now go ahead and apply the standard way of maximizing a function like this, which is differentiating the log-likelihood and solving for theta. And that's going to give us an optimum which is exactly as we would expect: it's the fraction of heads among the total number of coin tosses, and that's the maximum of this log-likelihood function, and therefore of the likelihood as well.

Now, an important notion in the context of maximum likelihood estimation, one that is also important when we develop it further, is the notion of a sufficient statistic.

So, when we computed theta in the coin toss example, we defined the likelihood function as an expression that has this form. And notice that this expression didn't care about the order in which the heads and the tails came up; it only cared about the number of heads and the number of tails. And that was sufficient in order to define the likelihood function, and therefore sufficient for maximizing the likelihood function. And so, in this case, MH and MT are what's known as sufficient statistics for this particular estimation problem, because they suffice in order to reconstruct the likelihood function and therefore to optimize it. So, more generally, when is a function of the data set a sufficient statistic? It is if it's a function from instances to a

ย 8:02

vector in R^k with the following property: for any two data sets D and D prime and any possible parameter theta, if s of D is equal to s of D prime, then the likelihood functions for those two data sets are the same. So, what is s of D? s of D is the sum of the per-instance sufficient statistics over all of the instances in D.

So, let's make this a little bit more crisp. What we're trying to do is look at a bunch of data sets, for example, all sequences of coin tosses, and summarize those data sets using a smaller, more compact notion: statistics that throw out aspects of the data that are not necessary for reconstructing the likelihood function.
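As a quick sanity check of this definition for the coin case, here is a sketch (names illustrative): two data sets with the same counts but different orderings have identical sufficient statistics, and hence identical likelihoods for every theta.

```python
def suff_stats(D):
    """s(D) = (M_H, M_T): counts of ones and zeros, summed over instances."""
    m_h = sum(D)
    return (m_h, len(D) - m_h)

def likelihood(theta, D):
    """The likelihood depends on D only through its sufficient statistics."""
    m_h, m_t = suff_stats(D)
    return theta ** m_h * (1.0 - theta) ** m_t

D1 = [1, 1, 1, 0, 0]   # H H H T T
D2 = [0, 1, 0, 1, 1]   # T H T H H -- same counts, different order
assert suff_stats(D1) == suff_stats(D2)
for theta in (0.2, 0.5, 0.9):
    assert likelihood(theta, D1) == likelihood(theta, D2)
```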

So, let's look at the multinomial example, which is the generalization of the Bernoulli example that we had before. Assume that what we have is a set of measurements of a variable X that takes on k possible values. Then, in this case, the sufficient statistics, just as before where we had the number of heads and the number of tails, are the number of times each of the k values came up. So, we have M1, M2, up to Mk. For example, if you're looking for sufficient statistics for a six-sided die, you're going to have M1 up to M6, representing the number of times that the die came up one, two, three, four, five, and six. And so, what is the sufficient statistic function in this case? Remember, it's a tuple of dimension k, where k is the number of different values of the variable. And the sufficient statistic for the value xi is a vector where we have a one only

ย 10:34

in the ith position and a zero everywhere else. So, if we sum up s of x[m] over all instances in the data set, we're going to get a vector where you only get a contribution of one when the m-th data instance comes up with that particular value, and so that's going to be M1, M2, up to Mk. And this is a sufficient statistic, because the likelihood function can then be reconstructed as a product of theta i to the power Mi, where this theta i here is the parameter for X equals little xi.
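A sketch of this construction (not from the lecture; names illustrative, and values are 0-indexed for convenience): each instance maps to a one-hot vector, the vectors sum to the counts M1 through Mk, and the likelihood is rebuilt as the product of theta_i to the Mi.

```python
def one_hot(x, k):
    """s(x): a length-k vector with a one in position x and zeros elsewhere."""
    v = [0] * k
    v[x] = 1
    return v

def suff_stats(D, k):
    """Sum the per-instance one-hot vectors to get the counts (M_1, ..., M_k)."""
    counts = [0] * k
    for x in D:
        counts[x] += 1
    return counts

def likelihood(thetas, counts):
    """L(theta : D) = prod_i theta_i^{M_i}."""
    result = 1.0
    for theta_i, m_i in zip(thetas, counts):
        result *= theta_i ** m_i
    return result

D = [0, 2, 2, 1, 0, 2]   # six rolls of a 3-sided "die"
counts = suff_stats(D, k=3)
print(counts)            # [2, 1, 3]
```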

ย 11:30

Let's look at a different example: the sufficient statistics for a Gaussian distribution. As a reminder, this is a one-dimensional Gaussian distribution that has two parameters: mu, which is the mean, and sigma squared, which is the variance. And that can be written in the following form, which is one that you've seen before. We can rewrite the exponent in the following way: we basically blow out the quadratic term in the exponent, and you end up with a likelihood function that has minus x squared times one term, plus x times another term, minus a constant term. And the sufficient statistics for a Gaussian can now be seen to be x squared, x, and one. Because when we multiply P of x for multiple occurrences of x, we're going to end up adding up the x squared terms for the different data cases, adding up the x's for the different data cases, and the last one is just going to be the number of data cases. So, s of the data set D is going to be the sum over m of x[m] squared, the sum over m of x[m], and M itself. And from that, we can reconstruct the likelihood function.
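A sketch of these Gaussian sufficient statistics as a single pass over the data (names illustrative):

```python
def gaussian_suff_stats(D):
    """s(D) = (sum of x_m^2, sum of x_m, M): all the Gaussian likelihood needs."""
    return (sum(x * x for x in D), sum(D), len(D))

D = [1.0, 2.0, 4.0]
print(gaussian_suff_stats(D))  # (21.0, 7.0, 3)
```

Note that the data set is summarized by three numbers regardless of how many instances it contains.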

ย 13:13

How do we now perform maximum likelihood estimation? As we talked about, we want to choose theta so as to maximize the likelihood function. And if we just go ahead and optimize the functions that we've seen on the previous slides, then for a multinomial, the maximum likelihood estimate turns out to be simply a fraction: for the value xi, it's going to be the fraction of xi in D, which again is a perfectly natural estimate to use. For a Gaussian, we end up with the following as the maximum likelihood estimate: the mean is the empirical mean, that is, the average over all of the data cases, and the standard deviation is the empirical standard deviation.

So, to summarize, maximum likelihood estimation is a very simple principle for selecting among a set of parameters given a data set D. We can compute that maximum likelihood estimate by summarizing the data set in terms of sufficient statistics, which are typically considerably more concise than the original data set D. And so, that provides us with a computationally efficient way of summarizing a data set so as to perform the estimation. And it turns out that for many parametric distributions that we care about, the maximum likelihood estimate has an easy-to-compute closed-form solution given the sufficient statistics.
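As a closing sketch (not code from the lecture; names illustrative), both closed forms computed directly from the sufficient statistics. Note that the Gaussian variance here is the maximum likelihood estimate, dividing by M rather than M - 1:

```python
import math

def multinomial_mle(counts):
    """theta_i = M_i / M: the fraction of times each value came up."""
    m = sum(counts)
    return [m_i / m for m_i in counts]

def gaussian_mle(D):
    """mu = empirical mean; sigma = empirical standard deviation (MLE, /M)."""
    sum_sq, s, m = sum(x * x for x in D), sum(D), len(D)
    mu = s / m
    sigma = math.sqrt(sum_sq / m - mu ** 2)
    return mu, sigma

print(multinomial_mle([2, 1, 3]))     # thetas 1/3, 1/6, 1/2
print(gaussian_mle([1.0, 2.0, 4.0]))  # mean 7/3, sigma about 1.247
```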
