0:47

Now, if we're trying to make a prediction over the value of a variable X that depends on the parameter theta, this is just an inference problem. The probability of x is simply the probability of x given theta, times the prior over theta, marginalizing out theta. Marginalizing, in this case, corresponds to an integration over the values of theta in this interval over here.

Â So, if we plug through the integral, what we're going to get is the following form.

Â And I'm not going to go through the integration by parts that's required to

Â actually show this. But it's really a, a straight-forward

Â consequence of the properties of integrals of polynomials.

Â in this case we have that the probability that x takes the particular value xi

Â is one is 1 / Z times the integral over all of the parameters theta of theta i

Â which is the, probability given the parameterization theta, that x takes the

Â value of little xi times this thing over here, which is the prior.

Â And we multiply the two together, integrate out over the parameter vector,

Â theta which, in this case, is a k dimensional parameter vector.

Â And it turns out that when one does that, you end up with alpha i over the sum of

Â all J's alpha J, a quantity typically known as alpha.

Â And so we end up with a case where, the prediction over the next instance

Â represents the fraction of the instances that we've seen, as represented in the

Â hyperperimeter of Dirichlet where we have x, little xi.

Â So if alpha i represents the number of instances that we've seen where x eq-.

Â Where the variable took the value, little xi.

Â The, prediction very naturally represents.

Â It, it is simply the fraction of the instances with that property.

Â And so once again, we see that there is a natural intuition for the hyper

Â parameters as representing the motion of counts.
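As a small sketch of this result (function name mine), the predictive probability under a Dirichlet prior is just each hyperparameter divided by their sum:

```python
def dirichlet_predictive(alphas):
    # P(X = x_i) = alpha_i / alpha, where alpha = sum_j alpha_j
    # is the sum of the Dirichlet hyperparameters.
    total = sum(alphas)
    return [a / total for a in alphas]

# Example: a Dirichlet(2, 1, 1) prior over a three-valued variable.
print(dirichlet_predictive([2.0, 1.0, 1.0]))  # [0.5, 0.25, 0.25]
```

This makes the "hyperparameters as counts" intuition concrete: outcome 1 has been "seen" twice among four imaginary samples, so its predicted probability is 1/2.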

Â 3:25

Now, let's put these two results together and think about Bayesian prediction as a function of the number of data instances that we have, as it grows. So here we have a parameter theta, which initially was distributed as a Dirichlet with some set of hyperparameters. And let's imagine that we've seen M data instances, x[1] up to x[M], and now we have the M+1st data instance and want to make a prediction about it. So the problem that we're trying to solve is the probability of the M+1st data instance, given the M instances that we've seen previously. And we can once again plug that into a probabilistic inference equation: this is going to be the probability of the M+1st data instance given everything, including theta, times the probability of theta given x[1] up to x[M]. So we've introduced the variable theta into this probability, and we're marginalizing out over the variable theta. Well, one thing that immediately follows from the structure of the probabilistic graphical model here is that x[M+1] is conditionally independent of all of these previous x's given theta, and so we can cancel those from the right-hand side of the conditioning bar. Which gives us, over here, the probability of x[M+1] given theta, and over here, the probability of theta given x[1] up to x[M].

Â 5:06

And so now let's think about the blue expression over here, which is just the posterior over theta given D, which is x[1] up to x[M]. And we've already seen what that looks like: as we showed just on the previous slide, it's simply a Dirichlet whose hyperparameters are alpha_1 + M_1 up to alpha_k + M_k. And so now we're making a prediction of a single random variable from a Dirichlet that has a certain set of hyperparameters, and that was the thing we showed on the slide just before that. Which is simply the hyperparameter corresponding to the outcome x_i, as a fraction of the sum of all of the hyperparameters. Where, again, just to introduce notation, alpha is equal to the sum of the alpha_i's, and M is the sum of the M_i's.
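Putting the two slides together (with my own illustrative names), the posterior update and the prediction collapse into a single expression, (alpha_i + M_i) / (alpha + M):

```python
def bayesian_predictive(alphas, counts):
    # The posterior is Dirichlet(alpha_1 + M_1, ..., alpha_k + M_k),
    # so P(x[M+1] = x_i | x[1..M]) = (alpha_i + M_i) / (alpha + M).
    alpha, m = sum(alphas), sum(counts)
    return [(a + c) / (alpha + m) for a, c in zip(alphas, counts)]

# Dirichlet(1, 1) prior; data with four ones and one zero,
# counts ordered as [ones, zeros].
print(bayesian_predictive([1.0, 1.0], [4, 1]))  # [5/7, 2/7]
```

Note that no explicit integration is needed at prediction time: the hyperparameters and the data counts simply add.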

Â 6:11

Now notice what happens here. This parameter alpha that we just defined, which is the sum over all of the alpha_i's, is a parameter known as the equivalent sample size. And it represents the number of, if you will, imaginary samples that I would have seen prior to receiving the new data, x[1] up to x[M]. Now look what happens if we multiply alpha by a constant. Say we double all of our alpha_i's: then we're going to let the M_i's affect our estimate a lot less than for smaller values of alpha. And so the larger the alpha, the more confidence we have in our prior, and the less we let our data move us away from that prior. So let's look at an example of the influence that this might have.

Â influence that this, might have. So let's go back to, binomial data, or

Â Bernoulli random variable. And let's take the simplest example where

Â a prior is uniform for theta in 01. And we've previously seen that, that

Â corresponds to Dirichlet with hyperparameters 1-1.

Â 7:30

So this is a general Dirichlet distribution; in this case the hyperparameters are (1, 1). And let's imagine we get five data instances, of which we have four ones and one zero. Now think about the difference between what maximum likelihood estimation versus Bayesian estimation gives you as the prediction for the sixth coin toss. For maximum likelihood estimation, we have four ones and one zero, so the maximum likelihood estimate is 4/5, and that's going to be the prediction for the sixth instance. The Bayesian prediction, on the other hand, remember, is going to be the hyperparameter alpha_1 + M_1 divided by alpha + M, which in this case is (1 + 4) divided by (2 + 5), and that gives us 5/7.
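The worked example above can be checked directly (helper names mine):

```python
def mle_estimate(ones, zeros):
    # Maximum likelihood: the empirical fraction of ones.
    return ones / (ones + zeros)

def bayesian_estimate(ones, zeros, alpha1=1.0, alpha0=1.0):
    # Bayesian prediction with a Dirichlet(alpha_1, alpha_0) prior:
    # (alpha_1 + M_1) / (alpha + M).
    return (alpha1 + ones) / (alpha1 + alpha0 + ones + zeros)

# Five instances: four ones and one zero.
print(mle_estimate(4, 1))       # 0.8 (= 4/5)
print(bayesian_estimate(4, 1))  # 0.714... (= 5/7)
```

The uniform prior acts like one imaginary one and one imaginary zero, which is why the Bayesian answer is pulled slightly back toward 1/2.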

Â 8:39

Now let's look more qualitatively at the effect on the predictions for the next instance after seeing certain amounts of data. For the moment, we're going to assume that the ratio between the number of 1s and the number of 0s is fixed, so that we have one 1 for every four 0s; that's the data that we're getting. And now let's see what happens as a function of the sample size, as we get more and more data, all of which satisfies this particular ratio. So here we're playing around with different strengths, our equivalent sample size, but we're fixing the ratio of alpha_1 to alpha_0 to represent, in this case, the 50% level. So our prior is uniform, but of greater and greater strength. This little green line down at the bottom represents a low alpha. The line is drawing the posterior over the parameter, or rather, equivalently, the prediction of the next data instance over time. And you can see here that alpha is low, and that means that even for fairly small amounts of data, say twenty data points, we are fairly close to the data estimates. On the other hand, for this bluish line here we can see that the alpha is high, and that means it takes more time for the data to pull us to the empirical fraction of heads versus tails.

Â Now let's look at varying the other parameter.

Â We're going to now fix the equivalent sample size.

Â And we're going to just start out with different prior.

Â And we can see that now we get pulled down to the 0.2 value that we see in the,

Â in the empirical data. and the further away from it.

Â We start, though. It takes us a little bit longer to

Â actually get pulled down to the data estimate.

Â But in all cases, we eventually get convergence to the value in the actual

Â data set. But, from a pragmatic perspective it

Â turns out that Bayesian estimates provide us with a smoothness where the random

Â fluctuations in the data don't don't cause quite as much random jumping

Â around as they do for example in maximum likelihood estimates.

Â So if what we have here is the actual value of the coin toss at different

Â points in the process, you can see that the blue line, this

Â light blue line corresponds to maximum likely data estimation basically bops

Â around the pheromone, especially in the low data regime.

Â Whereas the ones that use a prior, estimate to be the prior are considerably

Â smoother, and less subject to random noise.
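This smoothing effect is easy to reproduce in a small simulation (my own setup, not from the lecture's plots): the Bayesian estimate is always a weighted average of the maximum likelihood estimate and the prior mean, so it cannot jump as far on any single toss.

```python
import random

random.seed(0)

# True probability of a one is 0.2; compare the running maximum likelihood
# estimate to the Bayesian prediction with a Dirichlet(5, 5) prior.
alpha1 = alpha0 = 5.0
ones = zeros = 0
for m in range(1, 51):
    x = 1 if random.random() < 0.2 else 0
    ones += x
    zeros += 1 - x
    mle = ones / (ones + zeros)
    bayes = (alpha1 + ones) / (alpha1 + alpha0 + ones + zeros)

# The Bayesian estimate always lies between the MLE and the prior mean 0.5,
# which is exactly what damps the early random fluctuations.
print(round(mle, 3), round(bayes, 3))
```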

Â In summary, Bayesian prediction combines two types of, you might call them

Â sufficient statistics. There is the sufficient statistics from

Â the real data. But there's also sufficient statistics,

Â from the imaginary samples, that, contribute, eh, to the derscht laid

Â distribution, these alpha hyper parameters, and the basion prediction

Â effectively makes the prediction about the new data instance by combining both

Â of these. Now, as the amount of data increases,

Â that is, at the asymptotic limit of many beta instances.

Â The term that corresponds to the real data samples is going to dominate.

Â And therefore, the prior is going to become vanishingly small in terms of the

Â contribution that it makes. So at the limit, the Bayesian prediction

Â is the same as maximum likelihood destination.

Â 12:50

But initially, in the early stages of estimation, before we have a lot of data, the priors actually make a significant difference. And we see that the Dirichlet hyperparameters basically determine both our prior beliefs, initially before we have a lot of data, as well as the strength of these beliefs, that is, how long it takes for the data to outweigh the prior and move us towards what we see in the empirical distribution. But importantly, even as we've seen here in these very simple examples, and as we'll see later on when we talk about learning with Bayesian networks, it turns out that this Bayesian learning paradigm is considerably more robust in the sparse data regime, in terms of its generalization ability.
