0:03

All right. Now that we know what mean field is and we've derived the formulas,

let's see an example.

It is called the Ising model.

This model is widely used in physics.

So we have a model

that is a two-dimensional lattice.

Its elements are random variables that can take the value -1 or +1.

We also need a neighbors function that returns the set of neighboring elements.

For example, this element here has three neighbors.

We define the joint probability over all these variables in the following way.

It is proportional to the exponent of 1/2 times J,

where J is a parameter of the model,

times the sum over all edges

of the product of the two random variables.

If the neighboring values have the same sign,

the product will contribute +1 to the total sum.

And if the product is -1,

it will contribute -1 to the total sum.

We also have another term: the sum over all nodes of b_i times y_i.

This is called an external field.

Let's denote this function, the exponent of these terms, as the unnormalized distribution p̂(y),

and we'll see what we can do with this model.
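Written out, with n(i) denoting the set of neighbors of node i, the distribution described above is:

```latex
p(y) \;\propto\; \hat{p}(y) \;=\; \exp\!\Big( \tfrac{J}{2} \sum_{i} \sum_{j \in n(i)} y_i\, y_j \;+\; \sum_{i} b_i\, y_i \Big)
```

The factor 1/2 compensates for every edge appearing twice in the double sum over nodes and their neighbors.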

But first of all,

let's interpret it somehow.

If J is greater than zero,

then the neighboring values y_i will tend to have the same sign.

This is the case for ferromagnets,

and the y_i's can be interpreted as spins of atoms.

If J is less than zero,

the neighboring spins will tend to anti-align,

and this is shown on the right.

So, we have defined the distribution up to a normalization constant.

But to compute that constant, we'd have to sum over all possible states.

For an n by n lattice there are 2^(n^2) of them, which seems impossible for large lattices.
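To see the blow-up concretely, here is a brute-force computation of the normalization constant for a tiny lattice. This is only a sketch: the function names are hypothetical, and the edge sums count each horizontal and vertical edge exactly once, so no 1/2 factor is needed.

```python
import itertools

import numpy as np

def log_p_unnorm(y, J, b):
    # Unnormalized log-probability of an Ising configuration y (an n x n array
    # of -1/+1): J times the sum of y_i * y_j over neighboring pairs, plus the
    # external-field term sum of b_i * y_i. Each edge is counted once here.
    pair = (y[:, :-1] * y[:, 1:]).sum() + (y[:-1, :] * y[1:, :]).sum()
    return J * pair + (b * y).sum()

def partition_function(n, J, b):
    # Brute force: sum exp(log p̂) over all 2**(n*n) spin configurations.
    # Feasible only for tiny n; a 30x30 lattice would need 2**900 terms.
    Z = 0.0
    for bits in itertools.product([-1, 1], repeat=n * n):
        y = np.array(bits).reshape(n, n)
        Z += np.exp(log_p_unnorm(y, J, b))
    return Z
```

Already for a 3x3 lattice this loops over 2^9 = 512 configurations, and the count doubles with every extra node.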

So let's try to apply the mean field approximation to compute p(y) approximately.

We'll approximate it by a product of terms,

where each term is a distribution over a single variable.

So here's an example.

We have four nodes here,

and the central node i.

The external field, parameterized by the values

b, is shown here as the yellow and the green sides.

On the yellow side,

there is a strong negative field,

and on the green side there is a strong positive field.

In the positive field,

the values of the corresponding nodes will tend to have a positive sign.

In the negative field, the nodes will tend to have a negative sign.

So actually in this case,

the left node would say something like,

"I feel a negative field on me."

And the other three nodes would say

that they feel the positive field.

So they have some information, and

they will try to contribute it to the current node i.

And this is done using the mean field formula.

So here is our joint distribution,

and here's a small picture of our model.

We're now interested in deriving the update formula for the mean field approximation.

What we'd like to do

is to find q of y_k.

We'll do this using the mean field approximation.

The idea is that the neighboring points already know some information

about the external field b.

For example, this one says that there is an external field with a plus sign,

this one also plus, and here minus and minus.

So they have some information and they want to propagate it to our node.

And we'll see how it is done.

So the formula that we derived in the previous video looks as follows.

We know that the logarithm of q_k,

let me write down an index k here,

equals the expectation over all variables except the k-th,

which we write down as q_{-k},

of the logarithm of the actual distribution that we are trying to approximate.

So it will be p(y), plus some constant.

Notice that we didn't write down the full distribution here,

since we do not know the normalization constant.

However, here we can take it out into the constant.

So now we can omit the terms that do not depend on y_k.

And if we write it down carefully,

we'll get the following formula.

So we have the expectation

over all terms except the k-th,

so over all of these terms.

We can drop the exponent, since the logarithm cancels it.

So let me write it down.

It is J times the sum over j that are

neighbors of k, of y_k times y_j.

I omitted the one half here since in the original formula

we count each edge twice,

and here we only want to count it once.

Plus b_k times y_k.

So this is the expression that was under the exponent, plus a constant.

All right.

We can move the expectation inside and put it under the summation.

We'll get J times the sum over j that are

neighboring points of the current node.

We take the expectation over all variables except the

k-th. So y_k is actually constant with respect to this expectation,

and we can take it out and keep the expectation of y_j.

And the b_k y_k term is simply a constant with respect to the expectation,

so we'll have just b_k times y_k, plus a constant.

So let's denote the expected value of y_j as mu_j.

It's just the mean value of the j-th node.

And actually, the information that node j obtained from the other nodes or

from the external field at its point is contained in the value of mu_j.

So, this equals the following:

we can group the terms corresponding to y_k and get the following function.

It would be y_k times J times the sum of mu_j over the neighbors

Â 7:56

plus b_k. Since we don't want to write this expression down multiple times,

let's say that this thing equals some constant M,

and find the unknown normalization constant.
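Collected as one equation, the derivation above reads (with n(k) the neighbors of node k and mu_j the expectation of y_j):

```latex
\log q_k(y_k)
  = \mathbb{E}_{q_{-k}}\!\Big[\, J \sum_{j \in n(k)} y_k\, y_j \;+\; b_k\, y_k \Big] + \mathrm{const}
  = y_k \Big( \underbrace{J \sum_{j \in n(k)} \mu_j + b_k}_{M} \Big) + \mathrm{const}
  = y_k\, M + \mathrm{const}.
```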

So now, we want to estimate the distribution q_k, but for now it's only known up to a constant.

Let's take the exponent of both parts,

and also remember that q_k should sum to one.

In this case, it means that q of plus one

plus q of minus one should be equal to one.

We can plug this formula in here.

We'll have, for y_k equal to plus one, the exponent of M times a constant,

let's write it down as C,

plus again the same constant C

times e to the power of minus M, and it should be equal to one.

So C should be equal to one over (e to the power of M

plus e to the power of minus M). This is the value of the constant.

And finally, we can compute the probabilities.

The probability that y_k equals plus one

is e to the power of M times this constant C, that is,

e to the power of M over (e to the power of M plus e to the power of minus M). What is this function?

What do you think? If we multiply the numerator and denominator by e to the power of minus M,

we'll have one over (one plus e to the power of minus 2M),

which is exactly the sigmoid function of 2M.
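This identity is easy to check numerically; `q_plus` is just a hypothetical name for the direct form of the probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def q_plus(M):
    # q_k(y_k = +1) = e^M / (e^M + e^-M), the normalized probability derived above.
    return np.exp(M) / (np.exp(M) + np.exp(-M))

# Dividing numerator and denominator by e^M gives 1 / (1 + e^(-2M)) = sigmoid(2M).
```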

All right, so now we can update the distributions;

however, we need to do one more thing.

Notice that here we used the values mu_j.

For the other nodes to be able to use such a constant mu for our node,

we need to update it as well.

We need to compute mu_k. This is the expectation of y_k.

It is simply q at the point plus one

minus q at the point minus one.

We can again plug in similar formulas.

This would be e to the power of M minus e to

the power of minus M, over the normalization constant.

As you may notice,

this actually equals the hyperbolic tangent of M.

Here's our final formula.
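The same kind of numerical check works for the mean update; `mu_update` is a hypothetical helper name:

```python
import numpy as np

def mu_update(M):
    # mu_k = E[y_k] = q(+1) - q(-1) = (e^M - e^-M) / (e^M + e^-M),
    # which is exactly the hyperbolic tangent of M.
    return (np.exp(M) - np.exp(-M)) / (np.exp(M) + np.exp(-M))
```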

Let's see again how it works.

We iterate over our nodes:

we select some node,

compute the probabilities q,

and then update the value of mu_k.

And while we update the probabilities q, we use the values mu_j of the neighbors.

Note that here it is q_k,

which is correct since we're estimating the value of mu_k.
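The whole scheme can be sketched as follows. This is an assumed implementation, not the lecture's code: the function name, the row-by-row sweep order, and the fixed iteration count are choices made for the demo.

```python
import numpy as np

def mean_field_ising(b, J, n_iters=50):
    # Coordinate updates for mean field on a 2D Ising model with external field b.
    # mu[i, j] approximates E[y_ij] under the factorized q. For each node:
    #   M = J * (sum of neighboring mu) + b_ij,
    #   q(y = +1) = sigmoid(2M),  mu = tanh(M).
    n, m = b.shape
    mu = np.zeros((n, m))
    for _ in range(n_iters):
        for i in range(n):
            for j in range(m):
                s = 0.0
                if i > 0:
                    s += mu[i - 1, j]
                if i < n - 1:
                    s += mu[i + 1, j]
                if j > 0:
                    s += mu[i, j - 1]
                if j < m - 1:
                    s += mu[i, j + 1]
                M = J * s + b[i, j]
                mu[i, j] = np.tanh(M)
    return mu
```

With J = 0 each node simply follows its own field, mu = tanh(b): strongly positive field pushes mu toward +1, strongly negative toward -1, and zero field leaves mu = 0, i.e. probability 0.5 for each state.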

Now that we've derived an update formula,

let's see how it works for different values of J.

Here's our setup.

We have two areas:

the white one corresponds to a positive external field,

and the black one corresponds to a negative external field.

If J is 0,

then with probability one, on the white area

the spins will be plus one.

On the black area,

the probability will be one for having minus one.

And everywhere else we'll have probability 0.5 for each possible state.

This happens because there is no interaction between neighboring points when J is 0.

If we have a negative J,

we'll get a chessboard-like field.

The neighboring points will try to have opposite signs,

so there will be blacks and whites side by side.

As we go further from the external field,

its influence gets weaker,

and so when we're really far away from the field,

the probability is nearly 0.5,

which indicates that a node can be either plus one or minus one.

All right. The final example is a strong positive J.

In this case, we'll get a picture like this:

one part would be white, meaning that we'll

have plus one with probability one in the upper left corner,

and everywhere else we'll have minus one with probability one.

Actually, this situation should be symmetric.

Why didn't we get the opposite picture, where the lower right corner would be

black, and everything else would be white?

This is actually a property of the KL divergence.

Here I have a [inaudible] bimodal distribution,

and I try to approximate it by fitting with the KL divergence.

There could be two possible cases.

One, on the left, is that the fit covers one mode,

and the second one is that we fit something in the middle.

What do you think will happen when we minimize the KL divergence

between the bimodal distribution and [inaudible]?

Let's first of all see what the properties of those two fits are.

The second one captures the statistics,

so it would have, for example, the correct mean.

However, the first one has

the very important property that the mode has high probability.

In the second example,

the probability of the mode is really low.

It seems that the mode is actually impossible, and so,

for many practical cases,

the first fit would be nicer, and actually this is the case.

Let's see why this happens. All right.

So here's our KL divergence.

It is an integral of q(z) times the log of the ratio of q(z) to p*(z).

Let's see what happens if we assign

non-zero probability under q where p* has zero probability.

In this case, the KL divergence would have a value of plus infinity.

And so, minimizing the KL divergence will try to avoid

giving non-zero probability to the regions that are impossible under p*.

This is called the zero-avoiding property of the KL divergence,

and it turns out to be useful in many cases.
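A tiny numerical illustration of the zero-avoiding property, using discrete distributions; the three example vectors are made up for the demo:

```python
import numpy as np

def kl(q, p):
    # KL(q || p) = sum over z of q(z) * log(q(z) / p(z)) for discrete
    # distributions; returns +inf if q puts mass where p is zero.
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    if np.any(p[mask] == 0):
        return np.inf
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p_star = np.array([0.5, 0.5, 0.0])  # target: state 2 is impossible
q_bad  = np.array([0.4, 0.3, 0.3])  # puts mass where p_star is zero -> KL = +inf
q_ok   = np.array([0.6, 0.4, 0.0])  # avoids the zero region -> finite KL
```

So any q that assigns probability to a region where p* is zero pays an infinite penalty, which is why the fit concentrates on one mode instead of smearing over both.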
