0:00

In this video, I'll go into more detail about how we can speed up the Boltzmann machine learning algorithm, by using cleverer ways of keeping Markov chains near the equilibrium distribution, or by using what are called mean field methods. The material is quite advanced, so it's not really part of the course. There won't be any quizzes on it and it's not on the final test. You can safely skip this video; it's included for people who are really interested in how to get deep Boltzmann machines to work well.

There are better ways of collecting the statistics than the method that Terry Sejnowski and I originally came up with. If we start from a random state, it may take a long time to reach thermal equilibrium.

0:46

Also, there's no easy test for whether you've reached thermal equilibrium, so we don't know how long we need to run for. So the idea is: why not start from whatever state you ended up in last time you saw that particular data vector? We remember the interpretation of the data vector in the hidden units, and we start from there. This stored state, the interpretation of the data vector, is called a particle. Using particles that persist gives us a warm start, and that has a big advantage: if we were at equilibrium before and we only updated the weights a little bit, it will only take a few updates of the units in a particle to bring it back to equilibrium.

1:47

So, here's the method for collecting the statistics introduced by Radford Neal in 1992. In the positive phase, you have a set of data-specific particles, one or a few per training case. Each particle has a current value, which is a configuration of the hidden units, plus which data vector it goes with. You sequentially update all the hidden units a few times in each particle, with the relevant data vector clamped.

2:31

In the negative phase, you keep a set of fantasy particles; these are global configurations. Again, after each weight update, you sequentially update all the units in each fantasy particle a few times. Now you're updating the visible units as well.

2:50

For every connected pair of units, you average SiSj over all the fantasy particles. The learning rule is then: the change in the weights is proportional to the difference between the average you got with the data, averaged over all training data, and the average you got with the fantasy particles when nothing was clamped. This works better than the learning rule that Terry Sejnowski and I introduced, at least for full-batch learning.
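The learning rule just described can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not the lecture's actual implementation: particles are represented as lists of 0/1 unit states, `pos_particles` are the data-specific particles and `neg_particles` the fantasy particles, and the weight change is proportional to the difference of the pairwise statistics.

```python
def pairwise_stats(particles, i, j):
    """Average s_i * s_j over a set of particles (each a list of 0/1 states)."""
    return sum(p[i] * p[j] for p in particles) / len(particles)

def weight_update(pos_particles, neg_particles, i, j, lr=0.01):
    """Delta w_ij proportional to <s_i s_j>_data - <s_i s_j>_fantasy."""
    return lr * (pairwise_stats(pos_particles, i, j)
                 - pairwise_stats(neg_particles, i, j))
```

The learning rate `lr` is an illustrative parameter; in practice the positive statistics are gathered with the data clamped and the negative statistics from the freely running fantasy particles, exactly as described above.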

3:24

However, it's difficult to apply this approach to mini-batches. The reason is that, by the time we get back to the same data vector, if we're using mini-batch learning, the weights will have been updated many times. So the stored data-specific particle for that data vector won't be anywhere near thermal equilibrium anymore: the hidden units won't be in thermal equilibrium with the visible units of the particle, given the new weights.

3:57

And again, we don't know how long we're going to have to run for before we get close to equilibrium again. We can overcome this by making a strong assumption about how we understand the world. It's a kind of epistemological assumption.

4:16

We're going to assume that when a data vector is clamped, the set of good explanations, that is, states of the hidden units that act as interpretations of that data vector, is unimodal. That means we're saying that, for a given data vector, there aren't two very different explanations for it. We assume that for sensory input there is one correct explanation, and if we have a good model of the data, our model will give us one energy minimum for that data point.

4:50

This is a restriction on the kinds of models we're willing to learn. We're going to use a learning algorithm that's incapable of learning models in which a data vector has many very different interpretations. Provided we're willing to make this assumption, we can use a very efficient method for approaching thermal equilibrium, or an approximation to thermal equilibrium, with the data. It's called a mean field approximation.

5:23

So, if we want to get the statistics right, we need to update the units stochastically and sequentially. The update rule is: the probability of turning on unit i is the logistic function of the total input it receives from the other units and its bias, where Sj, the state of another unit, is a stochastic binary thing. Instead of using that rule, we could say we're not going to keep binary states for unit i; we're going to keep a real value between zero and one, which we call a probability. That probability at time t+1 is going to be the output of the logistic function, whose input is the bias plus the sum, over the other units, of their probabilities at time t times the weights. So, we're replacing the stochastic binary thing by a real-valued probability.
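Here is a minimal Python sketch of the two update rules side by side. The names (`W`, `b`, `states`, `probs`) are illustrative assumptions, with `W` a symmetric weight matrix whose diagonal is zero:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stochastic_update(states, W, b, i):
    """Sample a binary state for unit i: on with probability
    sigmoid(b_i + sum_j w_ij * s_j), where the s_j are binary states."""
    total = b[i] + sum(W[i][j] * states[j] for j in range(len(states)))
    return 1 if random.random() < sigmoid(total) else 0

def mean_field_update(probs, W, b, i):
    """Mean field: replace the binary s_j by real-valued probabilities,
    giving p_i(t+1) = sigmoid(b_i + sum_j w_ij * p_j(t))."""
    total = b[i] + sum(W[i][j] * probs[j] for j in range(len(probs)))
    return sigmoid(total)
```

The only change between the two functions is that the mean field version feeds in probabilities and returns a probability, rather than sampling a binary state.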

6:27

That's not quite right, because this stochastic binary thing is inside a non-linear function. If it was a linear function, things would be fine; but because the logistic is non-linear, we don't get the right answer when we put probabilities, instead of fluctuating binary things, inside it. However, it works pretty well.

6:52

It can go wrong by giving us biphasic oscillations, because now we're going to be updating everything in parallel. We can normally deal with those by using what's called damped mean field, where we compute that probability at time t+1 but don't go all the way there: we go to a point in between where we are now and where the update wants us to go. So, in damped mean field, we go to lambda times the place we are now, plus one minus lambda times the place the update rule tells us to go to. That will kill oscillations.

Now, we can get an efficient mini-batch learning procedure for Boltzmann machines, and this is what Russ Salakhutdinov realized. In the positive phase, we can initialize all the probabilities at 0.5, clamp a data vector on the visible units, and update all the hidden units in parallel using mean field until convergence. For mean field, you can recognize convergence as the point when the probabilities stop changing.

8:14

In the negative phase, we do what we were doing before: we keep a set of fantasy particles, each of which has a value that's a global configuration. After each weight update, we sequentially update all the units in each fantasy particle a few times.

8:31

Then, for every connected pair of units, we average SiSj, these stochastic binary things, over all the fantasy particles. The difference between those averages and the data-driven averages gives the learning rule; that is, we change the weights by an amount proportional to that difference. If we want to make the updates for the fantasy particles more parallel, we can change the architecture of the Boltzmann machine. So, we're going to have a special architecture that allows alternating parallel updates for the fantasy particles: we have no connections within a layer, and no skip-layer connections, but we allow ourselves lots of hidden layers.

9:24

It's really a general Boltzmann machine with lots of missing connections. If all those skip-layer connections were present, we wouldn't really have layers at all; it would just be a general Boltzmann machine. But in this special architecture, there's something nice we can do.

9:43

We can update the states of, for example, the first hidden layer and the third hidden layer, given the current states of the visible units and the second hidden layer. Then we can update the states of the visible units and the second hidden layer, then go back and update the other states, and so on, backwards and forwards. So we can update half of all the units in parallel, and that will be a correct update.

So, one question is: if we have a deep Boltzmann machine like that, trained by using mean field for the positive phase and updating fantasy particles by alternating between even layers and odd layers for the negative phase, can we learn good models of things like the MNIST digits, or indeed of more complicated things?
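The alternating even/odd scheme can be sketched as follows. This is a hypothetical sketch, with biases omitted for brevity: `layers` is a list of layer states (visible layer first), `weights[k]` connects layer k to layer k+1, and because there are no within-layer or skip-layer connections, each unit's input comes only from its neighbouring layers. Real-valued mean-field refreshes are shown here so the example is deterministic; the fantasy particles would use stochastic binary updates of the same form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def refresh_layer(layers, weights, k):
    """Refresh layer k given only its neighbouring layers k-1 and k+1."""
    new = []
    for i in range(len(layers[k])):
        total = 0.0
        if k > 0:                      # input from the layer below
            total += sum(weights[k - 1][j][i] * layers[k - 1][j]
                         for j in range(len(layers[k - 1])))
        if k < len(layers) - 1:        # input from the layer above
            total += sum(weights[k][i][j] * layers[k + 1][j]
                         for j in range(len(layers[k + 1])))
        new.append(sigmoid(total))
    return new

def alternating_sweep(layers, weights):
    """Update the odd-indexed layers in parallel (they depend only on
    even-indexed neighbours), then the even-indexed layers. Half the
    units are updated at once, and each half-update is correct."""
    for k in range(1, len(layers), 2):
        layers[k] = refresh_layer(layers, weights, k)
    for k in range(0, len(layers), 2):
        layers[k] = refresh_layer(layers, weights, k)
    return layers
```

The parallelism is valid because, within each half-step, no two layers being updated are connected to each other.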

So, one way to tell whether you've learned a good model is, after learning, to remove all the input and just generate samples from your model: you run the Markov chain for a long time until it's burned in, and then you look at the samples you get. Russ Salakhutdinov used a deep Boltzmann machine to model MNIST digits, using mean field for the positive phase and alternating updates of the layers of the particles for the negative phase. The real data looks like this, and the data that he got from his model looks like this. You can see they're actually fairly similar; the model is producing things very like the MNIST digits, so it's learned a pretty good model.

11:25

So here's a puzzle. When he was learning that, he was using mini-batches with 100 data examples, and he was also using 100 fantasy particles, the same 100 fantasy particles for every mini-batch. The puzzle is: how can we estimate the negative statistics with only 100 negative examples to characterize the whole space? For all interesting problems, the global configuration space is going to be highly multi-modal.

12:06

There's an interesting answer to this. The learning interacts with the Markov chain that's being used to gather the negative statistics, that is, the one that's used to update the fantasy particles, and it interacts with it so as to give it a much higher effective mixing rate. That means we cannot analyze the learning by thinking of it as an outer loop that updates the weights and an inner loop that gathers statistics with a fixed set of weights: the learning is affecting how effective that inner loop is.

12:42

The reason for this is that wherever the fantasy particles outnumber the positive data, the energy surface is raised, and this has an effect on the mixing rate of the Markov chain. It makes the fantasies rush around hyperactively; they move around much faster than the mixing rate of the Markov chain defined by the current, static weights.

13:10

If there's a mode in the energy surface that has more fantasy particles than data, the energy surface will be raised until the fantasy particles escape from that mode. So, the mode on the left has four fantasy particles and only two data points, and the effect of the learning is going to be to raise the energy there. That energy barrier might be much too high for a Markov chain to cross, so the mixing rate would be very slow. But the learning will actually spill those red particles out of that energy minimum by raising the minimum: the minimum gets filled up, and the fantasy particles escape and go off somewhere else, to some other deep minimum. So, we can get out of minima that the Markov chain would not be able to get out of, at least not in a reasonable time.

14:03

So, what's going on here is that the energy surface is being used for two different purposes. The energy surface represents our model, but it's also being manipulated by the learning algorithm to make the Markov chain mix faster, or rather, to have the effect of a faster-mixing Markov chain. Once the fantasy particles have filled up one hole, they'll rush off somewhere else and deal with the next problem. An analogy is that they're like investigative journalists who rush in to investigate some nasty problem; as soon as the publicity has caused that problem to be fixed, instead of saying okay, everything is okay now, they rush off to find the next nasty problem.