0:00

In this video, I'm going to explain how a Boltzmann machine models a set of binary data vectors. I'm going to start by explaining why we might want to model a set of binary data vectors, and what we could do with such a model if we had it. And then I'm going to show how the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.

Stochastic Hopfield nets with hidden units, which we also call Boltzmann machines, are good at modelling binary data. So, given a set of binary training vectors, they can use the hidden units to fit a model that assigns a probability to every possible binary vector.

There are several reasons why you might like to be able to do that. If, for example, you had several distributions of binary vectors, you might like to look at a new binary vector and decide which distribution it came from. So you might have different kinds of documents, and you might represent a document by a number of binary features, each of which says whether there is more than zero occurrences of a particular word in that document. For different kinds of documents you would expect to see different words, and maybe different correlations between words. So you could use a set of hidden units to model the distribution for each kind of document, and then you could assign a test document to the appropriate class by seeing which class of document is most likely to have produced that binary vector.

You could also use Boltzmann machines for monitoring complex systems to detect unusual behavior.

Suppose, for example, that you have a nuclear power station, and all of the dials are binary. So you get a whole bunch of binary numbers that tell you something about the state of the power station. What you'd like to do is notice that it's in an unusual state, a state that's not like states you've seen before. And you don't want to use supervised learning for that, because you really don't want to have any examples of states that cause it to blow up. You'd rather be able to detect that it's going into such a state without ever having seen such a state before.

And you could do that by building a model of the normal states and noticing that the current state is different from the normal states.

If you have models of several different distributions, you can compute the posterior probability that a particular distribution produced the observed data by using Bayes' theorem. So, given the observed data, the probability that it came from model i, under the assumption that it came from one of your models, is the probability that model i would have produced that data, divided by the same quantity summed over all models.
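As a small sketch of that computation (the likelihood numbers below are made up purely for illustration), the posterior over models is each model's likelihood times its prior, normalized by the sum of the same quantity over all models:

```python
# Posterior over models by Bayes' theorem, under the assumption that the
# data came from exactly one of the candidate models.
def posterior_over_models(likelihoods, priors=None):
    """likelihoods[i] = P(data | model i); returns P(model i | data)."""
    if priors is None:
        # No prior preference: assume each model is equally likely a priori.
        priors = [1.0 / len(likelihoods)] * len(likelihoods)
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)                      # P(data) under the mixture of models
    return [j / z for j in joint]

# Hypothetical likelihoods of one test vector under three document models:
post = posterior_over_models([0.002, 0.0005, 0.0001])
print(post)  # the first model is the most likely to have produced the vector
```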

Now I want to talk about two ways of producing models of data, in particular of binary vectors. The most natural way to think about generating a binary vector is to first generate the states of some latent variables, and then use the latent variables to generate the binary vector. So in a causal model we use two sequential steps. There are the latent variables, or hidden units, and we first pick the states of the latent variables from their prior distribution. Often, in a causal model, these will be independent in the prior, so their probability of turning on, if they are binary latent variables, would just depend on some bias that each one of them has. Then, once we've picked states for those, we use them to generate the states of the visible units by using the weighted connections in the model.

So this is a kind of neural network causal generative model. It uses logistic units, and it uses biases for the hidden units and weights on the connections between hidden and visible units to assign a probability to every possible visible vector. The probability of generating a particular vector v is just the sum, over all possible hidden states, of the probability of generating that hidden state times the probability of generating v given that you've already generated that hidden state. So that's a causal model; factor analysis, for example, is a causal model using continuous variables. And it's probably the most natural way to think about generating data.
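The two sequential steps can be sketched in a few lines. This is a minimal illustration of a causal model with logistic units; the network size, biases, and weights below are made-up values, not anything from the lecture:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical parameters: 2 hidden (latent) units, 3 visible units.
hidden_bias = [0.5, -0.5]            # priors for the independent latent variables
visible_bias = [0.0, 0.2, -0.3]
W = [[1.0, -2.0, 0.5],               # W[k][i]: weight from hidden k to visible i
     [-1.5, 0.8, 2.0]]

def generate():
    # Step 1: pick the hidden states from their independent prior.
    h = [1 if random.random() < sigmoid(b) else 0 for b in hidden_bias]
    # Step 2: use the hidden states to generate the visible vector
    # through the weighted connections and logistic units.
    v = [1 if random.random() < sigmoid(visible_bias[i] +
             sum(h[k] * W[k][i] for k in range(len(h)))) else 0
         for i in range(len(visible_bias))]
    return h, v

random.seed(0)
print(generate())
```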

In fact, some people, when they say "generative model", mean a causal model like this. But a Boltzmann machine is a completely different kind of model. A Boltzmann machine is an energy-based model, and in this kind of model you don't generate data causally.

4:55

It's not a causal generative model. Instead, everything is defined in terms of the energies of joint configurations of visible and hidden units. There are two ways of relating the energy of a joint configuration to its probability. You can simply define the probability of a joint configuration of the visible and hidden variables to be proportional to e to the negative energy of that joint configuration.

Or you can define it procedurally, by saying we are going to define the probability as the probability of finding the network in that state after we've been updating all the stochastic binary units for long enough that we've reached thermal equilibrium. The good news is that those two definitions agree.

The energy of a joint configuration of the visible and hidden units has five terms in it. I've written the negative energy, to save having to put lots of minus signs. So the negative energy of the joint configuration (v, h), with vector v on the visible units and vector h on the hidden units, has bias terms, where v_i is the binary state of the i-th unit in vector v.

Â 6:23

So that's the first two terms: bias terms for the visible units and bias terms for the hidden units. Then there's the visible-visible interactions. To avoid counting each of those interactions twice, we just say we're going to sum over indices i and j and make sure that i is always less than j. That avoids counting the interaction of something with itself, and also avoids counting pairs twice, so we don't have to put a half in front. Then there's the visible-hidden interactions, where w_ik is the weight on a visible-hidden connection. And then there's the hidden-hidden interactions. Written out, the negative energy is

-E(v, h) = sum_i v_i b_i + sum_k h_k b_k + sum_{i<j} v_i v_j w_ij + sum_{i,k} v_i h_k w_ik + sum_{k<l} h_k h_l w_kl

The way we use the energies to define probabilities is that the probability of a joint configuration over v and h is proportional to e to the minus E(v, h). To make that an equality, we need to normalize the right-hand side by the sum over all possible configurations of the visible and hidden units, and that's what the divisor there is. That's often called the partition function.

That's what physicists call it. And notice that it has exponentially many terms. To get the probability of a configuration of the visible units alone, we have to sum over all possible configurations of the hidden units. So P(v) is the sum, over all possible h, of e to the minus the energy you get with that h, normalized by the partition function.

I want to give you an example of how we compute the probabilities of the different visible vectors, because that will give you a good feel for what's involved. It's all very well to see the equations, but I find that I understand it much better when I've worked through the computation. So let's take a network with two hidden units and two visible units, and we'll ignore biases, so we've just got three weights here. To keep things simple, I'm not going to connect the visible units to each other. The first thing we do is write down all possible states of the visible units. I've put them in different colors, and I'm going to write each state four times, because for each state of the visible units there are four possible states of the hidden units that could go with it. So that gives us sixteen possible joint configurations. Now, for each of those joint configurations, we're going to compute its negative energy, minus E.

So if you look at the first line, where all of the units are on, the negative energy will be +2 - 1 + 1 = +2.

9:12

So these values of e to the minus E are the unnormalized probabilities of the configurations; their probabilities are proportional to them. If we add all those up, we get 39.7, and when we divide everything by 39.7 we get the probabilities of the joint configurations. There they all are. Now, if we want the probability of a particular visible configuration, we have to sum over all the hidden configurations that could go with it, so we add up the numbers in each block. And now we've computed the probability of each possible visible vector in a Boltzmann machine that has these three weights in it.
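The whole worked example can be reproduced in a few lines of code. Assuming the three weights are v1-h1 = 2, h1-h2 = -1, and v2-h2 = 1 (an assignment consistent with the negative energy of +2 when all four units are on), we enumerate all sixteen joint configurations:

```python
import math
from itertools import product

# Three weights, no biases, no visible-visible connections (an assumption
# matching the example: v1-h1 = 2, h1-h2 = -1, v2-h2 = 1).
w_v1h1, w_h1h2, w_v2h2 = 2.0, -1.0, 1.0

def neg_energy(v1, v2, h1, h2):
    return w_v1h1 * v1 * h1 + w_h1h2 * h1 * h2 + w_v2h2 * v2 * h2

# Unnormalized probabilities e^{-E} for all 16 joint configurations.
unnorm = {cfg: math.exp(neg_energy(*cfg)) for cfg in product((0, 1), repeat=4)}
Z = sum(unnorm.values())                  # the partition function
print(round(Z, 2))                        # 39.69; the lecture rounds it to 39.7

# P(v): add up the normalized probabilities in each block of hidden states.
p_v = {}
for (v1, v2, h1, h2), u in unnorm.items():
    p_v[(v1, v2)] = p_v.get((v1, v2), 0.0) + u / Z
for v, p in sorted(p_v.items()):
    print(v, round(p, 3))
```

With these weights, the four visible vectors come out at roughly 0.085, 0.144, 0.305, and 0.466, summing to one.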

Now let's ask how we get a sample from the model when the network's bigger than that. Obviously, in the network we just computed, we could figure out the probability of everything because it's small. But when the network's big, we can't do these exponentially large computations. So if there's more than a few hidden units, we can't actually compute that partition function; there are too many terms in it. But we can use Markov chain Monte Carlo to get samples from the model, by starting from a random global configuration and then picking units at random and updating them stochastically based on their energy gaps, those energy gaps being determined by the states of all the other units in the network. If we keep doing that until the Markov chain reaches its stationary distribution, then we have a sample from the model.
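A sketch of that procedure on the same little network (the chain lengths and seed are arbitrary choices, and the weights are the assumed ones from the worked example): pick a unit at random, compute its energy gap from the states of the other units, and turn it on with the logistic probability of that gap. The empirical frequencies of the visible vectors should then approach the exact probabilities, which for these weights are about 0.085, 0.144, 0.305, and 0.466:

```python
import math
import random

# State order: [v1, v2, h1, h2]; W[a][b] is the symmetric weight between a and b.
# Assumed weights from the worked example: v1-h1 = 2, h1-h2 = -1, v2-h2 = 1.
W = [[0, 0, 2, 0],
     [0, 0, 0, 1],
     [2, 0, 0, -1],
     [0, 1, -1, 0]]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_sample(steps):
    state = [random.randint(0, 1) for _ in range(4)]  # random global configuration
    for _ in range(steps):
        i = random.randrange(4)                       # pick a unit at random
        # Energy gap = total input from all the other units (no biases here).
        gap = sum(W[i][j] * state[j] for j in range(4) if j != i)
        state[i] = 1 if random.random() < sigmoid(gap) else 0
    return state

# Empirical distribution over visible vectors from many independent chains:
random.seed(0)
counts = {}
for _ in range(20000):
    v1, v2, _, _ = gibbs_sample(40)
    counts[(v1, v2)] = counts.get((v1, v2), 0) + 1
for v in sorted(counts):
    print(v, counts[v] / 20000)
```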

And the probability of that sample is related to its energy by the Boltzmann distribution; that is, the probability of the sample is proportional to e to the minus energy. What about getting a sample from the posterior distribution over hidden configurations, given a data vector? It turns out we're going to need that for learning.
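That posterior sample can be drawn with the same kind of stochastic updates, but with the visible units clamped to the data vector so that only the hidden units get updated. A sketch for the two-hidden-unit network above, with the same assumed weights (v1-h1 = 2, h1-h2 = -1, v2-h2 = 1):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden_given_visible(v1, v2, steps=50):
    """Gibbs sampling with the visible units clamped to a data vector (v1, v2)."""
    h1, h2 = random.randint(0, 1), random.randint(0, 1)
    for _ in range(steps):
        if random.randrange(2) == 0:
            # Update h1; its energy gap comes from the clamped v1 and from h2.
            h1 = 1 if random.random() < sigmoid(2 * v1 - 1 * h2) else 0
        else:
            # Update h2; its energy gap comes from the clamped v2 and from h1.
            h2 = 1 if random.random() < sigmoid(1 * v2 - 1 * h1) else 0
    return h1, h2

random.seed(0)
print(sample_hidden_given_visible(1, 1))  # one posterior sample for v = (1, 1)
```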

Â