0:00

In this lecture, I'll introduce belief nets.

One of the reasons I abandoned back propagation in the 1990s is that it required too many labels. Back then, we just didn't have data sets with sufficient numbers of labels. I was also influenced by the fact that people manage to learn with very few explicit labels. However, I didn't want to abandon the advantages of doing gradient descent learning to learn a whole bunch of weights.

So the issue was, was there another objective function that we could do gradient descent on? The obvious place to look was generative models, where the objective function is to model the input data rather than predicting a label. This meshed nicely with a major movement in statistics and artificial intelligence called graphical models. The idea of graphical models was to combine discrete graph structures, for representing how variables depend on one another, with real-valued computations that infer the probability of one variable given the observed values of other variables.

Boltzmann machines were actually a very early example of a graphical model, but they were undirected graphical models. In 1992, Radford Neal pointed out that, using the same kinds of units as we used in Boltzmann machines, we could make directed graphical models, which he called sigmoid belief nets. And the issue then became, how can we learn sigmoid belief nets?

The second problem is that for deep networks, the learning time does not scale well. When there were multiple hidden layers, the learning was very slow. You might ask why this was, and we now know that one of the reasons was that we did not initialize the weights in a sensible way. Yet another problem is that back propagation can get stuck in poor local optima. These are often quite good, so back propagation is useful. But we can now show that for deep nets, the local optima you get stuck in if you start with small random weights are typically far from optimal.

There is the possibility of retreating to simpler models that allow convex optimization, but I don't think this is a good idea. Mathematicians like to do that because they can prove things. But in practice, you're just running away from the complexity of real data.

So, one way to overcome the limits of back propagation is by using unsupervised learning. The idea is that we want to keep the efficiency and simplicity of using a gradient method and stochastic mini-batch descent for adjusting the weights. But we're going to use that method for modeling the structure of the sensory input, not for modeling the relation between input and output. So the idea is, the weights are going to be adjusted to maximize the probability that a generative model would have generated the sensory input. We already saw that in learning Boltzmann machines. And one way to think about it is, if you want to do computer vision, you should first learn to do computer graphics. To first order, computer graphics works and computer vision doesn't.

The learning objective for a generative model, as we saw with Boltzmann machines, is to maximize the probability of the observed data, not to maximize the probability of labels given inputs. Then the question arises, what kind of generative model should we learn? We might learn an energy-based model like the Boltzmann machine, or we might learn a causal model made of idealized neurons, and that's what we'll look at first.
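The objective just described, maximizing the probability of the observed data rather than of labels, can be written out as a formula. The notation below ($v$ for a visible training vector, $h$ for a configuration of hidden variables, $W$ for the weights, $\mathcal{D}$ for the training set) is my own shorthand; the lecture states the objective only in words:

```latex
\max_{W} \sum_{v \in \mathcal{D}} \log p(v \mid W)
\;=\; \max_{W} \sum_{v \in \mathcal{D}} \log \sum_{h} p(h \mid W)\, p(v \mid h, W)
```

The inner sum over $h$ is what makes this hard: each visible vector could have been generated by many different hidden configurations.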

Â 3:41

Well finally, we might learn some kind of hybrid of the two, and that's where we'll end up.

So, before I go into causal belief nets made of neurons, I want to give you a little bit of background about artificial intelligence and probability. In the 1970s and early 1980s, people in artificial intelligence were unbelievably anti-probability. When I was a graduate student, if you mentioned probability, it was a sign that you were stupid and that you just hadn't got it. Computers were all about discrete symbolic processing, and if you introduced any probabilities, they would just infect everything. It's hard to conceive of how much people were against probability, so here's a quote to help you. I'll read it out.

"Many ancient Greeks supported Socrates' opinion that deep, inexplicable thoughts came from the gods. Today's equivalent to those gods is the erratic, even probabilistic, neuron. It is more likely that increased randomness of neural behavior is the problem of the epileptic and the drunk, not the advantage of the brilliant." That was in the first edition of Patrick Henry Winston's AI textbook, and it was the general opinion at the time. Winston was to become the leader of the MIT AI Lab.

Here's an alternative view: "All of this will lead to theories of computation which are much less rigidly of an all-or-none nature than past and present formal logic."

Â 5:41

I think if von Neumann had lived, the history of artificial intelligence might have been somewhat different.

So, probabilities eventually found their way into AI through something called graphical models, which are a marriage of graph theory and probability theory. In the 1980s, there was a lot of work on expert systems in AI that used bags of rules for tasks such as medical diagnosis or exploring for minerals. Now, these were practical problems, so they had to deal with uncertainty. They couldn't just use toy examples where everything was certain. People in AI disliked probability so much that, even when they were dealing with uncertainty, they didn't want to use probabilities. So they made up their own ways of dealing with uncertainty that did not involve probabilities. You can actually prove that this is a bad bet. Graphical models were introduced by Pearl, Heckerman, Lauritzen, and many others, who showed that probabilities actually worked better than the ad hoc methods developed by people doing expert systems.

Discrete graphs were good for representing which variables depended on which other variables. But once you have those graphs, you then need to do real-valued computations that respect the rules of probability, so that you can compute the expected values of some nodes in the graph given the observed states of other nodes.

Belief nets is the name that people in graphical models give to a particular subset of graphs: directed acyclic graphs. And typically, they use sparsely connected ones. If those graphs are sparsely connected, they have clever inference algorithms that can compute the probabilities of unobserved nodes efficiently. But these clever algorithms are exponential in the number of nodes that influence each node, so they won't work for densely connected networks.
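To make that exponential cost concrete, here is a tiny sketch (my own illustration, not from the lecture): a binary node with k parents needs one conditional probability for each joint on/off configuration of its parents, so exact inference must handle 2^k cases per node.

```python
# Number of conditional-probability entries a binary node needs,
# one per joint on/off configuration of its k parents.
def cpt_entries(num_parents: int) -> int:
    return 2 ** num_parents

for k in (2, 5, 10, 20):
    print(k, cpt_entries(k))
# With 20 parents, a single node already needs 1,048,576 entries,
# which is why dense connectivity defeats exact inference.
```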

So, a belief net is a directed acyclic graph composed of stochastic variables, and here's a picture of one. In general, you might observe any of the variables. I'm going to restrict myself to nets in which you only observe the leaf nodes. So, we imagine there are these unobserved hidden causes, and they may be layered, and they eventually give rise to some observed effects.

Once we observe some variables, there are two problems we'd like to solve. The first is what I call the inference problem, and that's to infer the states of the unobserved variables. Of course, we can't infer them with certainty, so what we're after is the probability distributions of the unobserved variables. And if the unobserved variables are not independent of one another given the observed variables, those probability distributions are likely to be big, cumbersome things with an exponential number of terms in them.
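As a sketch of why the posterior blows up, here is a brute-force computation of p(hidden | observed) for a tiny one-hidden-layer net of binary stochastic units. The sizes and random weights are my own illustrative assumptions, not from the lecture; the point is that even exact normalization requires one joint-probability term per hidden configuration, i.e. 2^n of them.

```python
from itertools import product
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_prob(h, v, W, b_h, b_v):
    """p(h, v) for a one-hidden-layer net of binary stochastic units:
    each hidden unit is on with probability sigmoid(b_h), and each
    visible unit is on with probability sigmoid(b_v + h @ W)."""
    p_h = sigmoid(b_h)
    p_v = sigmoid(b_v + h @ W)
    return (np.prod(np.where(h == 1, p_h, 1 - p_h))
            * np.prod(np.where(v == 1, p_v, 1 - p_v)))

rng = np.random.default_rng(0)
n_hidden, n_visible = 4, 3               # tiny, so enumeration is feasible
W = rng.normal(size=(n_hidden, n_visible))
b_h = rng.normal(size=n_hidden)
b_v = rng.normal(size=n_visible)
v = np.array([1.0, 0.0, 1.0])            # one observed leaf vector

# The exact posterior has one term per hidden configuration: 2**n_hidden.
configs = [np.array(c, dtype=float) for c in product([0, 1], repeat=n_hidden)]
joints = np.array([joint_prob(h, v, W, b_h, b_v) for h in configs])
posterior = joints / joints.sum()        # normalizing touches all 16 terms
```

With 4 hidden units this is 16 terms; with 100 hidden units it would be 2^100, which is why exact inference is hopeless in densely connected nets.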

Â 8:45

The second problem is the learning problem. That is, given a training set composed of observed vectors of states of all of the leaf nodes, how do we adjust the interactions between variables to make the network more likely to generate that training data? So, adjusting the interactions would involve both deciding which node is affected by which other node, and also deciding on the strength of that effect.

So, let me just say a little bit about the relationship between graphical models and neural networks.

Â 9:54

For graphical models, the graph was typically sparsely connected, and the initial problem they focused on was how to do correct inference. Initially, they weren't interested in learning, because the knowledge came from the experts. By contrast, for neural nets, learning was always a central issue, and hand-wiring the knowledge was regarded as not cool. Although, of course, wiring in some basic properties, as in convolutional nets, was a very sensible thing to do. But basically, the knowledge in the net came from learning the training data, not from experts. Neural networks didn't aim for interpretability or sparse connectivity to make inference easy. Nevertheless, there are neural network versions of belief nets.

So, if we think about how to make generative models out of idealized neurons, there are basically two types of generative model you can make. There's energy-based models, where you connect binary stochastic neurons using symmetric connections; then you get a Boltzmann machine. A Boltzmann machine, as we've seen, is hard to learn. But if we restrict the connectivity, it's easy to learn a restricted Boltzmann machine. However, when we do that, we've only learned one hidden layer. And so we're giving up on a lot of the power of neural nets with multiple hidden layers in order to make learning easy.

The other kind of model you can make is a causal model, that is, a directed acyclic graph composed of binary stochastic neurons. And when you do that, you get a sigmoid belief net. In 1992, Neal introduced models like this, compared them with Boltzmann machines, and showed that sigmoid belief nets were slightly easier to learn. So, a sigmoid belief net is just a belief net in which all of the variables are binary stochastic neurons.

To generate data from this model, you take the neurons in the top layer and determine whether they should be ones or zeros based on their biases; you determine that stochastically. And then, given the states of the neurons in the top layer, you make stochastic decisions about what the neurons in the middle layer should be doing. And then, given their binary states, you make decisions about what the visible effects should be. And by doing that sequence of operations, a causal sequence from layer to layer, you get an unbiased sample of the kinds of vectors of visible values that your neural network believes in. So, in a causal model, unlike a Boltzmann machine, it's easy to generate samples.
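The top-down generation procedure just described can be sketched in a few lines. Everything concrete here (the layer sizes and the random weights) is an illustrative assumption of mine, not something specified in the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(parents, W, b):
    """Sample one layer of binary stochastic neurons given its parents:
    each unit turns on independently with probability sigmoid(b + parents @ W)."""
    p_on = sigmoid(b + parents @ W)
    return (rng.random(p_on.shape) < p_on).astype(float)

# Hypothetical net: 3 top units -> 4 middle units -> 5 visible units.
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
b0, b1, b2 = rng.normal(size=3), rng.normal(size=4), rng.normal(size=5)

# Top layer is sampled from its biases alone (it has no parents).
top = (rng.random(3) < sigmoid(b0)).astype(float)
middle = sample_layer(top, W1, b1)       # stochastic, given the top layer
visible = sample_layer(middle, W2, b2)   # stochastic, given the middle layer
# `visible` is one unbiased sample of the vectors the net believes in.
```

Because each layer is sampled only after its parents, this single top-down pass yields an unbiased sample, which is exactly what makes generation easy in a causal model and hard in a Boltzmann machine.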
