0:00

In this video, I'm going to talk about some recent work on learning a joint

model of captions and feature vectors that describe images.

In the previous lecture, I talked about how we might extract semantically

meaningful features from images. But we were doing that with no help from

the captions. Obviously the words in a caption ought to

be helpful in extracting appropriate semantic categories from images.

And similarly, the images ought to be helpful in disambiguating what the words

in the caption mean. So the idea is we're going to train a

great big net that gets as its input standard computer-vision feature vectors

extracted from images and bag-of-words representations of captions, and learns

how the two input representations are related to each other.

At the end of the video I'll show you a movie of the final network using words to

create feature vectors for images and then showing you the closest image in its

database. And also using images to create bags of

words. I'm now going to describe some work by

Nitish Srivastava, who's one of the TAs for this course, and Ruslan Salakhutdinov,

that will appear shortly. The goal is to build a joint density

model of captions and of images, except that the images are represented by the

features standardly used in computer vision rather than by the raw pixels. This needs

a lot more computation than building a joint density model of labels and digit

images, which we saw earlier in the course.

So what they did was they first trained a multi-layer model of images alone.

That is, it's really a multi-layer model of the standard computer-vision

features they extracted from the images.

Then separately, they trained a multi-layer model of the word-count vectors from the

captions. Once they trained both of those models,

they added a new top layer that's connected to the top layers of both of

the individual models. After that, they used further joint

training of the whole system so that each modality can improve the earlier layers

of the other modality. Instead of using a deep belief net, which

is what you might expect, they used a deep Boltzmann machine, which has symmetric

connections between adjacent layers. The further joint training of the whole

deep Boltzmann machine is then what allows each modality to change the

feature detectors in the early layers of the other modality.

That's the reason they used a deep Boltzmann machine.
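To make the wiring described so far concrete, here is a minimal sketch of the bimodal architecture: two separately pretrained pathways whose top hidden layers both feed one joint top layer. All layer sizes and variable names here are invented for illustration; they are not taken from the paper.

```python
# Hypothetical layer sizes, just to make the wiring concrete (all sizes invented).
image_layers   = [3000, 1024, 1024]   # vision feature vector -> hidden layers
caption_layers = [2000, 1024, 1024]   # word-count vector -> hidden layers
joint_top      = 2048                 # extra layer connected to BOTH top hidden layers

# The joint top layer has one weight matrix to each modality's top hidden layer,
# so further joint training can send learning signals back down both pathways.
image_to_joint   = (image_layers[-1],   joint_top)   # weight-matrix shape
caption_to_joint = (caption_layers[-1], joint_top)   # weight-matrix shape
```

Because the connections are symmetric (a deep Boltzmann machine, not a belief net), each modality's evidence can flow through the joint layer and adjust the early feature detectors of the other modality.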

They could've also used a deep belief net, and done generative fine tuning with

contrastive wake-sleep. But the fine-tuning algorithm for deep

Boltzmann machines may well work better. This leaves the question of how they

pretrained the hidden layers of a deep Boltzmann machine.

because what we've seen so far in the course is that if you train a stack of

restricted Boltzmann machines and you combine them together into a single

composite model what you get is a deep belief net not a deep Boltzmann machine.

So I'm now going to explain how, despite what I said earlier in the course, you

can actually pre-train a stack of restricted Boltzmann machines in such a

way that you can then combine them to make a deep Boltzmann machine.

The trick is that the top and the bottom restricted Boltzmann machines in the stack

have to be trained with weights that are twice as big in one direction as in the other.

So, the bottom Boltzmann machine, that looks at the visible units is trained

with the bottom up weights being twice as big as the top down weights.

Apart from that factor of two, the weights are symmetrical.

So this is what I call scale-symmetric.

The bottom-up weights are always twice as big as their top-down

counterparts. This can be justified, and I'll show you

the justification in a little while. The next restrictive Boltzmann machine in

the stack, is trained with symmetrical weights.

I've called them 2W here, rather than W, for reasons you'll see later.

We can keep training restricted Boltzmann machines like that, with genuinely

symmetrical weights. But then the top one in the stack has

to be trained with the bottom-up weights being half of the top-down weights.

So again, these are scale symmetric weights, but now, the top down weights

are twice as big as the bottom up weights.

That's the opposite of what we had when we trained the first restricted Boltzmann

machine in the stack. After having trained these three

restricted Boltzmann machines, we can then combine them to make a composite model,

and the composite model looks like this. For the restricted Boltzmann machine in the

middle, we simply halve its weights. That's why they were 2W2 to begin with.

5:01

For the one at the bottom, we've halved the up-going weights but kept the

down-going weights the same. And for the one at the top we've halved

the down-going weights and kept the up-going weights the same.
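The combination rule just described can be sketched in a few lines of NumPy. The layer sizes and variable names here are made up for illustration (the pretrained weights are random stand-ins, not trained values); the point is only which matrices get halved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the pretrained RBM weights (random here, trained in practice).
# Bottom RBM: bottom-up weights 2*W1, top-down weights W1.
# Middle RBM: symmetric weights 2*W2.
# Top RBM:    bottom-up weights W3, top-down weights 2*W3.
n_v, n_h1, n_h2, n_h3 = 784, 500, 500, 1000
bottom_up_1 = 2 * rng.normal(scale=0.01, size=(n_v, n_h1))   # 2*W1, V -> H1
middle      = 2 * rng.normal(scale=0.01, size=(n_h1, n_h2))  # 2*W2, H1 <-> H2
top_down_3  = 2 * rng.normal(scale=0.01, size=(n_h3, n_h2))  # 2*W3, H3 -> H2

# Combining into a deep Boltzmann machine:
# - bottom RBM: halve the up-going weights, keep the down-going ones,
# - middle RBM: halve the (symmetric) weights,
# - top RBM:    halve the down-going weights, keep the up-going ones.
W1 = bottom_up_1 / 2        # symmetric DBM weights between V  and H1
W2 = middle / 2             # symmetric DBM weights between H1 and H2
W3 = (top_down_3 / 2).T     # symmetric DBM weights between H2 and H3
```

After halving, every pair of adjacent layers is joined by a single symmetric weight matrix, which is exactly what a deep Boltzmann machine requires.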

Now the question is: why do we do this funny business of halving the weights?

The explanation is quite complicated but I'll give you a rough idea of what's

going on. If you look at the layer H1,

we have two different ways of inferring the states of the units in H1 in the

stack of restricted Boltzmann machines on the left.

We can either infer the states of H1 bottom up from V or we can infer the

states of H1 top down from H2. When we combine these Boltzmann machines

together, what we're going to do is take an average of those two ways of

inferring H1. And to take a geometric average, what we

need to do, is halve the weights. So we're going to use half of what the

bottom up model says. So that's half of 2W1.

And we're going to use half of what the top down model says.

That's half of 2W2. And if you look at the deep Boltzmann

machine on the right, that's exactly what's being used to infer the state of

H1. In other words, if you're given the

states in H2, and you're given the states in V, those are the weights you'll use

for inferring the states of H1. The reason we need to halve the weights

is so that we don't double count. You see, in the Boltzmann machine on the

right, the state of H2 already depends on V.

At least it does after we've done some settling down in the Boltzmann Machine.

So if we were to use the bottom up input coming from the first restricted

Boltzmann Machine in the stack. And we use the top down input coming from

the second Boltzmann Machine in the stack, we'd be counting the evidence

twice, because we'd be inferring H1 from V. And we'd also be inferring it from H2,

which, itself, depends on V. In order not to double count the

evidence, we have to halve the weights. That's a very high level and perhaps not

totally clear description of why we have to halve the weights.

If you want to know the mathematical details, you can go and read the paper.

But that's what's going on. And that's why we need to halve the

weights. So that the intermediate layers can be

doing geometric averaging of the two different models of that layer, from the

two different restricted Boltzmann machines in the original stack.
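The geometric averaging can be checked numerically: inferring H1 in the combined deep Boltzmann machine, with the halved weights W1 and W2, gives exactly half of the bottom-up message through 2*W1 plus half of the top-down message through 2*W2. A minimal NumPy sketch, with tiny invented layer sizes and random weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_v, n_h1, n_h2 = 6, 4, 3
W1 = rng.normal(size=(n_v, n_h1))   # DBM weights V  <-> H1 (already halved)
W2 = rng.normal(size=(n_h1, n_h2))  # DBM weights H1 <-> H2 (already halved)

v  = rng.integers(0, 2, size=n_v).astype(float)   # visible states
h2 = rng.integers(0, 2, size=n_h2).astype(float)  # states of H2

# In the DBM, H1 combines evidence from below (v @ W1) and above (W2 @ h2).
p_h1 = sigmoid(v @ W1 + W2 @ h2)

# This equals the geometric average of the two RBMs' ways of inferring H1:
# half of the bottom-up input through 2*W1, half of the top-down input
# through 2*W2.
stack_average = sigmoid(0.5 * (v @ (2 * W1)) + 0.5 * ((2 * W2) @ h2))
assert np.allclose(p_h1, stack_average)
```

Halving the weights is thus what turns "count the evidence from V twice" into "average the two predictions once", which is why the combined model remains sensible.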