
In this video I'm going to describe how to use an RBM to model real-valued data. The idea is that we make the visible units, instead of binary stochastic units, linear units with Gaussian noise. When we do this we get problems with learning, and it turns out a good solution to those problems is to make the hidden units rectified linear units. With linear Gaussian units for the visibles and rectified linear units for the hiddens, it's quite easy to learn a restricted Boltzmann machine that makes a good model of real-valued data.

We first used restricted Boltzmann machines with images of handwritten digits. For those images, intermediate intensities caused by a pixel being only partially inked can be modelled quite well by probabilities, that is, numbers between 0 and 1 that are the probability of a logistic unit being on. So we treat a partially inked pixel as having a probability of being inked. This is incorrect, but it works quite well.

However, it won't work for real images. In a real image the intensity of a pixel is almost always almost exactly the average of its neighbours. So it has a very high probability of being very close to that average and a very small probability of being a little further away. You can't achieve that with a logistic unit: mean-field logistic units are unable to represent things like "the intensity is 69, but it's very unlikely to be 71 or 67". So we need some other kind of unit, and the obvious thing to use is a linear unit with Gaussian noise.

So we model pixels as Gaussian variables. We can still use alternating Gibbs sampling to run the Markov chain required for contrastive divergence learning, but we need to use a much smaller learning rate, otherwise the learning tends to blow up.
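As a concrete illustration, here is a minimal sketch of one CD-1 update for a Gaussian-visible, binary-hidden RBM. All sizes, names, and the tiny learning rate are illustrative assumptions, not a reference implementation from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, sigma, lr=1e-3):
    """One CD-1 update for a Gaussian-visible, binary-hidden RBM.

    v0: (n_vis,) real-valued data vector.
    W:  (n_vis, n_hid) weights; b_vis, b_hid: biases; sigma: (n_vis,) std devs.
    Note the much smaller learning rate than a binary-binary RBM would use.
    """
    # Up pass: visible activities are measured in units of their std dev.
    p_h0 = sigmoid(b_hid + (v0 / sigma) @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Down pass: reconstruct visibles as Gaussians whose mean is shifted
    # by the top-down input (mean = b_i + sigma_i * sum_j h_j w_ij).
    v1 = b_vis + sigma * (W @ h0) + sigma * rng.standard_normal(v0.shape)

    # Up again for the negative statistics (probabilities, not samples).
    p_h1 = sigmoid(b_hid + (v1 / sigma) @ W)

    # Contrastive divergence update: positive minus negative statistics.
    dW = np.outer(v0 / sigma, p_h0) - np.outer(v1 / sigma, p_h1)
    return W + lr * dW, v1

# Tiny usage example with made-up sizes.
n_vis, n_hid = 4, 8
W = 0.01 * rng.standard_normal((n_vis, n_hid))
v = rng.standard_normal(n_vis)
W_new, v_recon = cd1_step(v, W, np.zeros(n_vis), np.zeros(n_hid), np.ones(n_vis))
```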

The energy function looks like this. The first term on the right-hand side is a kind of parabolic containment function: it stops things blowing up. The energy contributed by the i-th visible unit is parabolic in shape. It's a parabola with its minimum at the bias of the i-th unit, and as the i-th unit departs from that value, energy is added quadratically. So that term tries to keep the i-th visible unit close to its bias b_i.
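The energy function being described is the standard one for a Gaussian-binary RBM; in the lecture's symbols (visible units v_i with biases b_i and standard deviations sigma_i, hidden units h_j with biases b_j, weights w_ij) it reads:

```latex
E(\mathbf{v},\mathbf{h}) \;=\;
  \sum_{i \in \mathrm{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  \;-\; \sum_{j \in \mathrm{hid}} b_j h_j
  \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}
```

The first sum is the parabolic containment term, and the last sum is the visible-hidden interaction term discussed next.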

The interaction term between the visible and the hidden units looks like this. If you differentiate it with respect to v_i, you get a constant: the sum over all j of h_j w_ij divided by sigma_i. So that term, with its constant gradient, is linear in v_i. When you add that linear top-down contribution to the energy together with the parabolic containment function, you get a parabolic function again, but with its minimum shifted away from b_i. How much it is shifted depends on the slope of that linear term. So the effect of the hidden units is just to push the mean to one side.

It's easy to write down an energy function like this, and it's easy to take derivatives of it. But when we try learning with it, we often get problems. There were a lot of reports in the literature that people could not get these Gaussian-binary RBMs to work, and it is indeed extremely hard to learn tight variances for the visible units. It took us a long time to figure out why it's so hard to learn those visible variances. This picture helps.

Consider the effect that visible unit i has on hidden unit j. When visible unit i has a small standard deviation sigma_i, that has the effect of exaggerating the bottom-up weights, because we measure the activity of i in units of its standard deviation: when the standard deviation is small, the weight gets multiplied by a large factor. The top-down effect of j on i, by contrast, is multiplied by sigma_i. So when the standard deviation of visible unit i is very small, the bottom-up effects get exaggerated and the top-down effects get attenuated. The result is a conflict: either the bottom-up effects are much too big, or the top-down effects are much too small. Consequently the hidden units tend to saturate, being firmly on or off all the time, and that messes up learning.
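The mismatch is easy to see with one made-up weight and a tight standard deviation; the numbers below are purely illustrative:

```python
# Toy illustration of the bottom-up / top-down conflict (all numbers made up).
# One weight w connects visible unit i (std dev sigma_i) to hidden unit j.
w = 0.1
sigma_i = 0.01          # a tight visible standard deviation

# Bottom-up: v_i enters the hidden unit's input as (v_i / sigma_i) * w,
# so the effective bottom-up weight is w / sigma_i -- exaggerated.
bottom_up = w / sigma_i

# Top-down: h_j shifts the mean of v_i by sigma_i * w -- attenuated.
top_down = w * sigma_i

# The bottom-up effect exceeds the top-down one by a factor of 1/sigma_i**2,
# so the hidden unit saturates long before the top-down effect is useful.
ratio = bottom_up / top_down
```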

So the solution is to have many more hidden units than visible units. That allows small weights between the visible and hidden units to have big top-down effects, because there are so many hidden units. But of course, we really need the number of active hidden units to grow as the standard deviation sigma_i gets smaller. On the next slide, we'll see how we can achieve that.

I'm going to introduce stepped sigmoid units. The idea is that we make many copies of each stochastic binary hidden unit. All the copies have the same weights and the same learned bias b, but in addition to that adapted bias they each have a fixed offset. The first unit has an offset of -0.5, the second an offset of -1.5, the third an offset of -2.5, and so on. If you have a whole family of sigmoid units like that, with the bias changing by one between neighbouring members of the family, the response curve looks like this: if the total input is very low, none of them turn on; as it increases, the number that turn on increases linearly.

This means that as the standard deviation on the previous slide gets smaller, the number of copies of each hidden unit that get turned on gets bigger, and we achieve just the effect we wanted: more top-down effect to drive the visible units that have small standard deviations.

Now, it's quite expensive to use a big population of binary stochastic units with offset biases, because for each one of them we need to put the total input through the logistic function. But we can make some fast approximations that work just as well. The sum of the activities of a whole bunch of sigmoid units with offset biases, which is shown in that summation, is approximately equal to log(1 + e^x), and that in turn is approximately equal to max(0, x). We can also add some noise to the x if we want. So the first term in the equation looks like this, and the second term looks like that. You can see that the sum of all those sigmoids in the first term will be a curve like that, and we can approximate it by a linear threshold unit whose value is zero unless its input is above threshold, in which case its value increases linearly with the input.
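The chain of approximations here, sum of offset sigmoids ≈ softplus ≈ ReLU, is easy to check numerically. This sketch truncates the infinite family at 100 copies (a made-up cutoff) and compares the three curves at one input value:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid_sum(x, n_copies=100):
    """Sum of sigmoid copies with fixed bias offsets -0.5, -1.5, -2.5, ..."""
    offsets = -(np.arange(n_copies) + 0.5)
    return sigmoid(x + offsets).sum()

x = 3.7
stepped = stepped_sigmoid_sum(x)   # sum over the family of offset sigmoids
softplus = np.log1p(np.exp(x))     # log(1 + e^x) approximation
relu = max(0.0, x)                 # max(0, x) approximation

# For moderate x the three values agree closely; the ReLU form is by far
# the cheapest, needing no calls to the logistic function at all.
```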

Contrastive divergence learning works well for the sum of a bunch of stochastic logistic units with offset biases; in that case the noise added to the output of the sum has a variance given by the logistic function of the input. Alternatively, we can use that green curve and use rectified linear units. They're much faster to compute, because you don't need to go through the logistic function many times, and contrastive divergence works just fine with them.

One nice property of rectified linear units is that if they have a bias of zero, they exhibit scale equivariance, which is a very nice property to have for images. Scale equivariance means that if you take an image x and multiply all the pixel intensities by a scalar a, then the representation of ax in the rectified linear units is just a times the representation of x. In other words, when we scale up all the intensities in the image, we scale up the activities of all the hidden units, but all the ratios stay the same. Rectified linear units aren't fully linear, though: if you add together two images, the representation you get is not the sum of the representations of the two images taken separately.
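Both claims, scale equivariance for positive scalars and the failure of additivity, can be checked on toy vectors (the numbers below are made up):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Two tiny "images" and an intensity scale factor (all values made up).
x = np.array([1.0, -1.0, 2.0])
y = np.array([-1.0, 1.0, 0.5])
a = 2.5

# Scale equivariance: for a > 0, relu(a * x) == a * relu(x), so scaling
# the image scales every hidden activity while preserving all the ratios.
scaled = relu(a * x)
equivariant = a * relu(x)

# But ReLUs are not fully linear: relu(x + y) != relu(x) + relu(y)
# whenever the two inputs disagree in sign somewhere.
summed = relu(x + y)       # relu([0, 0, 2.5])  -> [0, 0, 2.5]
additive = relu(x) + relu(y)  # [1, 0, 2] + [0, 1, 0.5] -> [1, 1, 2.5]
```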

This property of scale equivariance is quite similar to the property of translational equivariance that convolutional nets have. If we ignore pooling for now, then in a convolutional net, if we shift an image and look at the representation, the representation of the shifted image is just a shifted version of the representation of the unshifted image. So in a convolutional net without pooling, translations of the input just flow through the layers of the net without really affecting anything: the representation at every layer is just translated.
