0:00

Hello, friends. This is the last week of our adventure together, and it's natural to feel a bit sad, but let's end on a high note. We can do that in two ways. First, you can wear a bright, cheerful outfit, like this shirt that I'm wearing. And second, let's learn about two important forms of learning: supervised learning and reinforcement learning.

Let's start with supervised learning, which has become all the rage these days given the emergence of big data and the rise of machine learning.

Â 0:44

Let's begin with classification, a fundamental problem in machine learning. Suppose I gave you a bunch of images such as these, some containing faces, and others containing objects: the adventure hat; vehicles that will get your heart racing, such as these two, for different reasons; and logos, of course, that you love, such as this. The problem of classification is basically getting a machine to decide which of these images contain faces and which do not.

This is obviously a trivial problem for your brain. In fact, not only can your brain decide if an image contains a face, but the face will also probably trigger related associations, such as memorable Bollywood dancers in some cases, and not so memorable ones in others, or even memorable sentences, such as "I did not have sexual relations with that woman." But how can a machine solve this classification problem?

Â 1:40

Here's one way of tackling this problem. Suppose each of these green and red points represents one of our images. Obviously, these points exist in a very high-dimensional space. For example, if our images had a million pixels, then each of these points would exist in a million-dimensional space. But for simplicity, let's consider just this two-dimensional space. These points, these images, are labeled by the fact that either they're faces or they're not faces. So we can label the face images with a +1 and the images containing other objects with a -1.

Now that we have these labels associated with the set of face images and the set of non-face images, then if we're lucky, perhaps the face images cluster in one part of the space and the other images cluster in a different part of the space. If that's the case, can you think of a way of classifying new images? Say an image that is now in that location in this image space, and another image that is perhaps in this location. How would you classify this image here, and this image here?

One way of classifying these new points is to find a hyperplane that separates the face images from the non-face images. In this case, the hyperplane happens to be just a line, because we have just two dimensions. But in the general case of a very high-dimensional image space, we're going to find a hyperplane. Now what we can do is look at our new point, so this image. If it's above the separating line, as in this case, then we would label it with a +1, so we'd call it a face image. And if the new image is below the separating line, we would call it a non-face image, so we would label it with a -1.

Now, the question is, can neurons do something like this? Can neurons do classification?
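Before turning to neurons, the side-of-the-hyperplane test just described can be sketched in a few lines of code. This is my own illustrative sketch, not code from the course; the weight vector and threshold below are made-up values for a hypothetical two-dimensional image space.

```python
import numpy as np

# Hypothetical separating line w.x = mu in a 2-D "image space".
# These particular numbers are made up for illustration.
w = np.array([1.0, 2.0])   # normal vector of the separating line
mu = 0.5                   # offset (threshold)

def classify(point):
    """Label a new point +1 (face) if it lies above the line, else -1."""
    return 1 if np.dot(w, point) > mu else -1
```

A point such as (1, 1) lies above this particular line and gets labeled +1, while (-1, -1) lies below it and gets labeled -1.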

Â 3:48

Well, let's go back to the idealized model of a neuron that we discussed in the very first week of our course, where we assumed that the neuron simply sums up its inputs, and if the summation of all its inputs exceeds a threshold, then the neuron generates an output spike. Mathematically, what this means is that if the inputs are denoted by u_i and the synaptic weights are given by w_i, then this simple idealized model says that the neuron is going to generate an output spike if the weighted sum of all the inputs is bigger than some threshold. Let's call that threshold mu. So, if the weighted sum is bigger than this threshold mu, then we have an output spike. This simple model of a neuron in fact has a name, and the name is perceptron.

Â 4:48

The perceptron was originally proposed by Rosenblatt in the 1950s, building on the work of those pioneers of neural modeling, McCulloch and Pitts, from the 1940s. Here is a schematic depiction of a perceptron. The inputs are +1 or -1, denoting a spike or no spike. And as we discussed earlier, the perceptron computes the weighted sum of its inputs and compares it to its threshold: if the weighted sum is above the threshold, you have an output of +1, meaning a spike, and if the weighted sum is below or equal to the threshold, then we have an output of -1, or no spike.

Â 5:32

Now, here is an equation that defines the output of a perceptron. So v is either +1 or -1, and the output is determined by this function theta, where theta outputs +1 if its argument is bigger than 0, and -1 if its argument is less than or equal to 0. You can see now how writing this equation using the function theta implements exactly what we want. If the weighted sum minus the threshold mu is bigger than 0, it means that the sum is bigger than the threshold, which means you have an output of +1. And if the weighted sum is less than or equal to the threshold, then the summation minus mu is less than or equal to 0, which means that the output is going to be -1.

So what does a perceptron do? Well, let's set that weighted sum expression equal to 0. Now, what does this equation remind you of? What does this equation define in the n-dimensional space of the inputs?
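As a minimal sketch of the equation just described (my own rendering, not code from the course), the perceptron output v = theta(sum_i w_i u_i - mu) can be written as:

```python
def theta(s):
    """Step function: +1 if its argument is bigger than 0, else -1."""
    return 1 if s > 0 else -1

def perceptron(u, w, mu):
    """Perceptron output v = theta(sum_i w_i * u_i - mu)."""
    weighted_sum = sum(wi * ui for wi, ui in zip(w, u))
    return theta(weighted_sum - mu)
```

For example, with weights [0.5, 0.5] and threshold 0.1, the input (+1, +1) gives a weighted sum of 1.0 above threshold, so the output is +1 (a spike), while (+1, -1) gives 0.0 below threshold, so the output is -1.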

Â 6:47

You're right, the equation defines a hyperplane. Or, in the special case of two-dimensional inputs, the equation defines a straight line. What's more, all the input points above the line satisfy the property that the weighted sum is bigger than the threshold, because for all these inputs, the left-hand side of this equation is going to be bigger than 0. And all the points below the line satisfy the property that the weighted sum is less than the threshold, because the left-hand side in those cases will turn out to be less than 0.

What this means is that the perceptron is going to have an output of +1 for all the inputs on one side of the hyperplane and an output of -1 for all the inputs on the other side. In other words, the perceptron can separate inputs from one class, let's say class 1, from the inputs from another class, let's say class 2. So you know what that means: perceptrons can classify. In other words, they can perform linear classification. Linear because they use a line, or a hyperplane, to separate one class of points from the other.

Â 8:14

So here's the supervised learning problem for the perceptron. We're given a set of inputs that are labeled: the red points here are labeled +1, denoting class 1, and the green points are labeled -1, denoting class 2. The problem is, how do we learn the weights and the threshold for the perceptron, given these inputs and their labels? In other words, how do we find a separating hyperplane by adjusting the weights and the threshold?

Â 8:48

You guessed right. There's a learning rule for perceptrons, and it involves adjusting the weights and the threshold according to the output error. The output error is given by v_d - v, where v_d denotes the desired output, or the label that we get with each input, and v denotes the output of the perceptron. Here are the update rules for the weights and the threshold. Epsilon, as you will recall, is the learning rate: a positive constant that determines how fast the weights are adapted.

Let's see if we can understand this weight update rule in the case where the input is positive. In this case, the learning rule, you can see, increases the weight if the error is positive. So what does that mean? It means that v_d was +1 and the output of the perceptron was -1. In order to do the correct thing in this case, that is, generate an output of +1, the perceptron needs to increase the weighted sum so that it's above the threshold. It can do that by increasing the weight. And so we can see now that the learning rule is doing the right thing in this particular case.

Now what if the error was negative? In that case, you can see that this learning rule is going to decrease the weight. So is that the right thing to do? Well, if the error is negative, it means that the desired output, the label, was -1, and the output of the perceptron must have been +1, so that gives you a negative error. In this case, what we want the perceptron learning rule to do is to make the output, which is +1, be a -1 output. You can make the output -1 by decreasing the weighted sum to be below the threshold. And that's in fact what the learning rule does: it decreases the weight w_i, which in turn makes the weighted sum eventually go below the threshold. The learning rule does the opposite for the case where u_i is negative, and you should be able to convince yourself that that's the right thing to do.

In the case of the threshold, the update rule decreases the threshold if the error is positive, and increases the threshold if the error is negative. To see that this is again the right thing to do: when the error is positive, it means that v_d was +1 and the output of the perceptron was -1. And so you can see that when you decrease the threshold, this in turn encourages the output of the perceptron to go from -1 to +1, because now the threshold has been decreased. So that, again, is doing the correct thing. Similarly, when the error is negative, you must have had the case that the desired output was -1 and the perceptron's output was +1. By increasing the threshold, we are now encouraging the perceptron to not have the output +1: the weighted sum will go below the threshold, because the threshold is now being increased. And so once again, that's the right thing to do to make sure that the perceptron's output matches the desired output.
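The update just discussed can be sketched as follows. This is my own minimal rendering of the rule; the learning rate value is an arbitrary choice for illustration.

```python
def perceptron_update(w, mu, u, v, v_d, epsilon=0.1):
    """One step of the perceptron learning rule:
    w_i <- w_i + epsilon * (v_d - v) * u_i   (weights)
    mu  <- mu  - epsilon * (v_d - v)         (threshold)
    """
    error = v_d - v
    w = [wi + epsilon * error * ui for wi, ui in zip(w, u)]
    mu = mu - epsilon * error
    return w, mu
```

When v_d = +1, v = -1, and the input u_i is positive, the error is +2, so the weight increases and the threshold decreases, exactly as argued above.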

Â 12:04

That's great. Now that we have a learning rule for the perceptron, you're probably asking yourself: can perceptrons learn any function? Well, let's look at the exclusive-or, or XOR, function. Here is the table for the XOR function. As you already know from your logic or mathematics classes, the XOR function gives you an output of +1 only when the two inputs differ from each other; otherwise you have an output of -1. And here's a graphical depiction of the XOR function. Here is the two-dimensional space of the inputs, and you can see that the two inputs that give you an output of +1 are denoted by these red points. The green points denote the two inputs that will give you a -1 output according to the XOR function. The question that I would like to ask you is: can a perceptron learn to separate the +1 inputs from the -1 inputs? In other words, can a perceptron learn the XOR function?

Â 13:13

The answer, as you might have guessed, is no, unfortunately it cannot. Perceptrons can only classify linearly separable data. So what if you really like the perceptron model very much, because it's a simple model of a neuron, and perhaps you really love that name, perceptron? How do we keep the perceptron model and still handle linearly inseparable data? The answer, of course, is to use multiple layers of neurons, and this gives us multilayer perceptrons. These can classify linearly inseparable data. For example, we can use this two-layer perceptron to compute the XOR function, and I encourage you to substitute the different values for u1 and u2 to verify that this two-layer perceptron does indeed compute the XOR function correctly.
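One such two-layer solution can be sketched in code. Note that the weights and thresholds below are hand-picked by me for illustration and need not match the ones on the lecture slide.

```python
def theta(s):
    """Step function: +1 if s > 0, else -1."""
    return 1 if s > 0 else -1

def xor_net(u1, u2):
    """A two-layer perceptron computing XOR on +/-1 inputs.
    Weights and thresholds are an illustrative hand-picked choice."""
    h1 = theta(u1 - u2 - 0.5)    # fires only for (u1, u2) = (+1, -1)
    h2 = theta(u2 - u1 - 0.5)    # fires only for (u1, u2) = (-1, +1)
    return theta(h1 + h2 + 0.5)  # fires if either hidden unit fired
```

Substituting all four input combinations confirms that the output is +1 exactly when the two inputs differ.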

Â 14:14

Now, what if you want continuous outputs, rather than the +1 and -1 outputs you obtained from the perceptron? In other words, what if you want to do regression rather than classification? One example where this might be applicable would be in teaching a network to drive a truck. In this case you might be mapping images of the road, and pedestrians, and bicyclists, and so on, to appropriate steering angles for the truck. You might argue that in this case you could get away with using classification, by mapping the +1 outputs to swinging to the left and the -1 outputs to swinging to the right. And I've actually seen drivers in Seattle practicing this kind of behavior. But to be safe, it's better to use regression and map the inputs to appropriate continuous steering angles.

Â 15:07

We can get continuous outputs from our network if we use sigmoid functions for the outputs of our neurons. In the case of the perceptron we used the threshold function theta; if we instead use continuous-valued functions such as the sigmoid function, then we can get continuous outputs from our network. Here's the mathematical expression for the sigmoid function, and here is a graphical depiction of it. You'll notice that the sigmoid takes input values between minus infinity and plus infinity and maps them to values between 0 and 1. One can therefore interpret the output of the sigmoid as the firing rate of the neuron, where the firing rate lies between a minimum value of 0 and a maximum firing rate value that has been normalized to 1. For example, if a neuron has a maximum firing rate of 100 hertz and a minimum firing rate of 0 hertz, then we can normalize the output firing rate of the neuron by dividing each firing rate by 100, and that would make the range of the firing rate be between 0 and 1, as in the case of the sigmoid.

The parameter beta, which appears here in the sigmoid function, controls the slope of the sigmoid. For example, if beta is large, then the sigmoid approaches a threshold function, like the theta function we had in the perceptron. And when beta is small, the sigmoid looks more like a linear function.
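As a sketch, assuming the standard logistic form of the sigmoid matches the lecture's expression, the function and its slope parameter beta look like this:

```python
import math

def sigmoid(s, beta=1.0):
    """g(s) = 1 / (1 + exp(-beta * s)); beta controls the slope.
    Large beta approaches the step function theta; small beta is nearly linear."""
    return 1.0 / (1.0 + math.exp(-beta * s))
```

At s = 0 the output is 0.5, and with a large beta such as 50 the output jumps from near 0 to near 1 around s = 0, much like the perceptron's theta function.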

Â 16:54

Let's see if we can learn multilayer sigmoid networks for regression. Why multilayer? Well, if you have a single layer of neurons, then the network is not going to be very powerful, as we saw in the case of the XOR function. So let's consider the case where we have three layers: the input layer here, a hidden layer of neurons, and then the output layer. Here is what the network does. The network takes a weighted sum of its inputs, given by this expression here, and the weighted sum is then passed through the sigmoid function, resulting in an output. Let's call that x_j; that is the output here in the hidden layer, so this would be x_1, x_2, and so on. These outputs are then transformed by the weights from the hidden layer to the output layer, given by this weighted sum, the sum over j of W_ij times x_j. That in turn is passed through the sigmoid function again to give you the output of the network for each individual neuron in the output layer.

Note that in this network, we're using only one hidden layer of neurons. If you use many hidden layers, then you get what are called deep networks. These deep networks have received a lot of attention recently because they've been shown to learn more and more complex features in the deeper layers of the network, and that in turn allows the deep network to learn complex functions. If you're interested in learning more about these deep networks and deep learning, I'd encourage you to Google "deep networks" and find out the details.

Let's now focus on this three-layer network and try to figure out how to learn its weights. Remember that we're also given the desired output d for each input u, because this is a supervised learning problem. So how would you change these weights in order to get your network to produce the desired outputs d?
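The forward pass just described, x_j = g(sum_k w_jk u_k) for the hidden layer and v_i = g(sum_j W_ij x_j) for the output layer, can be sketched as follows. The array shapes in the comments are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(u, w, W):
    """Three-layer network forward pass.
    u: input vector; w: input-to-hidden weights; W: hidden-to-output weights.
    x_j = g(sum_k w_jk u_k)   -- hidden-layer activities
    v_i = g(sum_j W_ij x_j)   -- network outputs
    """
    x = sigmoid(w @ u)  # hidden layer
    v = sigmoid(W @ x)  # output layer
    return x, v
```

For instance, with a 3-dimensional input, 4 hidden neurons, and 2 output neurons, w would be a 4x3 matrix and W a 2x4 matrix.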

Â 19:10

Here's one way we can do that: we can minimize the output error. Here is an example of an error function, E. It's a function of both the big W and the little w, and it's simply the sum, over all the output neurons, of the square of the output error, d_i minus v_i. So I'd like to ask you: how would you minimize this error function with respect to the big W and the little w? Let me give you a hint: perhaps you can use the gradient of the error function with respect to the weights.

Â 19:48

That's right, you can use gradient descent to minimize this error function. Here is how you could do that for the case of the big W, the weights from the hidden layer to the output layer. Delta W_ij is going to be equal to epsilon times the negative of the gradient of E with respect to W_ij, where epsilon, as before, is a small positive constant known as the learning rate. And if you take the derivative of E with respect to W_ij, you're going to get this expression here. This weight update rule is known as the delta rule, and that's for historical reasons: this error, this difference between the desired output and the actual output, has been called delta, and therefore this learning rule is called the delta rule.
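As an illustrative sketch of the delta rule, assuming sigmoid output units and absorbing constant factors from the squared error into the learning rate (the lecture's slide may write the expression slightly differently):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def delta_rule_step(W, x, d, epsilon=0.1):
    """One delta-rule update for the hidden-to-output weights:
    Delta W_ij = epsilon * (d_i - v_i) * g'(s_i) * x_j,
    where s_i = sum_j W_ij x_j and g is the sigmoid.
    Constant factors from the squared error are absorbed into epsilon."""
    v = sigmoid(W @ x)
    g_prime = v * (1.0 - v)      # sigmoid derivative, written in terms of v
    delta = (d - v) * g_prime    # the "delta" that gives the rule its name
    return W + epsilon * np.outer(delta, x)
```

When the desired output d_i exceeds the actual output v_i and the hidden activity x_j is positive, the weight W_ij increases, pushing v_i toward d_i.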

Â 21:46

The answer, of course, lies in the chain rule from calculus. The chain rule tells us that we can take a derivative such as dE/dw_jk and write it as the product of two derivatives: dE/dx_j times dx_j/dw_jk. And now you can see that both of these derivatives can be computed. The first one can be computed from the expression for E; the second one can be computed from the expression for x_j, which is the activity of the j-th hidden-layer neuron. When we plug this product of derivatives into the weight update rule, we get the very famous backpropagation learning rule for multilayer networks. You can see why it's called backpropagation: we are propagating the errors from the output layer all the way down to the hidden layer, and in the case of many hidden layers, you can generalize this chain rule to apply to more than just one hidden layer, propagating the errors from the output layer down to all of the hidden layers of the network. I'll encourage you to look at the supplementary materials for the actual derivation of the backpropagation learning rule and the expressions that we get as a result of taking these derivatives.
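A minimal sketch of one backpropagation step for this three-layer network, assuming the squared-error E and sigmoid units described above, with constant factors again absorbed into the learning rate:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(u, d, w, W, epsilon=0.1):
    """One gradient-descent step on E = sum_i (d_i - v_i)^2.
    The hidden-layer gradient uses the chain rule:
    dE/dw_jk = (dE/dx_j) * (dx_j/dw_jk)."""
    # Forward pass
    x = sigmoid(w @ u)                            # hidden activities x_j
    v = sigmoid(W @ x)                            # outputs v_i
    # Output layer: the delta rule
    delta_out = (d - v) * v * (1.0 - v)
    # Hidden layer: propagate the output errors back through W
    delta_hid = (W.T @ delta_out) * x * (1.0 - x)
    W_new = W + epsilon * np.outer(delta_out, x)
    w_new = w + epsilon * np.outer(delta_hid, u)
    return w_new, W_new
```

Repeating this step on a fixed input-output pair should steadily shrink the squared error, since each step moves the weights down the error gradient.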

Â 23:17

Okay, after all that hard work, this is where the rubber hits the road, if you'll pardon the pun. We're going to use backpropagation to drive a truck. And since our lawyers will not allow us to use backpropagation to drive a real truck, we're going to use a simulation that was created by Keith Grochow, who was a student at the University of Washington several years ago.

The specific task for the network is to learn to back a truck into a loading dock. Here is the truck, in green, and here is the loading dock. What we are trying to do is to train the network, given both inputs and desired outputs, to back this truck into this space here, which is the loading dock. The input to the network is going to be the position x and y in two dimensions, as well as the orientation theta of the truck, and the output that we would like to get from the network is the steering angle, so that the truck can back into this space denoting the loading dock. The training data for the network is provided by a human backing this simulated truck into the loading dock: the human provides the steering angles for different positions and orientations of the truck.

So the question then is: can the network, given this data, learn to back a truck, on its own, into the loading dock? Well, what do you think, do you think it can do it? Let's see, here we go. Here's the truck during the very early stages of learning. As you can see, it's not doing very well. In fact, it's driving like a maniac. And it reminds me of some crazy drivers that I saw when I was growing up in the Indian city of Hyderabad. Well, let's see if we can help the truck a little bit by training it some more on the human data. Here we go. It's now gone through 4,000 passes of the human data, and you can see that it's doing a little bit better; it's getting closer to the loading dock. Now let's train it some more, and see if it can actually get there. Yes, it is actually getting very close to the loading dock.

So what do you think? Would you let this network drive your car or truck? Well, I wouldn't.