0:00

This video introduces the learning algorithm for a linear neuron. It is quite like the learning algorithm for a perceptron, but it achieves something different. In a perceptron, the weights are always getting closer to a good set of weights. In a linear neuron, the outputs are always getting closer to the target outputs.

0:25

The perceptron convergence procedure works by ensuring that when we change the weights, we get closer to a good set of weights. That kind of guarantee cannot be extended to more complex networks, because in more complex networks, when you average two good sets of weights you might get a bad set of weights. So for multilayer neural networks, we don't use the perceptron learning procedure, and to prove that something is improving as they learn, we don't use the same kind of proof at all. They should never have been called multilayer perceptrons. It's partly my fault and I'm sorry.

For multilayer nets we're going to need a different way to show that the learning procedure makes progress. Instead of showing that the weights get closer to a good set of weights, we're going to show that the actual output values get closer to the target output values. This can be true even for non-convex problems, in which averaging the weights of two good solutions does not give you a good solution. It's not true for perceptron learning: in perceptron learning, the outputs as a whole can get further away from the target outputs even though the weights are getting closer to good sets of weights.

1:45

The simplest example of learning in which the outputs get closer to the target outputs is learning in a linear neuron with a squared error measure. Linear neurons, which are also called linear filters in electrical engineering, have a real-valued output that is simply the weighted sum of their inputs. So the output y, which is the neuron's estimate of the target value, is the sum over all the inputs i of a weight w_i times an input x_i. We can write it in summation form, y = sum_i w_i x_i, or in vector notation, y = w . x, the scalar product of the weight vector and the input vector.
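The summation form and the vector form can be sketched in a few lines of Python; the weights and inputs here are made-up values for illustration:

```python
import numpy as np

# Hypothetical weight and input vectors for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

# Summation form: y = sum over i of w_i * x_i
y_sum = sum(wi * xi for wi, xi in zip(w, x))

# Vector notation: y = w . x (a dot product)
y_dot = w @ x

print(y_sum, y_dot)  # both give 4.5
```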

2:45

So one question is: why don't we just solve it analytically? It's straightforward to write down a set of equations, one equation per training case, and solve for the best set of weights. That's the standard engineering approach, so why don't we use it? The first answer, the scientific one, is that we'd like to understand what real neurons might be doing, and they're probably not solving a set of equations symbolically. The engineering answer is that we want a method we can then generalize to multilayer, nonlinear networks. The analytic solution relies on the system being linear and having a squared error measure. An iterative method, which we're going to see next, is usually less efficient, but much easier to generalize to more complex systems.
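For contrast, here is what the analytic approach looks like with NumPy's least-squares solver. The training cases are hypothetical portion counts modelled on the cafeteria example that follows, with assumed true prices:

```python
import numpy as np

# Hypothetical training cases: each row is one meal's portion counts
# (fish, chips, ketchup); the targets come from assumed true prices.
X = np.array([[2.0, 5.0, 3.0],
              [4.0, 1.0, 2.0],
              [1.0, 2.0, 5.0]])
t = X @ np.array([150.0, 50.0, 100.0])

# Solve the whole system in one shot, minimising the squared error.
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(np.round(w))  # recovers the assumed prices 150, 50, 100
```

This is exact and fast for a small linear problem, but it is precisely the step that has no analogue in a multilayer, nonlinear network.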

3:36

So I'm now going to go through a toy example that illustrates an iterative method for finding the weights of a linear neuron. Suppose that every day you get lunch at a cafeteria, and your diet consists entirely of fish, chips, and ketchup. Each day you order several portions of each, but on different days it's different numbers of portions. The cashier only shows you the total price of the meal, but after a few days you ought to be able to figure out the price of each portion of each kind of thing. In the iterative approach, you start with random guesses for the prices of the portions, and then you adjust these guesses so that you get a better fit to the prices the cashier tells you, which are the observed prices of whole meals.

4:27

So each meal you get a price, and that gives you a linear constraint on the prices of the individual portions. It looks like this: the price of the whole meal is the number of portions of fish, x_fish, times the cost of a portion of fish, w_fish, and the same for chips and ketchup: price = x_fish w_fish + x_chips w_chips + x_ketchup w_ketchup.

5:12

So let's suppose that the true weights the cashier is using to figure out the price are 150 for a portion of fish, 50 for a portion of chips, and 100 for a portion of ketchup. For the meal shown here, that leads to a price of 850. So that's going to be our target value.
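As a quick check of the arithmetic, the target price for a meal of two portions of fish, five of chips, and three of ketchup under those true prices is:

```python
# True per-portion prices and portion counts from the example.
true_w = {"fish": 150, "chips": 50, "ketchup": 100}
portions = {"fish": 2, "chips": 5, "ketchup": 3}

price = sum(true_w[k] * portions[k] for k in portions)
print(price)  # 850
```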

5:40

So for the meal with two portions of fish, five of chips, and three of ketchup, and with initial guesses of 50 for each kind of portion, we're going to initially think the price should be 500. That gives us a residual error of 350. The residual error is the difference between what the cashier says and what we think the price should be with our current weights. We're then going to use the delta rule for revising our prices of portions. We make the change in a weight, delta w_i, equal to a learning rate epsilon, times the number of portions of the i-th thing, times the residual error, the difference between the target and our estimate. So if we make the learning rate 1/35, so that the maths stays simple, then the learning rate times the residual error for this particular example is ten. And so our change in the weight for fish will be two times ten: we'll increase that weight by twenty. Our change in the weight for chips will be five times ten, and our change in the weight for ketchup will be three times ten.

6:56

That will give us new weights of 70, 100, and 80. And notice that the weight for chips actually got worse. There's no guarantee with this kind of learning that the individual weights will keep getting better. What's getting better is the difference between what the cashier says and our estimate.

7:21

We start by defining the error measure, which is simply our squared residual summed over all training cases: that is, the squared difference between the target and what the linear neuron predicts, summed over all training cases. And we put a one half in front, which will cancel the two when we differentiate: E = 1/2 sum_n (t_n - y_n)^2. We now differentiate that error measure with respect to one of the weights, w_i. To do that differentiation we need to use the chain rule. The chain rule says that how the error changes as we change a weight is how the output changes as we change the weight, times how the error changes as we change the output. The chain rule is easy to remember: you just cancel those two dy's, but you can only do that when there are no mathematicians looking.

8:17

The reason the first one, dy/dw_i, is written with a curly d is because it's a partial derivative: there are many different weights you can change to change the output, and here we're just considering the change to weight i. So dy/dw_i is actually equal to x_i, because y is just the sum over i of w_i times x_i. And dE/dy is just -(t - y), because when we differentiate that (t - y) squared and use the half to cancel the two, we get minus (t - y). So our learning rule is now that we change the weights by an amount equal to the learning rate epsilon times the derivative of the error with respect to a weight, dE/dw_i, with a minus sign in front because we want the error to go down. That minus sign cancels the minus sign in the line above, and we get that the change in a weight is the sum over all training cases of the learning rate times the input value times the difference between the target and actual outputs: delta w_i = sum_n epsilon x_i^n (t^n - y^n).
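Run over many training cases, that update rule is enough to recover the prices. A minimal batch-mode sketch, where the portion counts are hypothetical and the learning rate and iteration count are chosen by hand:

```python
import numpy as np

# Hypothetical meals: rows are training cases, columns are portion
# counts of fish, chips, and ketchup.
X = np.array([[2.0, 5.0, 3.0],
              [4.0, 1.0, 2.0],
              [1.0, 2.0, 5.0]])
t = X @ np.array([150.0, 50.0, 100.0])  # targets from assumed true prices

w = np.zeros(3)  # initial guesses
eps = 0.02       # learning rate, chosen small enough to be stable

for _ in range(500):
    y = X @ w                   # outputs for every training case
    w += eps * X.T @ (t - y)    # delta w_i = sum_n eps * x_i^n * (t^n - y^n)

print(np.round(w))  # converges towards 150, 50, 100
```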

9:49

There may be no perfect answer. It may be that we give the linear neuron a bunch of training cases with desired answers, and there's no set of weights that gives the desired answer on every case. But there's still a set of weights that gets the best approximation on all those training cases: it minimizes the error measure, summed over all training cases. And if we make the learning rate small enough and we learn for long enough, we can get as close as we like to that best answer.

Another question is how quickly we get towards the best answer. Even for a linear system, this kind of iterative learning can be quite slow. If two input dimensions are highly correlated, it's very hard to tell how much of the sum of the weights on both input dimensions should be attributed to each one. So if, for example, you always get the same number of portions of ketchup and chips, we can't decide how much of the price is due to the ketchup and how much is due to the chips. And if they're almost always the same, it can take a long time for the learning to correctly attribute the price to the ketchup and the chips.
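The ketchup-and-chips ambiguity is easy to demonstrate. In this made-up extreme case the two portion counts are identical in every training case, so the delta rule matches the targets perfectly while splitting the combined price evenly between the two correlated inputs:

```python
import numpy as np

# Hypothetical meals in which chips and ketchup portions are always
# equal, so their two input columns are identical.
X = np.array([[2.0, 3.0, 3.0],
              [1.0, 4.0, 4.0],
              [3.0, 1.0, 1.0]])
t = X @ np.array([150.0, 50.0, 100.0])  # assumed true prices

w = np.zeros(3)
eps = 0.01
for _ in range(2000):
    w += eps * X.T @ (t - X @ w)

print(np.round(w))            # fish is right; chips and ketchup each get 75
print(np.allclose(X @ w, t))  # True: the outputs still match the targets
```

The combined weight (75 + 75 = 150) is correct, so the outputs match the targets exactly, but the individual attribution is undetermined.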

There's an interesting relationship between the delta rule and the learning rule for perceptrons. If you use the online version of the delta rule, in which we change the weights after each training case, it's quite similar to the perceptron learning rule. In perceptron learning, we increment or decrement the weight vector by the input vector, but we only make a change when we make an error. In the online version of the delta rule, we also increment or decrement the weight vector by the input vector, but we scale that by both the residual error and the learning rate. And one annoying thing about this is that we have to choose a learning rate. If we choose a learning rate that's too big, the system will be unstable, and if we choose a learning rate that's too small, it will take an unnecessarily long time to learn a sensible set of weights.
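The effect of the learning rate can be seen directly. With assumed data, the same number of updates gives wildly different results depending on epsilon:

```python
import numpy as np

# Hypothetical meals (portion counts) and targets from assumed true prices.
X = np.array([[2.0, 5.0, 3.0],
              [4.0, 1.0, 2.0],
              [1.0, 2.0, 5.0]])
t = X @ np.array([150.0, 50.0, 100.0])

def remaining_error(eps, steps=200):
    """Total absolute error on the training cases after batch delta-rule training."""
    w = np.zeros(3)
    for _ in range(steps):
        w += eps * X.T @ (t - X @ w)
    return np.abs(t - X @ w).sum()

print(remaining_error(0.1))    # too big: the error explodes
print(remaining_error(1e-5))   # too small: barely any progress
print(remaining_error(0.02))   # about right: the error becomes tiny
```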
