0:05

[MUSIC]. Hello neuro explorers.

Last week, we learned how neurons can be connected to form feedforward and recurrent networks. This week, we learn how these connections can be adapted using synaptic plasticity, allowing the brain to learn about the world from its inputs. Now, what better way to start this journey than to gaze upon this beautiful drawing of the hippocampus by the great Ramón y Cajal from a hundred years ago. It was in the hippocampus that some of the first results on synaptic plasticity were obtained.

One type of synaptic plasticity that's observed in the brain is long-term potentiation, or LTP. LTP is defined as an experimentally observed increase in the synaptic strength from some neuron A to another neuron B that can last for several hours or even days. And the way you would induce LTP is by causing some neuron A to fire a burst of spikes, where neuron A is connected to some other neuron B that is also excited: it's depolarized or it's firing some spikes. Then what you would see is an increase in the size of the excitatory postsynaptic potential (EPSP). So that means that for the same input, initially you might have a small EPSP, but after you pair neuron A with neuron B several times, what you'd observe is an increase in the size of the EPSP, which indicates that the strength of the connection from neuron A to neuron B has been increased.

The counterpart to long-term potentiation, or LTP, is long-term depression, or LTD. LTD corresponds to an experimentally observed decrease in the synaptic strength that lasts for hours or days, and you can obtain LTD in the following situation: neuron A fires some spikes, but neuron B does not fire any spikes and does not depolarize. So in this situation, when you have some input from neuron A but no output coming from neuron B, what you observe is a decrease in the EPSP size. So if the initial EPSP for a single input was as shown at the very top here, then as the pairing occurs, where you have some input from A but no output from B, you would expect to see a decrease in the size of the EPSP. And what that implies is that the connection strength from neuron A to neuron B has been decreased.

Â 2:33

Now here's something that's interesting. Even before LTP and LTD were discovered in the brain, a Canadian psychologist named Donald Hebb predicted that something like LTP should occur in the brain. He suggested a learning rule for how neurons in the brain should adapt the connections among themselves, and this learning rule has been called Hebb's learning rule, or the Hebbian learning rule. Here's what it says: if a neuron A repeatedly takes part in firing another neuron B, then the synapse from A to B should be strengthened.

And here is a cartoon of what this learning rule implies. Suppose we have a neuron A that is firing, and that in turn is participating in the firing of another neuron, so that neuron B produces, for example, one or a few spikes. If this situation occurs, Hebb's learning rule predicts that one ought to increase the strength of the connection from neuron A to neuron B, because neuron A is participating in the firing of neuron B. And so what we then get is that for the same input from neuron A, along with the input from other neurons, you have an increase in the activity: you have more spikes from neuron B. So another way of phrasing Hebb's learning rule is through that famous mantra that you already heard during the first week of lectures, and that is that neurons that fire together wire together.

Â 4:08

Now mantras are great for chanting, but they're hard to implement on a computer. So let's see if we can formalize Hebb's rule as a mathematical model.

Let's start with a linear feedforward neuron. Here's the neuron with an output v; it's receiving some inputs, which we're calling the input vector u, and the synaptic weights from the inputs to the output neuron are given by a synaptic weight vector w. So this is very similar to the feedforward networks that we considered in the previous set of lectures last week. Now, if we assume that the dynamics of this network, that is, of the firing rate, is fast, then we can look at the steady-state output, and that's given by this equation: the output firing rate of the neuron is nothing but the dot product of the synaptic weights with the inputs. You can write it as a dot product, or you can write it as w transpose u, or you can write it as u transpose w.

Now, here's how you can write Hebb's rule mathematically. You can use a differential equation to capture how the weights from the input neurons to the output neuron change as a function of time. There is some time constant tau sub w that governs how fast the weights are changing, and we set tau sub w dw/dt to be equal to the product of the input firing rates and the output firing rate. So how does this capture the intuition behind Hebb's rule? Remember that in Hebb's rule, we increase the strength of the connection from an input neuron A to an output neuron B if there is both activity from neuron A as well as activity from neuron B. And this product of the input firing rates with the output firing rate captures that intuition.

Â 6:01

Now, in order to implement this differential equation on a computer, you need to discretize it. And if you look at the discrete implementation of this differential equation, it leads you to a weight update rule, which is shown here. So this is how you update the weights given inputs. The weight update rule tells you that the weights at time step i plus 1 are given by the weights at time step i plus some epsilon, a positive constant called the learning rate, multiplied by u times v. Another way of expressing this equation is to say that the change in the weights, delta w, is equal to the learning rate epsilon times uv.

In order to understand the Hebb rule, it is useful to look at the average effect of this rule on the synaptic weights w. So here is the Hebb rule from the previous slide, and if we want to look at the average effect of this rule, we can take the average of the right-hand side with respect to all the inputs u. These brackets over here denote the average. And if we now substitute the value for v from the previous slide, then what we find is that the Hebb rule modifies the weights w according to the input correlation matrix, where the correlation matrix, as you might know, is given by simply the average of u u transpose. So what does this mean? What does it mean to change the weights w according to the input correlation matrix? Well, think about that for a minute.
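The discrete weight update just described can be sketched directly in code. Here the input distribution, learning rate epsilon, initial weights, and number of steps are all illustrative choices, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative zero-mean 2-D inputs with correlated components.
u_samples = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=5000)

epsilon = 1e-4             # learning rate (illustrative)
w = np.array([1.0, 0.0])   # initial weight vector (illustrative)

for u in u_samples:
    v = w @ u                  # linear neuron: v = w . u
    w = w + epsilon * u * v    # basic Hebb rule: delta w = epsilon * u * v
```

On average this update follows delta w proportional to Q w, with Q the average of u u transpose, so the weight vector grows along the dominant eigenvector of the input correlation matrix.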

Â 7:46

Well, the Hebb rule that we've been discussing so far only increases synaptic weights, and this models the phenomenon of LTP, or long-term potentiation, in the brain. But as we discussed earlier, the brain also exhibits LTD, or long-term depression, which involves decreasing the strength of the connection from one neuron to another. Now, can we model both LTP and LTD using a single learning rule? In other words, can we derive a learning rule that can both increase and decrease the strength of a synaptic connection?

One rule that incorporates both LTP and LTD is the covariance rule, and we'll come to why it's called that in just a minute. Here is the differential equation for the covariance rule, and you'll notice that it is again a product of the input firing rates with the output firing rate, except that now the output firing rate appears in a difference term: the difference between the output firing rate and the average output firing rate. So what is the effect of this difference term? Consider the case when the output firing rate is bigger than the average output firing rate. In this case you're going to have a positive quantity here, which means that when you multiply the input firing rates by a positive quantity, you're going to have an increase in the synaptic strength, and that is going to result in LTP. On the other hand, if the output firing rate is low, for example less than the average output firing rate, or even in the case where there is no output at all, so v could be 0, then what you're going to get is a negative quantity here. And when you multiply the input firing rates by a negative quantity, you're going to get a decrease in the synaptic weights, and that results in LTD.

So what does the covariance rule do? Well, just as we did with the Hebb rule, we can look at the average effect of this rule. That means taking the average of the right-hand side of the rule with respect to all the inputs u. And if you substitute the value for v and simplify these expressions, what you get is the fact that the covariance rule is changing the weight vector w according to, surprise, surprise, the input covariance matrix. So here's the input covariance matrix: it's simply the average of u u transpose minus the average of u times the average of u transpose. At this point, I would like you to think about what it means for w to be changed according to the input covariance matrix. What do you think w would converge to when it's modified according to this equation? We will answer that question towards the end of the lecture.
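One way to sketch the covariance rule in code is to keep a running estimate of the average output firing rate. The lecture doesn't specify how the average of v is obtained, so the exponential running average below, along with the inputs and learning rate, is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative inputs with a nonzero mean, so correlation and covariance differ.
u_samples = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.6], [0.6, 0.5]], size=5000)

epsilon = 1e-4
w = np.array([1.0, 0.0])
v_avg = 0.0  # running estimate of the average output firing rate <v>

for u in u_samples:
    v = w @ u
    v_avg = 0.99 * v_avg + 0.01 * v       # slow estimate of <v> (illustrative)
    w = w + epsilon * u * (v - v_avg)     # covariance rule: delta w = eps * u * (v - <v>)
```

Once the running average has settled, the average update is epsilon times C w, where C is the input covariance matrix, so the weight vector follows the covariance structure of the inputs rather than their raw correlation.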

Â 10:32

Now let's ask the question: are these learning rules stable? In other words, does w converge to a stable value, or does it explode? How do we answer this question? Well, one could look at the length of w as a function of time and see if the length of w remains bounded or grows without any bound.

Let's first look at the Hebb rule. Here is the Hebb rule, and let's look at how the length of w squared changes as a function of time. Let's take the derivative of the length of w squared with respect to time; when we do that, we get this expression here. If we substitute the value for dw/dt according to the Hebb rule, we have this expression. And note that w transpose u here is nothing but the output firing rate v, and if we substitute that value here, we get this expression.
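For reference, the chain of substitutions just described, using the Hebb rule $\tau_w\,d\mathbf{w}/dt = \mathbf{u}\,v$ and $v = \mathbf{w}^\top\mathbf{u}$, can be written out as:

```latex
\tau_w \frac{d\|\mathbf{w}\|^2}{dt}
  = 2\,\mathbf{w}^\top\!\left(\tau_w \frac{d\mathbf{w}}{dt}\right)
  = 2\,\mathbf{w}^\top(\mathbf{u}\,v)
  = 2v^2 \;\ge\; 0.
```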

Now, unless v is always equal to 0, this expression is going to be positive. And so what we then have is the fact that the derivative of the length of w squared with respect to time is always positive. What does that mean? It means that the length of w is going to keep increasing, which means that w grows without bound. Well, you might be thinking that's not too surprising, because the Hebb rule only increases synaptic weights. It only models LTP, and so perhaps that's why w grows without bound. Well, if that's the case, then what about the covariance rule? As we discussed, the covariance rule incorporates both LTP and LTD, and therefore it can both increase synaptic weights as well as decrease synaptic weights, and perhaps that makes the covariance rule stable. What do you think? Do you think it's stable?

Â 12:27

Well, here's the answer, and I'm sorry to say that it's not good news. If you take the derivative of the length of w squared with respect to time as before, simplify the resulting expression, and then take the average of the right-hand side of that expression, what you find is that the derivative of the length of w squared with respect to time is always positive. And what that means is that the length of w, when changed according to the covariance rule, grows without any bound, which means that w grows without any bound.

So how do we stabilize the Hebb rule and the covariance rule? Well, one way you can do that is by enforcing a constraint on the synaptic weight vector w. What kind of a constraint can we impose? You could impose the constraint that the length of w should always be equal to 1. And how do we do that? Each time we update the weight vector according to a new input, we simply divide the resulting weight vector by its length. And this ensures that the length of the weight vector always equals 1.
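In code, this explicit normalization is one extra line after the basic Hebb update; the inputs, learning rate, and initial weights are again illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
u_samples = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=2000)

epsilon = 0.01
w = np.array([1.0, 0.0])

for u in u_samples:
    v = w @ u
    w = w + epsilon * u * v
    w = w / np.linalg.norm(w)  # explicit constraint: keep ||w|| = 1

# The length of w is pinned at 1, so the rule can no longer blow up.
```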

Now, this seems like a hack, and perhaps it's not even biologically plausible. So is there a more elegant way of imposing a constraint on the length of the weight vector?

Let's look at the last of our Hebbian learning rules, and this one's called Oja's rule, named after its discoverer. Oja's rule is similar to the Hebb rule in that we again multiply the input firing rates by the output firing rate, except that now we subtract a term alpha v squared w from u times v, where alpha is some positive value. Now the question is: is Oja's rule stable? What do you think? Well, let's do what we did before, which is take the derivative of the length of w squared with respect to time. When we do that, we get this differential equation for the length of w squared. So, looking at this differential equation, do you think that the length of w squared converges to a particular value, or do you think that it grows without bound?

Â 14:40

Well, here's the answer. The length of w squared in fact does converge to a particular value, and it converges to the value 1 over alpha. You can see that by setting the derivative equal to zero: in that case, unless v is equal to 0, we have the fact that the length of w squared is equal to 1 over alpha, because this term over here has to be equal to 0. And if that's the case, then the length of w itself must be equal to 1 over the square root of alpha. So what this tells us is that w for Oja's rule does not grow without bound, which means that the rule is stable.
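Written out, with Oja's rule $\tau_w\,d\mathbf{w}/dt = \mathbf{u}\,v - \alpha v^2 \mathbf{w}$ and $v = \mathbf{w}^\top\mathbf{u}$, the fixed-point argument is:

```latex
\tau_w \frac{d\|\mathbf{w}\|^2}{dt}
  = 2\,\mathbf{w}^\top\!\left(\mathbf{u}\,v - \alpha v^2 \mathbf{w}\right)
  = 2v^2\left(1 - \alpha \|\mathbf{w}\|^2\right),
```

which is zero (for $v \neq 0$) exactly when $\|\mathbf{w}\|^2 = 1/\alpha$, i.e. $\|\mathbf{w}\| = 1/\sqrt{\alpha}$.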

Okay, let's summarize what we've learned so far about Hebbian learning. The basic Hebb rule involves multiplying the input firing rates by the output firing rate, and this models the phenomenon of LTP in the brain. We found out that this learning rule is unstable unless we impose a constraint on the length of w after each weight update. The covariance rule involves multiplying u by v minus the average value of v, which means that we can now model both LTP and LTD. But we found out that that's not sufficient to make the learning rule stable, so this rule is also unstable unless we impose a constraint on the length of w. And finally, we considered Oja's rule, and we found out that Oja's rule is in fact stable, and the length of the weight vector converges to the value 1 over the square root of alpha.
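As a numerical check on this summary, here is a sketch of Oja's rule, discretized as delta w = epsilon (u v minus alpha v squared w); the input distribution, epsilon, and alpha are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
u_samples = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=20000)

epsilon = 0.005
alpha = 4.0               # so ||w|| should settle near 1/sqrt(alpha) = 0.5
w = np.array([1.0, 0.0])

for u in u_samples:
    v = w @ u
    w = w + epsilon * (u * v - alpha * v**2 * w)  # Oja's rule

print(np.linalg.norm(w))  # stays close to 1/sqrt(alpha)
```

Unlike the basic Hebb rule, no explicit normalization step is needed; the decay term alpha v squared w does the bounding automatically.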

Â 16:14

Okay, we've arrived at the finale of the lecture, where we're going to answer the question: what does Hebbian learning do, anyway? We're going to start with the averaged Hebb rule. As you recall, the averaged Hebb rule is given by this differential equation, where Q is the input correlation matrix. What we would like to do is solve this differential equation to find w(t). So what is w as a function of time when it's being changed according to this differential equation? How do we solve this equation? Any ideas? Well, if you guessed eigenvectors, you would be right. We can always rely on our dear friends, the eigenvectors.

So, as before, let's write our vector w(t) in terms of the eigenvectors of the correlation matrix. Recall that the input correlation matrix is a real and symmetric matrix, which means that its eigenvectors are orthonormal, which means that we can write any vector, including the vector w(t), as a linear combination of the eigenvectors. Now, if we substitute our expression for w(t) into the differential equation for the averaged Hebb rule, we can simplify as before and obtain a differential equation for the coefficients. When we solve the differential equation for a coefficient, let's say ci, we have this solution. And when we substitute this solution into our expression for w(t), we get this solution for the weight vector as a function of time.

So, what is this equation telling us about the synaptic weight vector w as a function of time? It's telling us that the synaptic weight vector w is a linear combination of the eigenvectors of the input correlation matrix. And furthermore, it's telling us that the coefficients of these eigenvectors have terms that are exponentially dependent on the eigenvalues of the correlation matrix. So what do you think will happen to w as time goes on? When t becomes very large, the term with the largest eigenvalue, let's say lambda 1 is the largest eigenvalue, dominates this linear combination. So what we get then is the result that the weight vector turns out to be proportional to the first eigenvector, or principal eigenvector, of the input correlation matrix.
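In symbols, writing $Q\mathbf{e}_i = \lambda_i \mathbf{e}_i$ and $\mathbf{w}(t) = \sum_i c_i(t)\,\mathbf{e}_i$ for the averaged Hebb rule $\tau_w\,d\mathbf{w}/dt = Q\mathbf{w}$, the steps just described are:

```latex
\tau_w \frac{dc_i}{dt} = \lambda_i c_i
\quad\Rightarrow\quad
c_i(t) = c_i(0)\,e^{\lambda_i t/\tau_w},
\qquad
\mathbf{w}(t) = \sum_i c_i(0)\,e^{\lambda_i t/\tau_w}\,\mathbf{e}_i,
```

so for large $t$ the $\lambda_1$ term dominates and $\mathbf{w}(t)$ becomes proportional to $\mathbf{e}_1$.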

And furthermore, if we're using Oja's rule, then as you know, the length of the weight vector converges to 1 over the square root of alpha. So in that case, the weight vector approaches the value e1 divided by the square root of alpha.

We've actually shown something very exciting. We've shown that the brain can actually do statistics, and that's in addition to what we showed last week, which was that the brain can do calculus. There seems to be no stopping the brain. Well, let's look at why we think the brain does statistics. It turns out the Hebbian learning rule that we just analyzed implements the same thing as the statistical technique of principal component analysis, or PCA.

To understand what principal component analysis is all about, let's look at a simple example. Here is some two-dimensional data. We have these points, which represent the values u1 and u2 that comprise the input vector u. And if we start the Hebb rule with an initial weight vector given by this dashed line, then the Hebb rule rotates this initial weight vector to align itself with the direction of maximum variance. So here is the cloud of data, and the final weight vector is going to be parallel to this line, which is the direction of maximum variance.

Now, when we apply the Hebb rule to data that has been shifted, so the data from here has been shifted to a different location, let's say with input mean (2, 2), we find that the Hebb rule does not do what we want it to do: it finds this direction, going through the origin of this two-dimensional plot, as the direction of maximum variance. And that is really not the direction of maximum variance; the direction of maximum variance is given again by this direction. But luckily, when we apply the covariance rule, we find that it does indeed find the direction of maximum variance. It has taken care of the fact that the input mean is no longer (0, 0) but (2, 2), and that is accounted for by the covariance rule. So the covariance-based Hebb rule is able to find again the direction of maximum variance.
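The contrast between the two rules on shifted data can be seen directly from their averaged forms: the Hebb rule follows the correlation matrix Q, while the covariance rule follows the covariance matrix C. The means, covariances, and sample counts below are illustrative, chosen so that the mean direction and the maximum-variance direction disagree:

```python
import numpy as np

rng = np.random.default_rng(4)

# 2-D data shifted away from the origin: mean roughly (2, 2),
# but maximum variance along the (1, -1) direction.
u_samples = rng.multivariate_normal([2.0, 2.0], [[1.0, -0.8], [-0.8, 1.0]], size=5000)

def principal_direction(M):
    """Unit eigenvector with the largest eigenvalue of a symmetric matrix M."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, -1]

# Averaged Hebb rule follows Q = <u u^T>: dominated by the mean for shifted data.
Q = (u_samples.T @ u_samples) / len(u_samples)
hebb_dir = principal_direction(Q)

# Averaged covariance rule follows C = <u u^T> - <u><u>^T.
C = np.cov(u_samples.T, bias=True)
cov_dir = principal_direction(C)

# hebb_dir points roughly along (1, 1), toward the data mean,
# while cov_dir recovers the true variance direction, roughly (1, -1).
```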

Â 21:19

So, in summary, what we have shown is that Hebbian learning learns a weight vector that is aligned with the principal eigenvector of the input correlation or the input covariance matrix. In other words, it finds the direction of maximum variance in the input data, and that is precisely what principal component analysis does. But now, why is that interesting? Well, principal component analysis is a very important technique used in a variety of fields for tasks such as dimensionality reduction. For example, here what we've done is we've shown that this two-dimensional data can be compressed to just one dimension by projecting each of these two-dimensional points onto its corresponding location along this particular line. So we now have a compression from the 2D points to their 1D locations along this line, and that's an example of dimensionality reduction, or compression. And you can imagine that when we have a very large input dimension, such as the number of pixels in an image, this type of technique, where we find the directions of maximum variance in natural images or natural movies, is indeed going to be extremely useful. Because you can compress a very high-dimensional space, such as the space of the input image or the space of the input video, down to maybe a very small number of principal eigenvectors, the dominant eigenvectors of the input covariance matrix.

Well, that's great, but what if we give a neuron this data? What do you think the weight vector for the neuron will converge to if we apply the covariance learning rule? As you might have guessed, the covariance rule ends up finding the weight vector that is aligned with the direction of maximum variance in this data set. Now, unfortunately, as many of you will agree, this data set seems to consist of two clusters of data points. So here's one cluster, and here's the other. And so it appears that this particular data set is not correctly modeled by principal component analysis: just finding the direction of maximum variance through these two clusters doesn't seem to provide us with a very satisfying model of this particular dataset.

So the question that I would like to leave you with is: what should a network of neurons learn from such data? This will be the topic of our next lecture, in which we will encounter the interesting algorithm known as competitive learning. This will allow us to segue into generative models, which will in turn lead us into the exciting field known as unsupervised learning. So until then, hasta la vista, and goodbye.
