0:00

We're now going to apply Hessian-free optimization to the task of modeling character strings from Wikipedia. The idea is that you read a lot of Wikipedia and then try to predict the next character. Before we get to see what the model learns, I want to describe why we need multiplicative connections and how we can implement those multiplicative connections efficiently in a recurrent neural network. I need to start by explaining why we chose to model character strings rather than strings of words, which is what you normally do when you're trying to model language.

The web is composed of character strings. Any learning method that's powerful enough to understand what's going on in the world by reading the web ought to find it trivial to learn which strings make words. As we'll see, this turns out to be true. So we're going to be very ambitious here: we want something that will read Wikipedia and understand the world. If we have to pre-process the text in Wikipedia into words, it's going to be a big hassle.

There are all sorts of problems. The first problem is morphemes. The smallest units of meaning, according to linguists, are morphemes, so we're going to have to break up a word into these morphemes if we want to deal with it sensibly. The problem is, it's not quite clear what morphemes are. There are things that are a bit like morphemes, but that a linguist wouldn't call a morpheme. In English, if you take any word that starts with the letters "sn", it has a very high chance of meaning something to do with the lips or nose, particularly the upper lip or nose: words like snarl, and sneeze, and snot, and snog, and snort. There are too many of these words for it to be just coincidence. Many people say, yes, but what about snow? That's got nothing to do with the upper lip or nose. But ask yourself something: why is snow such a good word for cocaine? Then there are words that come in several

pieces. So normally we'd want to treat New York as one lexical item. But if we're talking about the new York Minster roof, then we might want to treat "new" and "York" as two separate lexical items. And then there are languages like Finnish. Finnish is an agglutinative language, so it puts together lots of morphemes to make great big words. Here's an example of a word in Finnish that takes about five words in English to say the same thing.

Â 2:56

I have no idea what this word means, but despite my lack of understanding, it makes the point. So here's an obvious kind of recurrent net we might use to try to model character strings. It has a hidden state, and in this case we're going to use 1500 hidden units. The hidden-state dynamics are that the hidden state at time t provides input to determine the hidden state at time t+1. The character also provides some input, so we add together the effect of the current character and the previous hidden state to get the new hidden state. Then, when we arrive at a new hidden state, we try to predict the next character. So we have a single softmax over the 86 characters, and we get the hidden state to try to assign high probability to the correct next character and low probability to the others. We train the whole system by backpropagating from that softmax the log probability of the correct character. We backpropagate that through the hidden-to-output connections, back through the character-to-hidden connections, and then back through the hidden-to-hidden connections, and so on, all the way back to the beginning of the string.
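The additive net just described can be sketched in a few lines. This is my own minimal illustration with made-up weight initializations, not the actual course code; only the sizes (1500 hidden units, 86 characters) come from the lecture:

```python
import numpy as np

# One step of the additive character-level RNN described above: the
# previous hidden state and a one-hot character both feed additively into
# the new hidden state, and a softmax over the 86 characters predicts the
# next character. (Illustrative sketch, not the course implementation.)
N_HIDDEN, N_CHARS = 1500, 86

rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.01, (N_HIDDEN, N_HIDDEN))  # hidden-to-hidden
W_ch = rng.normal(0, 0.01, (N_CHARS, N_HIDDEN))   # character-to-hidden
W_ho = rng.normal(0, 0.01, (N_HIDDEN, N_CHARS))   # hidden-to-output

def step(h_prev, char_idx):
    """Combine the character's input with the previous hidden state's
    input additively, squash, then softmax over the next character."""
    h = np.tanh(h_prev @ W_hh + W_ch[char_idx])
    logits = h @ W_ho
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return h, p

h = np.zeros(N_HIDDEN)
h, p = step(h, char_idx=5)
# Training backpropagates -log p[correct_char] through all three weight
# matrices, all the way back to the start of the string.
```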

Â 4:20

It's a lot easier to predict 86 characters than 100,000 words, so it's easier to use a softmax at the output: we don't have the problem of a great big softmax. Now, to explain why we didn't use that kind of recurrent net, but instead used a different kind of net that worked quite a lot better: you could arrange all possible character strings into a tree, with a branching ratio of 86 in our case.

What I'm showing here is a tiny little subtree of that great big tree. In fact, this little subtree will occur many times, but with different things represented by the dot dot dot before the "fix". So this represents that we had a whole bunch of characters, then we had f, then i, then x. Now if we get an i, we're going to go to the left; if we get an e, we're going to go to the right; and so on. So each time we get a character, we move one step down in this tree to a new node. There are exponentially many nodes in the tree of all character strings of length n, so this is going to be a very big tree.

We couldn't possibly store it all. If we could store it all, what we'd like to do is put a probability on each of those arrows, and that would be the probability of producing that letter given the context of the node. In an RNN, we try to deal with the fact that the full tree is enormous by using a hidden state vector to represent each of these nodes. So now what the next character has to do is take the hidden state vector that's representing the whole string of characters ending in "fix", and operate on that hidden state vector to produce the appropriate new hidden state vector if the next character was an i. So when you see an i, you want to turn the hidden state vector into a new hidden state vector. A nice thing about implementing these

nodes in this character tree by using the hidden state of a recurrent neural network is that we can share a lot of structure. For example, by the time we arrive at the node that says f, i, x, we may have decided that it's probably a verb. And if it's a verb, then i is quite likely, because of the ending i, n, g. That knowledge that i is quite likely after a verb can be shared with lots of other nodes that don't have "fix" in them. So we can get i to operate on the part of the state that represents that it's a verb, and that can be shared between all the verbs. Notice that it's really the conjunction of the current state we're at and the character that determines where we want to go. We don't want i to give us a state that's expecting to get an n next if it wasn't a verb. So we don't want to say that i tends to make you expect an n next. We really want to say: if you already think it's a verb, then when you see an i, you should expect an n next. It's the conjunction of the fact that we think it's a verb and the fact that we saw an i that gets us into the state labeled f, i, x, i that's expecting to see an n. So we're going to try to capture that by using multiplicative connections. Instead of using the character inputs to the recurrent net to give extra additive input to the hidden units, we're going to use those characters to swap in a whole hidden-to-hidden weight matrix. The character is going to determine the transition matrix.
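A literal version of "swap in a whole matrix per character" would store one full transition matrix for each character. This sketch is my own illustration (not the course implementation), and it makes the parameter-count problem concrete, which is what the factors solve:

```python
import numpy as np

# Naive multiplicative scheme: each character k owns an entire
# hidden-to-hidden transition matrix W_per_char[k] that it swaps in.
# (Illustrative sketch only; step_naive is a hypothetical helper.)
def step_naive(h_prev, char_idx, W_per_char, W_ch):
    # The current character selects the whole transition matrix.
    return np.tanh(h_prev @ W_per_char[char_idx] + W_ch[char_idx])

# The catch: with 1500 hidden units and 86 characters, this needs
# 86 * 1500 * 1500 transition weights.
n_params = 86 * 1500 * 1500
print(n_params)  # 193500000 -- nearly 200 million parameters
```

Roughly 194 million transition parameters is far too many to fit, which is why the characters are instead used to scale a shared set of factors, as described next.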

Â 9:45

So what each factor does is first compute a weighted sum for each of its two input groups. We take the state vector of group a, which I'll just call a, and multiply it by the weights on the connections coming into the factor. In other words, we take the scalar product of the vector a and the weight vector u, and that gives us a number at the left-hand vertex of that triangle. Similarly, we take the state vector of group b and multiply it by the weight vector w, and we get another number at the bottom vertex of the triangle. We now multiply those two numbers together, and that gives us a scalar. We use that scalar to scale the outgoing weights v in order to provide input to group c. So the input to group c is just the product of the two numbers that arrive at the two vertices of the triangle, times the outgoing weight vector v.
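The factor computation just described fits in one line of code. Here's a small sketch with made-up sizes (the vector names a, b, u, w, v follow the lecture; everything else is illustrative):

```python
import numpy as np

# One multiplicative factor: two scalar products give the two numbers at
# the triangle's vertices; their product is a scalar gain on the outgoing
# weight vector v, which provides the input to group c.
rng = np.random.default_rng(1)
a = rng.normal(size=8)   # state of group a (previous hidden state)
b = rng.normal(size=5)   # state of group b (the current character)
u = rng.normal(size=8)   # incoming weights from group a to the factor
w = rng.normal(size=5)   # incoming weights from group b to the factor
v = rng.normal(size=8)   # outgoing weights from the factor to group c

def factor_output(a, u, b, w, v):
    # (a . u) * (b . w) is a scalar; it scales the outgoing vector v.
    return (a @ u) * (b @ w) * v

c_input = factor_output(a, u, b, w, v)
```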

Â 12:08

We can rearrange that equation so that we get one scalar product, and then we rearrange the last bit so that we take the outer product of the weight vector u and the weight vector v, which gives us a matrix. The scalar product that's computed by multiplying b by w is just a coefficient on that matrix. So we get a scalar coefficient, we multiply a rank-one matrix by that scalar coefficient to give us a scaled rank-one matrix, and then we multiply the current hidden state a by this matrix to determine the input the factor f gives to the next hidden state. If we sum that up over all the factors, the total input to group c is just a sum over all factors of a scalar times a rank-one matrix. That sum is a great big matrix, the transition matrix, and it gets multiplied by the current hidden state to produce the new hidden state. So we can see that we synthesize the transition matrix out of the rank-one matrices provided by the factors. What the current character in group b has done is determine the weight on each of these rank-one matrices: the scalar product of b and w determines the scalar coefficient to put on each of the matrices out of which we compose this great big character-specific transition matrix. So here's a picture of the whole system:

we have a number of factors; in fact, we'll have about 1500 factors. The character input is different in that only one of the characters is active at a time, so only one of its weights is relevant at a time. That weight from the current character k, which we call w_kf, is the gain that's used on the rank-one matrix obtained by taking the outer product of u and v.
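Putting the pieces together, this small sketch (made-up sizes, not the actual course code) checks that summing the per-factor contributions is the same as first synthesizing the character-specific transition matrix from scaled rank-one matrices and then multiplying it by the hidden state:

```python
import numpy as np

# Factored multiplicative transition, as described above. Sizes are tiny
# for illustration; the lecture uses about 1500 hidden units and factors.
rng = np.random.default_rng(2)
n_hidden, n_chars, n_factors = 6, 4, 5

U = rng.normal(size=(n_factors, n_hidden))  # hidden -> factor weights u_f
W = rng.normal(size=(n_factors, n_chars))   # character -> factor weights w_kf
V = rng.normal(size=(n_factors, n_hidden))  # factor -> hidden weights v_f

a = rng.normal(size=n_hidden)        # current hidden state
b = np.zeros(n_chars); b[2] = 1.0    # one-hot current character, k = 2

# Per-factor view: each factor contributes (a . u_f)(b . w_f) v_f.
per_factor = sum((a @ U[f]) * (b @ W[f]) * V[f] for f in range(n_factors))

# Matrix view: the character's weight w_kf scales the rank-one matrix
# v_f u_f^T; their sum is the character-specific transition matrix.
T = sum((b @ W[f]) * np.outer(V[f], U[f]) for f in range(n_factors))
via_matrix = T @ a

# Both routes give the same input to the next hidden state.
assert np.allclose(per_factor, via_matrix)
```

Because b is one-hot, `b @ W[f]` simply picks out w_kf for the active character k, so each character composes its own transition matrix out of the shared rank-one pieces.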
