0:00

In this video, I am going to give an overview of various types of models that have been used for sequences. I'll start with the simplest kind of model: autoregressive models, which just try to predict the next term of the sequence from previous terms. I'll talk about more elaborate variants of them using hidden units. And then I'll talk about more interesting kinds of models that have hidden state and hidden dynamics. These include linear dynamical systems and hidden Markov models. Most of these are quite complicated kinds of models, and I don't expect you to understand all the details of them. The main point of mentioning them is to be able to show how recurrent neural networks are related to models of that kind.

When we're using machine learning to model sequences, we often want to turn one sequence into another sequence. For example, we might want to turn English words into French words, or we might want to take a sequence of sound pressures and turn it into a sequence of word identities, which is what's happening in speech recognition.

1:13

Sometimes we don't have a separate target sequence, and in that case we can get a teaching signal by trying to predict the next term in the input sequence. So the target output sequence is simply the input sequence with an advance of one time step. This seems much more natural than trying to predict one pixel in an image from all the other pixels, or one patch of an image from the rest of the image. One reason it probably seems more natural is that for temporal sequences there is a natural order to do the predictions in, whereas for images it's not clear what you should predict from what. But in fact a similar approach works very well for images.
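The shift-by-one targets just described can be sketched in a few lines (a toy example with made-up values):

```python
# Predicting the next term: the target sequence is simply the input
# sequence advanced by one time step.

def make_next_step_targets(sequence):
    """Pair each position in the sequence with the term that follows it."""
    inputs = sequence[:-1]   # all terms except the last
    targets = sequence[1:]   # the same sequence shifted left by one
    return inputs, targets

inputs, targets = make_next_step_targets([3, 1, 4, 1, 5])
# inputs  -> [3, 1, 4, 1]
# targets -> [1, 4, 1, 5]
```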

1:58

When we predict the next term in a sequence, it blurs the distinction between supervised and unsupervised learning that I made at the beginning of the course. So we use methods that were designed for supervised learning to predict the next term, but we don't require a separate teaching signal. So in that sense, it's unsupervised.

I'm now going to give a quick review of some of the other models of sequences, before we get on to using recurrent neural nets to model sequences. So a nice simple model for sequences that doesn't have any memory is an autoregressive model. What that does is take some previous terms in the sequence and try to predict the next term, basically as a weighted average of previous terms. The previous terms might be individual values or they might be whole vectors. A linear autoregressive model would just take a weighted average of those to predict the next term. We can make that considerably more complicated by adding hidden units. So in a feedforward neural net, we might take some previous input terms, put them through some hidden units, and predict the next term.

Memoryless models are only one subclass of models that can be used for sequences. We can think about ways of generating sequences, and one very natural way to generate a sequence is to have a model that has some hidden state which has its own internal dynamics. So the hidden state evolves according to its internal dynamics, and the hidden state also produces observations, and we get to see those observations. That's a much more interesting kind of model.
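The linear autoregressive model mentioned above can be sketched in a few lines (the weights here are illustrative, not fitted from data):

```python
# A linear autoregressive model: the next term is predicted as a
# weighted sum of the previous few terms.

def ar_predict(history, weights):
    """Predict the next term from the last len(weights) terms."""
    k = len(weights)
    recent = history[-k:]  # the most recent k terms
    return sum(w * x for w, x in zip(weights, recent))

# e.g. predict x_t = 0.5 * x_{t-2} + 0.5 * x_{t-1}
pred = ar_predict([2.0, 4.0, 6.0], weights=[0.5, 0.5])
# pred -> 5.0  (the average of 4.0 and 6.0)
```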

4:04

If the dynamics of the hidden state is noisy and the way it generates outputs from its hidden state is noisy, then by observing the output of a generative model like this, you can never know for sure what its hidden state was. The best you can do is to infer a probability distribution over the space of all possible hidden state vectors. You can know that it's probably in some part of the space and not another part, but you can't pin it down exactly.

So with a generative model like this, if you get to observe what it produces and you now try to infer what the hidden state was, in general that's very hard. But there are two types of hidden state model for which the computation is tractable. That is, there's a fairly straightforward computation that allows you to infer the probability distribution over the hidden state vectors that might have caused the data. Of course, when we do this and apply it to real data, we're assuming that the real data is generated by our model. That's typically what we do when we're modeling things: we assume the data was generated by the model, and then we infer what state the model must have been in, in order to generate that data.
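As a concrete illustration of such a generative model, here is a minimal sketch with hypothetical linear dynamics and made-up noise levels; the point is that both the state update and the output are noisy, so the observations alone never pin down the hidden state:

```python
# Sketch of a hidden-state generative model: the hidden state evolves
# with noise, and each observation is a noisy function of that state.
import random

def generate_observations(steps, seed=0):
    rng = random.Random(seed)
    h = 0.0  # hidden state (never revealed to the observer)
    observations = []
    for _ in range(steps):
        h = 0.9 * h + rng.gauss(0.0, 1.0)             # noisy dynamics
        observations.append(h + rng.gauss(0.0, 0.5))  # noisy output
    return observations

obs = generate_observations(5)
```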

5:23

The next three slides are mainly intended for people who already know about the two types of hidden state model I'm going to describe. The point of the slides is to make clear how recurrent neural networks differ from those standard models. If you can't follow the details of the two standard models, don't worry too much. That's not the main point.

5:50

So one standard model is a linear dynamical system. It's very widely used in engineering. This is a generative model that has real-valued hidden state. The hidden state has linear dynamics, shown by the red arrows on the right, and the dynamics has Gaussian noise, so that the hidden state evolves probabilistically.

6:15

There may also be driving inputs, shown at the bottom there, which directly influence the hidden state. So the inputs influence the hidden state directly, and the hidden state determines the output. To predict the next output of a system like this, we need to be able to infer its hidden state. These kinds of systems are used, for example, for tracking missiles. In fact, one of the earliest uses of Gaussian distributions was for trying to track planets from noisy observations. Gauss actually figured out that, if you assume Gaussian noise, you could do a good job of that.

7:00

One nice property of a Gaussian is that if you linearly transform a Gaussian, you get another Gaussian. Because all the noise in a linear dynamical system is Gaussian, it turns out that the distribution over the hidden state given the observations so far, that is, given the outputs so far, is also a Gaussian. It's a full covariance Gaussian, and it's quite complicated to compute what it is, but it can be computed efficiently. There's a technique called Kalman filtering, which is an efficient recursive way of updating your representation of the hidden state given a new observation.

So, to summarize: given observations of the output of the system, we can't be sure what hidden state it was in, but we can estimate a Gaussian distribution over the possible hidden states it might have been in. Always assuming, of course, that our model is a correct model of the reality we're observing.
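A one-dimensional Kalman filter sketch (all model parameters are illustrative) shows the recursive update: after each observation, the Gaussian belief over the hidden state is represented by just a mean and a variance.

```python
# Scalar Kalman filter: recursively update a Gaussian belief
# (mean, variance) over the hidden state after each observation.
# Model: x_t = a * x_{t-1} + noise(q),  y_t = x_t + noise(r).

def kalman_step(mean, var, obs, a=1.0, q=1.0, r=1.0):
    # Predict: push the current Gaussian through the linear dynamics.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Update: blend the prediction with the new observation.
    gain = pred_var / (pred_var + r)
    new_mean = pred_mean + gain * (obs - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var

mean, var = 0.0, 1.0  # initial Gaussian belief
for y in [1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, y)
```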

8:06

A different kind of hidden state model, which uses discrete distributions rather than Gaussian distributions, is a hidden Markov model. And because it's based on discrete mathematics, computer scientists love these ones. In a hidden Markov model, the hidden state consists of a one-of-N choice. So there are a number of things called states, and the system is always in exactly one of those states. The transitions between states are probabilistic. They're controlled by a transition matrix, which is simply a bunch of probabilities that say: if you're in state one at time one, what's the probability of you going to state three at time two?

The output model is also stochastic. So the state that the system is in doesn't completely determine what output it produces. There's some variation in the output that each state can produce. Because of that, we can't be sure which state produced a given output. In a sense, the states are hidden behind this probabilistic veil, and that's why they're called hidden.
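A tiny hidden Markov model can be sketched as follows (two states, made-up probabilities); both the state transitions and the outputs are stochastic, which is why the state sequence stays hidden behind the observations:

```python
# A 2-state hidden Markov model generating a sequence of symbols.
import random

TRANSITION = [[0.7, 0.3],   # P(next state | current state 0)
              [0.4, 0.6]]   # P(next state | current state 1)
EMISSION = [[0.9, 0.1],     # P(output symbol | state 0)
            [0.2, 0.8]]     # P(output symbol | state 1)

def generate_hmm(steps, seed=0):
    rng = random.Random(seed)
    state = 0
    outputs = []
    for _ in range(steps):
        # Emit a symbol stochastically from the current state...
        symbol = rng.choices([0, 1], weights=EMISSION[state])[0]
        outputs.append(symbol)
        # ...then move to the next state stochastically.
        state = rng.choices([0, 1], weights=TRANSITION[state])[0]
    return outputs

out = generate_hmm(10)
```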

9:19

Historically, the reason hidden units in a neural network are called hidden is that I liked this term from hidden Markov models. It sounded mysterious, so I stole it. It is easy to represent the probability distribution across N states with N numbers. So the nice thing about a hidden Markov model is that we can represent the probability distribution across its discrete states. Even though we don't know for sure what state it's in, we can easily represent the probability distribution.

9:53

To predict the next output from a hidden Markov model, we need to infer what hidden state it's probably in, and so we need to get our hands on that probability distribution. It turns out there's an easy method based on dynamic programming that allows us to take the observations we've made and, from those, compute the probability distribution across the hidden states. Once we have that distribution, there is a nice elegant learning algorithm for hidden Markov models, and that's what made them so appropriate for speech. In the 1970s, they took over speech recognition.
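The dynamic-programming method referred to here is the forward algorithm; a minimal sketch for a hypothetical two-state model (all parameters made up) computes the distribution over hidden states given the observations so far:

```python
# Forward algorithm for a toy 2-state HMM: maintain P(state | obs so far),
# renormalizing at each step.

TRANSITION = [[0.7, 0.3], [0.4, 0.6]]   # P(next state | current state)
EMISSION = [[0.9, 0.1], [0.2, 0.8]]     # P(symbol | state)
PRIOR = [0.5, 0.5]                      # P(initial state)

def forward_filter(observations):
    """Return the distribution over hidden states at the final step."""
    belief = [p * EMISSION[s][observations[0]] for s, p in enumerate(PRIOR)]
    total = sum(belief)
    belief = [b / total for b in belief]
    for obs in observations[1:]:
        # Propagate through the transitions, then weight by the emission.
        predicted = [sum(belief[i] * TRANSITION[i][j] for i in range(2))
                     for j in range(2)]
        belief = [predicted[j] * EMISSION[j][obs] for j in range(2)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief

posterior = forward_filter([0, 0, 1])
```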

10:30

There's a fundamental limitation of HMMs. It's easiest to understand this limitation if we consider what happens when a hidden Markov model generates data. At each time step when it's generating, it selects one of its hidden states. So if it's got n hidden states, the temporal information stored in the hidden state is at most log(n) bits. That's all it knows about what it's done so far.

Now let's consider how much information a hidden Markov model can convey to the second half of an utterance it produces from the first half. So imagine it's already produced the first half of an utterance, and now it's going to have to produce the second half. Remember, its memory of what it said for the first half is which of the n states it's in. So its memory only has log(n) bits of information in it. To produce a second half that's compatible with the first half, it must make the syntax fit. So, for example, the number and tense must agree. It also needs to make the semantics fit. It can't have the second half of the sentence be about something totally different from the first half. The intonation needs to fit too: it would sound very silly if the intonation contour completely changed halfway through the sentence. There are a lot of other things that also have to fit: the accent of the speaker, the rate they're speaking at, how loudly they're speaking, and the vocal tract characteristics of the speaker. All of those things must fit between the second half of the sentence and the first half. And so if you wanted a hidden Markov model to actually generate a sentence, the hidden state has to be able to convey all that information from the first half to the second half.

12:28

Now, the problem is that all of those aspects could easily come to a hundred bits of information. So the first half of the sentence needs to convey a hundred bits of information to the second half, and that means the hidden Markov model needs 2^100 states. That's just too many.
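The arithmetic behind this claim is straightforward: a one-of-N hidden state carries at most log2(N) bits, so carrying 100 bits forward requires 2^100 states.

```python
# Information capacity of a one-of-N hidden state.
import math

def bits_of_state(n_states):
    """Maximum bits a one-of-N choice can carry."""
    return math.log2(n_states)

def states_needed(bits):
    """Number of discrete states needed to carry the given bits."""
    return 2 ** bits

# e.g. 64 states carry at most 6 bits, while 100 bits of context
# would require 2^100 (about 1.3e30) states.
```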

12:46

So that brings us to recurrent neural networks. They have a much more efficient way of remembering information. They're very powerful because they combine two properties. First, they have distributed hidden state. That means several different units can be active at once, so they can remember several different things at once; they don't just have one active unit. Second, they're nonlinear. A linear dynamical system has a whole hidden state vector, so it's got more than one value at a time, but those values are constrained to act in a linear way so as to make inference easy. In a recurrent neural network, we allow the dynamics to be much more complicated.
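A single step of a recurrent neural network can be sketched as follows (toy sizes, made-up weights); the hidden state is a real-valued vector, with several units active at once, updated by a nonlinear function of its previous value and the current input:

```python
# One recurrent step: h_t = tanh(W_hh @ h_prev + W_xh * x)
# for a 2-unit hidden state and a scalar input x.
import math

def rnn_step(h_prev, x, W_hh, W_xh):
    """Nonlinear update of the distributed hidden state."""
    return [math.tanh(sum(w * h for w, h in zip(row, h_prev)) + wx * x)
            for row, wx in zip(W_hh, W_xh)]

W_hh = [[0.5, -0.3], [0.1, 0.8]]   # hidden-to-hidden weights
W_xh = [1.0, -1.0]                 # input-to-hidden weights

h = [0.0, 0.0]
for x in [1.0, 0.5, -0.2]:        # run the net over a short sequence
    h = rnn_step(h, x, W_hh, W_xh)
```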

13:34

With enough neurons and enough time, a recurrent neural network can compute anything that can be computed by a computer. It's a very powerful device.

Linear dynamical systems and hidden Markov models are both stochastic models. That is, the dynamics and the production of observations from the underlying state both involve intrinsic noise. The question is: do models need to be like that? Well, one thing to notice is that the posterior probability distribution over hidden states, in either a linear dynamical system or a hidden Markov model, is a deterministic function of the data that you've seen so far. That is, the inference algorithm for these systems ends up with a probability distribution, and that probability distribution is just a bunch of numbers, and those numbers are a deterministic function of the data so far. In a recurrent neural network, you also get a bunch of numbers that are a deterministic function of the data so far. And it might be a good idea to think of those numbers, which constitute the hidden state of a recurrent neural network, as very like the probability distribution for these simple stochastic models.

15:09

So what kinds of behavior can recurrent neural networks exhibit? Well, they can oscillate. That's obviously good for things like motion control where, when you're walking, for example, you want a regular oscillation, which is your stride. They can settle to point attractors. That might be good for retrieving memories. Later on in the course we'll look at Hopfield nets, which use settling to point attractors to store memories. The idea is that you have a rough idea of what you're trying to retrieve; you then let the system settle down to a stable point, and those stable points correspond to the things you know about. So by settling to that stable point, you retrieve a memory.

15:50

They can also behave chaotically if you set the weights in the appropriate regime. Often, chaotic behavior is bad for information processing, because in information processing you want to be able to behave reliably; you want to achieve something. But there are some circumstances where it's a good idea. If you're up against a much smarter adversary, you probably can't outwit them, so it might be a good idea just to behave randomly, and one way to get the appearance of randomness is to behave chaotically.

One nice thing about RNNs, which a long time ago I thought was going to make them very powerful, is that an RNN could learn to implement lots of little programs, using different subsets of its hidden state. Each of these little programs could capture a nugget of knowledge, and all of these things could run in parallel and interact with each other in complicated ways.

16:50

Unfortunately, the computational power of recurrent neural networks makes them very hard to train. For many years, we couldn't exploit the computational power of recurrent neural networks. There were some heroic efforts. For example, Tony Robinson managed to make quite a good speech recognizer using recurrent nets. He had to do a lot of work implementing them on a parallel computer built out of transputers. And it was only recently that people managed to produce recurrent neural networks that outperformed Tony Robinson's
