0:00

I'm Hanlin, from Intel.

In this lecture, we will discuss recurrent neural networks. We will start with a review of important topics from previous lectures, and then provide an overview of how RNNs operate. We will then discuss word embeddings, and take a look at some of the key neural network architectures out there, including LSTM and GRU networks.

As we now know, training a neural network consists of randomly initializing the weights, fetching a batch of data, and then forward propagating it through the network. We can then compute a cost, which is used to backpropagate and determine our weight updates. How do we change our weights to reduce the cost? In this example, the network is being trained to recognize handwritten digits using the MNIST dataset.

Backpropagation uses the chain rule to compute the partial derivative of the cost with respect to all the weights in the network. Using gradient descent, or a variant of gradient descent, we can take a step towards a better set of weights with a lower cost. All of these operations can be computed efficiently as matrix multiplications.
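The full training step just described can be sketched in a few lines of NumPy. The layer size, quadratic cost, and learning rate here are illustrative assumptions, not the lecture's slide example:

```python
import numpy as np

# A minimal sketch of one training step for a single affine layer with a
# mean-squared-error cost; all shapes and values are illustrative.
rng = np.random.default_rng(0)

W = rng.normal(size=(3, 4)) * 0.1   # randomly initialized weights
x = rng.normal(size=(4, 5))         # a batch of 5 input vectors
y_true = rng.normal(size=(3, 5))    # target outputs for the batch

# Forward propagation: one matrix multiplication.
y_pred = W @ x

# Compute a cost: mean squared error over the batch.
cost = np.mean((y_pred - y_true) ** 2)

# Backpropagation: the chain rule gives the partial derivative of the
# cost with respect to W.
grad_W = 2.0 * (y_pred - y_true) @ x.T / y_pred.size

# Gradient descent: take a step towards a better set of weights.
learning_rate = 0.1
W_new = W - learning_rate * grad_W

new_cost = np.mean((W_new @ x - y_true) ** 2)
```

For a small enough learning rate, the new cost is guaranteed to be lower than the old one for this quadratic cost.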

So why do we need recurrent neural networks? Feed-forward neural networks make a couple of assumptions. For example, they assume independence across the training examples: after a training example has passed through the network, the state is then discarded. All we care about are the gradients of the network. But in many cases, such as with sequences of words, there is temporal dependence and contextual dependence that needs to be explicitly captured by the model.

Also, most feed-forward neural networks assume an input vector of a fixed length. For example, in all of our previous slides, the images were all fixed in size across a batch. However, text or speech can vary greatly in size, sometimes by an order of magnitude, and that is a variability that cannot be captured by feed-forward networks.

Instead, RNNs introduce an explicit modeling of sequentiality, which allows them to capture both short-range and long-range dependencies in the data. Through training, such a model can also learn and adapt to the time scales in the data itself. So RNNs are often used to handle any type of data where we have variable sequence lengths, and where contextual and temporal dependencies have to be captured in order for the task to be accomplished successfully.

The building block of recurrent neural networks is a recurrent neuron. What I am showing here is a simple affine layer, similar to what I had shown previously, where the output of the unit is the input multiplied by a weight matrix. To turn this into a recurrent neuron, we add a recurrent connection, such that the activity at a particular time t depends on its activity at the previous time, t minus one, multiplied by a recurrent matrix W_R. And so now we have our equations, which are exactly the same as what we had seen before, except we have this additional term here to represent that the activity depends on the activity of the previous time step, as determined through the weight matrix W_R. Additionally, we add an output to this model, y of t, which takes the activity of that recurrent neuron and passes it through a matrix W_Y to provide the output. And so here you have the fundamental building block: a recurrent neuron.
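The recurrent neuron just described can be written as a short step function. The weight names follow the lecture's notation (W, W_R, W_Y), while the tanh nonlinearity and the layer sizes are illustrative assumptions:

```python
import numpy as np

# A minimal sketch of one recurrent neuron step; sizes are illustrative.
rng = np.random.default_rng(1)

n_in, n_hidden, n_out = 3, 4, 2
W   = rng.normal(size=(n_hidden, n_in)) * 0.1      # input weights
W_R = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # recurrent weights
W_Y = rng.normal(size=(n_out, n_hidden)) * 0.1     # output weights

def step(x_t, h_prev):
    """One time step: the activity h_t depends on the current input AND
    on the previous activity, through the recurrent matrix W_R."""
    h_t = np.tanh(W @ x_t + W_R @ h_prev)
    y_t = W_Y @ h_t                  # read out the output y of t
    return h_t, y_t

h = np.zeros(n_hidden)               # initial hidden activity
x = rng.normal(size=n_in)
h, y = step(x, h)
```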

So how do we train such a network? We can unroll the recurrent neural network into a feed-forward one. So here's what I mean by that. Here is our same recurrent neuron from before, where we have it taking its input x of t and multiplying it by a weight matrix to provide h of t, this hidden representation. And we can read out from this hidden representation by passing it through a weight matrix W_Y to get the output y of t. Importantly, we have an arrow, W_R, to represent the fact that the subsequent time step's activity, h of t plus one, depends on what came before it. And so here you can see unrolling the network into a feed-forward network in time, except that the weights W_R that span these different layers in the feed-forward network are tied together. But once we've unfurled this recurrent neural network into a feed-forward network, we can apply the same backpropagation equations that I had shown previously to train this network.
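Unrolling amounts to running the same step repeatedly over the sequence: each iteration is one "layer" of the unrolled feed-forward network, and every layer reuses (is tied to) the same weights. A minimal sketch, with illustrative shapes:

```python
import numpy as np

# Unrolling a recurrent network in time: the loop below is equivalent to
# a feed-forward network with one layer per time step, where all layers
# share the SAME weight matrices W and W_R.
rng = np.random.default_rng(2)
W   = rng.normal(size=(4, 3)) * 0.1   # input weights (tied across steps)
W_R = rng.normal(size=(4, 4)) * 0.1   # recurrent weights (tied across steps)

sequence = rng.normal(size=(6, 3))    # 6 time steps of 3-dim inputs

h = np.zeros(4)
hidden_states = []                    # one hidden "layer" per time step
for x_t in sequence:
    h = np.tanh(W @ x_t + W_R @ h)    # same weights, every step
    hidden_states.append(h)
```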

Now that you understand how an RNN works, we can look at how they're used. So here is an example where we use an RNN to learn a language model, such that it correctly predicts the next letter in the word based on the previous letters. You can imagine a world where we have a vocabulary of four letters, so the input here is encoded. Here we have the input character 'h'. And we have the values in these gray boxes that represent the hidden activations. At every time step, we read out a prediction of the next character. So here, the model is quite confident that the next character is going to be 'e'. And as we move on and apply more and more characters, it can begin to predict additional characters in the next time step. So this is nothing more than a language model, where a model is learning to predict what happens next based on the sequence of characters that it had seen before.
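The character-level prediction loop can be sketched with a softmax readout over the four-letter vocabulary. The weights here are random, so the prediction is untrained; the point is the shape of the computation, and the specific letters are an assumed toy example:

```python
import numpy as np

# Sketch of a character-level language model step over a toy four-letter
# vocabulary. Weights are random (untrained); shapes are illustrative.
rng = np.random.default_rng(3)
vocab = ['h', 'e', 'l', 'o']

W   = rng.normal(size=(8, 4)) * 0.1
W_R = rng.normal(size=(8, 8)) * 0.1
W_Y = rng.normal(size=(4, 8)) * 0.1

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[vocab.index(ch)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(8)
for ch in "hell":                     # feed characters one at a time
    h = np.tanh(W @ one_hot(ch) + W_R @ h)
    probs = softmax(W_Y @ h)          # distribution over the next character
```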

Here's an example of that same language model, but now applied to words, where you have the input "cash flow is high", and the model is predicting the next word in that sentence based on what has been seen before. You will also notice in this prediction that there is a level of ambiguity. When the model sees "its cash flow is", then depending on the state of the market, you can see that its output predictions in this green box are sort of split between "low" and "high", which makes sense, as both are syntactically correct. The difference in probabilities between the two may be related to the context of the text, or previous data that it has been trained on.

Recurrent neural networks can also be merged with convolutional neural networks to produce an image captioning network. As seen here, after we provide the image as the input, the network can learn to generate a caption, such as "a group of people shopping at an outdoor market". And here, the input to the RNN could be the feature representation from one of the last few layers of the convolutional neural network.

Training recurrent neural networks is very similar to training a feed-forward network. Using the chain rule, you can determine the partial derivative of the cost with respect to each weight in the network. In contrast to feed-forward networks, we now have costs associated with every single time step. And so what we do is combine the gradients across these time steps, as shown here.

The issue of vanishing and exploding gradients can be quite problematic for recurrent neural networks. With especially deep networks, the weight W_R is repeatedly multiplied in at each time step. So in that way, the magnitude of your gradients is proportional to the magnitude of W_R to the power of t. What this means is that if the weight is greater than one, the gradients can explode, whereas if the weight is less than one, the gradients can possibly vanish. That's also going to depend a lot on the activation function in the hidden node. Given a rectified linear unit, it is easier to imagine, for example, an exploding gradient, whereas with the sigmoid activation function, the vanishing gradient problem becomes much more pressing.
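The power-of-t behavior is easy to see numerically. Using scalar weights as a stand-in for the magnitude of W_R, and an assumed sequence length of 50:

```python
# Why gradients explode or vanish: the backpropagated signal picks up a
# factor related to W_R at every time step, so its size scales roughly
# like |W_R| ** t. Scalars stand in for the matrix here.
t = 50                   # an assumed sequence length

w_big, w_small = 1.1, 0.9

explode = w_big ** t     # weight slightly above 1: grows without bound
vanish  = w_small ** t   # weight slightly below 1: shrinks toward zero
```

Even modest deviations from one blow up or wash out over 50 steps: 1.1**50 is above 100, while 0.9**50 is below 0.01.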

It is actually quite easy to detect exploding gradients during training: you simply see the cost explode. And to counteract that, one can simply clip the gradients to stay inside a particular threshold, thus forcing the gradients to be within a certain bound. Additionally, optimization methods such as RMSprop can adaptively adjust the learning rate depending on the size of the gradient itself.
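Gradient clipping by norm is only a few lines; the threshold value here is an arbitrary illustration:

```python
import numpy as np

# A minimal sketch of gradient clipping: if the gradient's norm exceeds a
# threshold, rescale it so the update stays within a chosen bound.
def clip_gradient(grad, threshold):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)  # same direction, bounded size
    return grad

g = np.array([30.0, 40.0])                # norm 50: an "exploded" gradient
clipped = clip_gradient(g, threshold=5.0) # rescaled to norm 5
```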

Vanishing gradients, however, are more problematic to find, because it is not always obvious when they occur or how to deal with them. One of the more popular methods of addressing this issue is actually not to use the RNNs that I had introduced earlier, but to use LSTM and GRU networks. And that's what we're going to discuss next.

So, one way to combat the vanishing or exploding gradient problem is to have this very simple model here, where you have a unit that's connected to itself over time with a weight of one. Now you can see that as you unroll this network, the activity at every time step is equal to the activity at the previous time step. So while you no longer have a vanishing or exploding gradient here, it's not very interesting behavior, because it just repeats itself over and over again, as if it were sort of a memory unit.

So what we're going to do is manipulate this memory unit by adding different operations: the ability to flush the memory by rewriting it, the ability to add to the memory, and the ability to read from the memory.

So what we do is take this memory, as I had shown you before, where each unit is connected to the next one with a weight of one, and then we attach what's shown here as an output gate. When we want to read from a memory cell, we take the activity of that memory cell, which is a vector, and we pass it through a tanh function and multiply by a gate, shown here as o of t. This gate is a vector of numbers between zero and one, and it controls what exactly is emitted from the network. The gate itself is an affine layer with a sigmoid activation function. During training, the weights learn to produce the right output from the model, from the hidden state of the network.

You can see this math in the equation here, where the output of the model is an activation function that wraps an affine layer, and then we do an element-wise multiplication with the tanh of the memory cell. To make this easier to express, we represent the output gate as o of t. So now you just have o of t, and then the element-wise operation with the tanh of the activity from the memory cell. And importantly, this output gate is simply an affine layer, very similar to what we had introduced previously.

The forget gate follows a similar approach to the output gate. We use an affine layer with a different set of weights, whose outputs are between zero and one, and we insert that gate as a multiplication between the memory cell at time t and at time t plus one. So if these values are close to zero, the values in the memory cell are forgotten from one time step to the next; if they are close to one, they are maintained across time.

The input gate is used to write new data to the memory cell. It has two components: an affine layer with weights W_c and a tanh activation function, which generates a new proposed input into the memory cell; and it also contains an input gate, as shown here, which modulates the proposed input and then writes it to the memory cell. So we can think of the next state of the LSTM, c of t plus one, as how much we want to forget from the previous time step, plus a proposal for the new time step's input, multiplied by how much we want to accept this new proposal. It is important to remember here that all the values in the network are vectors, not scalars.
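Putting the three gates together gives the full LSTM step. The weight shapes and the convention of concatenating the previous hidden state with the input are illustrative assumptions; the gate structure itself follows the description above:

```python
import numpy as np

# A minimal sketch of one LSTM step assembled from the gates described
# above: forget gate f, input gate i, proposed input c_tilde, output
# gate o. All values are vectors, and gates act element-wise.
rng = np.random.default_rng(4)
n_in, n_h = 3, 5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each gate is an affine layer over the concatenated [h_prev, x_t]
# (an assumed convention), with outputs between zero and one.
Wf, Wi, Wc, Wo = (rng.normal(size=(n_h, n_h + n_in)) * 0.1 for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)                  # forget gate: keep or flush memory
    i = sigmoid(Wi @ z)                  # input gate: accept the proposal?
    c_tilde = np.tanh(Wc @ z)            # proposed new memory content
    c = f * c_prev + i * c_tilde         # how much to forget + how much to accept
    h = sigmoid(Wo @ z) * np.tanh(c)     # output gate reads from the memory
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h))
```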

So here is the LSTM model, with the forget gate, the input gate, and the output gate.

And here's an example of an LSTM where the value being recorded in the memory cell is the gender of the speaker. You can imagine this being a very important value for conditioning the predictions of the model itself. So you can see here that when you encounter the word Bob, you have learned to forget that previous activity, because now you have a new gender for the speaker, and you will learn to overwrite that value with a value of one, to represent a male speaker. And then the model continues on, processing more data. So in this example, the forget gate is zero, because we want to forget what had come before, and the input gate is one, to represent the fact that we have a male speaker in this particular sentence. And that is important, because when we reach the prediction phase, we can use that to predict "his" instead of "her" as the next possible word in the sentence.

Another popular architecture is Gated Recurrent Units, or GRUs. They are essentially a simplified LSTM. Here, all the gates are compressed into one update gate, and the input module is an affine layer that proposes the input, which is combined with the update gate to obtain the representation at the next time step. The remember gate controls how much the previous time step's representation impacts your proposal. We've seen many scenarios where the GRU performed similarly to the LSTM, so in that way it is somewhat attractive because of its more simplified representation.
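A GRU step can be sketched in the same style as the LSTM. The concatenation convention and weight shapes are illustrative assumptions; note there is no separate memory cell, just the hidden state:

```python
import numpy as np

# A minimal sketch of one GRU step: an update gate z blends the previous
# state with a proposed state, and a reset ("remember") gate r controls
# how much the previous state influences that proposal.
rng = np.random.default_rng(5)
n_in, n_h = 3, 5

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

Wz = rng.normal(size=(n_h, n_h + n_in)) * 0.1   # update gate weights
Wr = rng.normal(size=(n_h, n_h + n_in)) * 0.1   # reset gate weights
Wh = rng.normal(size=(n_h, n_h + n_in)) * 0.1   # proposal weights

def gru_step(x_t, h_prev):
    zc = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ zc)                         # update gate
    r = sigmoid(Wr @ zc)                         # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # proposal
    return (1 - z) * h_prev + z * h_tilde        # blend old state and proposal

h = gru_step(rng.normal(size=n_in), np.zeros(n_h))
```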

Bidirectional RNNs are also recurrent neural networks, except that they connect two hidden layers of opposite directions to the same output. With this structure, the upper layer can get information from both past and future states. Additionally, you can stack these bi-RNNs on top of each other to obtain more complex abstractions and features of the input. These architectures are often used in speech applications, where the transcription of a sound may depend not just on the audio that had come before, but also on the audio afterwards.
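The bidirectional idea reduces to running two independent recurrent passes, one forward in time and one backward, and concatenating their states at each step. A minimal sketch with illustrative shapes:

```python
import numpy as np

# A minimal sketch of a bidirectional RNN layer: one pass runs
# past -> future, another future -> past, and the two hidden states at
# each time step are concatenated so the layer above sees both contexts.
rng = np.random.default_rng(6)
n_in, n_h = 3, 4
Wf_in  = rng.normal(size=(n_h, n_in)) * 0.1   # forward-pass weights
Wf_rec = rng.normal(size=(n_h, n_h)) * 0.1
Wb_in  = rng.normal(size=(n_h, n_in)) * 0.1   # backward-pass weights
Wb_rec = rng.normal(size=(n_h, n_h)) * 0.1

def run(seq, W_in, W_rec):
    h, states = np.zeros(n_h), []
    for x_t in seq:
        h = np.tanh(W_in @ x_t + W_rec @ h)
        states.append(h)
    return states

seq = list(rng.normal(size=(6, n_in)))        # 6 time steps of input
fwd = run(seq, Wf_in, Wf_rec)                 # past -> future
bwd = run(seq[::-1], Wb_in, Wb_rec)[::-1]     # future -> past, re-aligned
combined = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```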

A great application of this is the Deep Speech 2 model, a state-of-the-art speech transcription model that was published by Baidu several years ago.

It is important to understand what LSTM units learn after the training process is completed. This is work by Andrej Karpathy, where he trained a language-model recurrent neural network on several important texts, such as the novel War and Peace, and also a corpus of Linux kernel source code. He identified individual cells that are sensitive to particular properties. So on the top here, you can see a cell that is sensitive to position in the line itself, or a cell that changes its activation based on whether it is inside a quote or not. In all of these examples, you have the text, and the color represents the activity of this particular identified unit. You can see, in the Linux kernel dataset, we have cells that robustly activate inside if statements, or even cells that turn on inside comments or quotes in the code itself.

It is exciting to see what we can build with recurrent neural networks, and the type of visualizations that will become available as we try to understand what these recurrent neural networks are learning, simply by ingesting a large corpus of natural language.
