0:02

Hey. The attention mechanism is a super powerful technique in neural networks.

So let us cover it first with some pictures and then with some formulas.

Just to recap, we have an encoder that has h states and a decoder that has s states.

Now, let us imagine that we want to produce the next decoder state, s_j. How can we do this?

In the previous video, we just used the v vector, which held the information about the whole encoded input sentence.

Instead of that, we could do something better: we can look into all the states of the encoder with some weights.

So these alphas denote weights that tell us whether it is important to look there or here.

How can we compute these alphas? Well, we want them to be probabilities, and we also want them to capture some similarity between our current moment in the decoder and the different moments in the encoder.

This way, we will look into the more similar places, and they will give us the most important information to continue our decoding.

If we describe the same thing with formulas, we will say that, instead of just one v vector as before, we will have v_j, which is different for different positions of the decoder.

This v_j vector is computed as a weighted average of the encoder states.

The weights are computed with a softmax, because they need to be probabilities, and this softmax is applied to the similarities of encoder and decoder states.
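As a minimal sketch in code (the function names and array shapes are my own, not from the lecture): for decoder state s_j we score every encoder state, softmax the scores into the alphas, and take the weighted average of the encoder states.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h, s_j, score):
    """Context vector v_j for decoder state s_j.

    h     : (n, d) array of encoder states h_1..h_n
    s_j   : (d,) current decoder state
    score : similarity function score(h_i, s_j) -> scalar
    """
    sims = np.array([score(h_i, s_j) for h_i in h])
    alphas = softmax(sims)   # probabilities over encoder positions
    v_j = alphas @ h         # weighted average of encoder states
    return v_j, alphas
```

With the simplest choice, `score=np.dot`, this is dot-product attention.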

Now, do you have any ideas how to compute those similarities? I have a few.

Papers have actually tried lots and lots of different options, and there are just three options for you to try to memorize.

Maybe the easiest option is at the bottom: let us just take the dot product of the encoder and decoder states. It will give us some understanding of their similarity.

Another way is to say: maybe we need some weights there, some matrix that we need to learn, and it can help us capture the similarity better. This is called multiplicative attention.

And maybe we just do not want to think at all about how to compute it. We just want to say, "Well, a neural network is something intelligent. Please do it for us."

Then we just take one layer of a neural network and say that it needs to predict these similarities. So you see that you have h and s multiplied by some matrices and summed; that is why it is called additive attention. Then you have some non-linearity applied on top.

These are three options, and there are also many more.
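The three scoring functions can be sketched like this in NumPy (the parameter names W, W1, W2, v are my own notation; in a real model they would be learned, here they are just random toy values):

```python
import numpy as np

d = 4                            # toy hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))      # learned matrix (multiplicative)
W1 = rng.normal(size=(d, d))     # learned matrices (additive)
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)           # learned vector (additive)

def score_dot(h_i, s_j):
    # Dot product: no parameters at all.
    return h_i @ s_j

def score_mult(h_i, s_j):
    # Multiplicative attention: h^T W s, with a learned matrix W.
    return h_i @ W @ s_j

def score_add(h_i, s_j):
    # Additive attention: one network layer where h and s are
    # multiplied by matrices, summed, and passed through tanh.
    return v @ np.tanh(W1 @ h_i + W2 @ s_j)
```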

Now, let us put all the things together, just to understand again how attention works.

You have your conditional language modeling task: you try to predict the y sequence given the x sequence.

And now, you encode your x sequence into some v_j vector, which is different for every position.

This v_j vector is used in the decoder: it is concatenated with the current input of the decoder.

This way, the decoder is aware of all the information that it needs: the previous state, the current input, and now this specific context vector, computed especially for this current state.
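As a toy sketch of that decoder step (a plain RNN cell with made-up names and random weights, not the exact model from the lecture): the context v_j is concatenated with the current input before the recurrent update.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                # toy hidden / embedding size
W_in = rng.normal(size=(d, 2 * d))   # weights for [input; context]
W_rec = rng.normal(size=(d, d))      # recurrent weights

def decoder_step(s_prev, x_j, v_j):
    # Concatenate the current input with the attention context ...
    inp = np.concatenate([x_j, v_j])
    # ... so the new state sees the previous state, the input,
    # and the context computed especially for this position.
    return np.tanh(W_in @ inp + W_rec @ s_prev)
```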

Now, let us see where attention works.

Neural machine translation had lots of problems with long sentences: you can see that the BLEU score for long sentences is really lower, though it is quite okay for short ones.

Neural machine translation with attention can solve this problem, and it performs really well even for long sentences.

Well, this is really intuitive, because attention helps to focus on different parts of the sentence when you make your predictions.

And for long sentences this is really important because, otherwise, you have to encode the whole sentence into just one vector, and this is obviously not enough.

Now, to better understand those alpha_ij weights that we have learned with attention, let us try to visualize them.

These weights can be visualized as an i-by-j matrix. Let us ask: what is the most promising place in the encoder for every place in the decoder?

So with the light dots here, you can see the words that are aligned.

You see, this is a very close analogy to the word alignments that we covered before: we just learn that these words are somehow similar or relevant, and we should look into them to translate them into another language.

And this is also a good place to note that we can take some techniques from traditional methods, from word alignments, and incorporate them into neural machine translation. For example, priors for word alignments can really help here.

Now, do you think that this attention technique is really similar to how humans translate real sentences? I mean, humans also look at some places and then translate those places. They have some attention. Do you see any differences?

Well, actually, there is one important difference here.

Humans save time with attention, because they look only at those places that are relevant.

On the contrary, here we waste time, because to guess what the most relevant place is, we first need to check out all the places and compute similarities for all the encoder states. Only then can we say, "Okay, this piece of the encoder is the most meaningful."

Now, the last story for this video is how to make this attention save time instead of wasting it.

It is called local attention, and the idea is rather simple. We say: let us first try to predict the best place to look at, and then, after that, we will look only into some window around this place and will not compute similarities for the whole sequence.

Now, first, how can you predict the best place?

One easy way would be to say, "You know what? Those matrices should be strictly diagonal, and the place for position j should be j."

Well, for some languages that might be really bad, if they have a different word order, and then you can try to predict the position instead.

How do you do this? You have this sigmoid of something complicated.

The sigmoid gives you a probability between zero and one, and then you scale it by the length of the input sentence, I.

So you see that this will indeed be something between zero and I, which means that you will get some position in the input sentence.

Now, what is inside that sigmoid? Well, you see the current decoder state s_j, and you just apply some transformations to it, as usual in neural networks.
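A minimal sketch of this position predictor (the parameter names W_p and v_p are my own; the shape follows the predictive local attention idea, a sigmoid over a small network scaled by the input length):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predicted_position(s_j, W_p, v_p, I):
    """Predicted centre a_j of the attention window.

    s_j : (d,) current decoder state
    W_p : (d, d) and v_p : (d,) parameters of the small predictor
    I   : length of the input sentence
    """
    # sigmoid(...) lies in (0, 1); scaling by I gives a position in (0, I).
    return I * sigmoid(v_p @ np.tanh(W_p @ s_j))
```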

Anyway, when you have this a_j position, you can see that you need to look only into the window around it and compute the similarities for the attention alphas as usual.

Or you can also use a Gaussian to say that the words in the middle of the window are even more important: you just multiply some Gaussian prior by those alpha weights that we were computing before.
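A sketch of that Gaussian reweighting (the names are my own; setting sigma to half the window width is one common choice, not something fixed by the lecture):

```python
import numpy as np

def gaussian_reweight(alphas, positions, a_j, D):
    """Favour words near the predicted centre a_j of the window.

    alphas    : attention weights inside the window
    positions : encoder positions those weights belong to
    a_j       : predicted centre of the window
    D         : half-width of the window (sigma = D/2 is a common choice)
    """
    sigma = D / 2.0
    prior = np.exp(-((positions - a_j) ** 2) / (2.0 * sigma ** 2))
    # Multiply the Gaussian prior by the alphas computed as before.
    return alphas * prior
```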

Now, I want to show you a comparison of the different methods.

You can see here that we have global attention and local attention, and for local attention we have monotonic and predictive approaches. The last one performs the best.

Do you remember what is inside the brackets here? These are the different ways to compute similarities for the attention weights.

So, you remember dot product and multiplicative attention? And you could also have location-based attention, which is even simpler: it says that we should just take s_j and use it to compute those weights.

This is all for this presentation, and I am looking forward to seeing you in the next one.
