0:02

Hey, in this video I'm going to cover a nice paper about summarization.

This is a very recent paper from Chris Manning's group,

and it is nice because, on the one hand, it shows that

we can use an encoder-decoder architecture and it will work somehow.

On the other hand, we can think a little bit and improve it a lot.

The improvement will be based on pointer networks,

which are also a very useful tool to be aware of.

Also, sometimes we get

rather hand-wavy explanations of architectures with pictures.

Sometimes it is good to go into the details and see the actual formulas.

That's why I want to be very precise in this video,

and by the end of it,

you will be able to understand all the details of the architecture.

So, this is just a recap:

first of all, we usually have some encoder,

for example a bidirectional LSTM, and then we have an attention mechanism,

which means that we produce probabilities that tell us

which are the most important moments in our input sentence.

Now, you see there is an arrow on the right of the slide.

Do you have any idea what this arrow means?

Where does it come from?

Well, the attention mechanism picks out the important moments of

the encoder based on the current moment of the decoder.

So now we definitely have the yellow part, which is the decoder,

and the current state of this decoder tells us how to compute the attention.

Just to have the complete scheme,

we can say that we use this attention mechanism to

generate our distribution over the vocabulary.

Awesome. So, this is just a recap of the encoder-decoder architecture with attention.

Let us see how it works.

We have some sentences,

and we try to get a summary.

The summary would look like this.

First, we see some UNK tokens because the vocabulary is not big enough.

Then we also have some problems in this paragraph that we will try to fix.

One problem is that the model is abstractive,

so the model generates a lot,

but it doesn't know that sometimes

it would be better just to copy something from the input.

The next architecture will tell us how to do that.

Let us have a closer look at the formulas and then see how we can improve the model.

So, first, the attention distribution.

Do you remember the notation?

Do you remember what H is and what S is?

Well, H is the encoder states and S is the decoder states.

We use both of them to compute the attention weights,

and we apply a softmax to get probabilities.

Then we use these probabilities to weigh the encoder states and get v_j.

v_j is the context vector specific to position j of the decoder.
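The attention step just described can be sketched in a few lines of numpy. This is a minimal illustration of additive attention, not the paper's exact implementation; the parameter names W_h, W_s, v, b are placeholders for learned weights.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(h, s_j, W_h, W_s, v, b):
    """Attention over encoder states h given decoder state s_j,
    plus the context vector v_j.
    h: (T, d_h) encoder states; s_j: (d_s,) decoder state;
    W_h, W_s, v, b: learned parameters (placeholders here)."""
    # One score per input position: e_i = v^T tanh(W_h h_i + W_s s_j + b)
    scores = np.tanh(h @ W_h.T + s_j @ W_s.T + b) @ v  # (T,)
    a = softmax(scores)        # attention distribution over positions
    v_j = a @ h                # weighted sum of encoder states, (d_h,)
    return a, v_j
```

The returned `a` sums to one over input positions, and `v_j` is the attention-weighted average of the encoder states.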

Then, how do we use it?

We have seen in some other videos that we can use it

to compute the next state of the decoder.

In this model, we will go a slightly simpler way.

Our decoder will be just a normal RNN,

but we will take the state of this RNN, s_j,

concatenate it with v_j, and use that to produce the probabilities of the outcomes.

So, we just concatenate them, apply some transformations,

and do a softmax to get the probabilities of the words in our vocabulary.
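That generation step can be sketched as follows, again with placeholder parameters (W_out, b_out stand in for the learned transformation):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def vocab_distribution(s_j, v_j, W_out, b_out):
    """Generation distribution over the vocabulary:
    concatenate decoder state s_j with context vector v_j,
    apply a linear map (W_out, b_out: learned placeholders), softmax."""
    features = np.concatenate([s_j, v_j])     # [s_j; v_j]
    return softmax(W_out @ features + b_out)  # (V,) word probabilities
```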

Now, how can we improve our model?

We would want to have some copy distribution.

This distribution should tell us that

sometimes it is nice just to copy something from the input.

How can we do this?

Well, we have the attention distribution, which already

has the probabilities of different moments in the input.

What if we just sum them by words?

For example, say we have seen "as" two times in our input sequence.

Then the probability of "as" should be equal to the sum of those two.

And in this way,

we get a distribution over the words that occurred in our input.
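Summing attention by words is a scatter-add over the input token ids; a minimal sketch:

```python
import numpy as np

def copy_distribution(attention, input_ids, vocab_size):
    """Turn the attention distribution over input *positions* into a
    distribution over *words*: the weights of repeated words are summed.
    attention: (T,) attention weights; input_ids: (T,) word ids."""
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, input_ids, attention)  # unbuffered scatter-add
    return p_copy
```

Note the use of `np.add.at` rather than `p_copy[input_ids] += attention`: the latter would silently keep only the last weight for a repeated word id, while here repeated occurrences are correctly accumulated.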

Now, the final thing to do is to take a mixture of those two distributions.

One is this copy distribution, which tells us that some words from the input are good,

and the other is the generative model that we have discussed before.

So, just a little bit more formulas.

How do we weigh these two distributions?

We weigh them with some probability, p_gen,

which is also a learned function.

Everything shown in green on this slide is parameters.

You just learn these parameters, and you learn to

produce this probability that weighs the two kinds of distributions.

This weighting coefficient depends on everything that you have:

on the context vector v_j,

on the decoder state s_j,

and on the current inputs to the decoder.

So you just apply transformations

to everything you have and then a sigmoid to get a probability.
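Putting the mixture together, here is a sketch; the weight vectors w_v, w_s, w_x and the bias b stand in for the green learned parameters on the slide:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def final_distribution(p_vocab, p_copy, v_j, s_j, x_j, w_v, w_s, w_x, b):
    """Mix the generation and copy distributions.
    p_gen is a sigmoid of a linear function of the context vector v_j,
    the decoder state s_j, and the decoder input x_j;
    w_v, w_s, w_x, b: learned parameters (placeholders here)."""
    p_gen = sigmoid(w_v @ v_j + w_s @ s_j + w_x @ x_j + b)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```

Since p_gen is in (0, 1) and both inputs are valid probability distributions, the mixture is again a valid distribution over the vocabulary.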

The training objective for our model will be,

as usual, the cross-entropy loss with this final distribution.

So, we will try to predict those words that we need to predict.

This is equivalent to likelihood maximization,

and we will need to optimize this objective.

Now, here is the whole architecture once again.

We have the encoder with attention,

we have the yellow decoder,

and then we have two kinds of distributions that we

weigh together to get the final distribution on top.

Let us see how it works.

This is called a pointer-generator model because it has two pieces,

a generative model and a pointer network.

The part about copying

some phrases from the input is the pointer network here.

Now, you see that we are doing better:

the model can learn to extract some pieces from the text,

but there is one drawback here.

You see that the model repeats some sentences or some pieces of sentences.

We need one more trick here,

and the trick is called the coverage mechanism.

Remember, you have the attention probabilities.

You know how much attention you give to every distinct piece of the input.

Now, let us just accumulate them.

So at every step,

we sum all the previous attention distributions into a coverage vector,

and this coverage vector will know that certain pieces

have already been attended many times.

How do you compute the attention then?

Well, to compute the attention,

you also need to take the coverage vector into account.

So the only difference here is that you have one more term,

the coverage vector multiplied by some parameters,

green as usual. And this is not enough.
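The modified attention score can be sketched as below; it is the same additive score as before, with one extra coverage term per position. All parameter names are placeholders for learned weights.

```python
import numpy as np

def attention_scores_with_coverage(h, s_j, c, W_h, W_s, w_c, v, b):
    """Additive attention scores with the extra coverage term:
    e_i = v^T tanh(W_h h_i + W_s s_j + w_c * c_i + b).
    h: (T, d_h) encoder states; s_j: (d_s,) decoder state;
    c: (T,) coverage vector (one accumulated scalar per input position);
    W_h, W_s, w_c, v, b: learned parameters (placeholders here)."""
    # np.outer(c, w_c) gives each position its own coverage contribution
    scores = np.tanh(h @ W_h.T + s_j @ W_s.T + np.outer(c, w_c) + b) @ v
    return scores  # (T,) -- softmax these to get the attention weights
```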

So you also need to put it into the loss.

Apart from the loss that you had before,

you will have one more term for the loss.

It is called the coverage loss, and the idea is to

minimize the minimum of the attention probabilities and the coverage vector.

Take a moment to understand that.

Imagine you want to attend to some moment that has already been attended a lot;

then this minimum will be high, and you will want to minimize it.

That's why you will have to keep the attention probability small at this moment.

Conversely, if you have some moment with a low coverage value,

then it is safe to try a

high attention weight there, because the minimum will still be the low coverage value,

so the loss will not be high.

So this loss motivates you to attend to those places that haven't been attended much yet.
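The coverage accumulation and the coverage loss can be sketched together; this is an illustration of the mechanism, not the paper's training code:

```python
import numpy as np

def coverage_loss(attention, coverage):
    """Coverage loss for one decoder step: sum_i min(a_i, c_i).
    High only where attention lands on already-covered positions."""
    return np.minimum(attention, coverage).sum()

def decode_coverage(attention_per_step):
    """Run over decoder steps, accumulating coverage and totaling
    the coverage loss. attention_per_step: (steps, T)."""
    T = attention_per_step.shape[1]
    coverage = np.zeros(T)
    total = 0.0
    for a in attention_per_step:
        total += coverage_loss(a, coverage)
        coverage += a  # coverage is the running sum of past attention
    return total, coverage
```

Attending twice to the same position is penalized, while spreading attention over fresh positions costs nothing, which is exactly the repetition-avoiding behavior described above.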

Let us see whether the model works well and whether

the coverage trick helps us avoid repetitions.

We can compute the ratio of duplicates in our produced outputs,

and we can also compute the same ratio for human reference summaries.

You can see that it is okay to duplicate unigrams,

but it is not okay to duplicate sentences,

because the green level there is really low; it is zero.

So the model before coverage, the red one,

didn't know that, and it duplicated a lot of 3-grams, 4-grams, and whole sentences.

The blue one doesn't duplicate them,

and this is really nice.
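A statistic like the one on this slide can be computed with a few lines of plain Python. This is an illustrative definition (fraction of n-gram occurrences that repeat an earlier one), not necessarily the exact script used in the paper:

```python
def duplicate_ngram_ratio(tokens, n):
    """Fraction of n-grams in a token list that repeat an n-gram
    seen earlier in the same list."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    seen, dup = set(), 0
    for g in ngrams:
        if g in seen:
            dup += 1  # this occurrence is a repeat
        else:
            seen.add(g)
    return dup / len(ngrams)
```

Applied per summary and averaged over a test set, this gives the red/blue/green bars for each n.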

However, we have another problem here.

The summary becomes really extractive,

which means that we do not generate new sentences;

we just extract them from our input.

Again, we can compare what we have with the reference summaries.

Let us compute the ratio of n-grams that are novel.

You can see that for the reference summaries,

the bars are rather high for all of them.

The model with the coverage mechanism has

significantly lower levels than the model without the coverage mechanism.

So in this case, coverage spoils the model a little bit.

And again, a real example:

this is a summary generated by the pointer-generator network plus coverage,

so let us have a look.

Somebody says he plans to do something.

And here, in the original text,

we see exactly the same sentences, but somehow linked.

We just link them with "he says that" and so on.

Otherwise, it is just an extractive model that extracts these three important sentences.

Now, I want to show you a quantitative comparison of the different approaches.

The ROUGE score is an automatic measure for summarization.

You can think of it as something like BLEU,

but for summarization instead of machine translation.

You can see that pointer-generator networks

perform better than vanilla seq2seq plus attention,

and the coverage mechanism improves the system even more.

However, all those models are not that good if we compare them to some baselines.

One very competitive baseline is just to take the first three sentences of the text.

But it is a very simple, extractive baseline,

so there is no clear way to improve it.

I mean, this is just what you get out of this very straightforward approach.

On the contrary, for the models with attention and coverage,

there are ideas for improving them even more,

so everybody hopes that in the future neural systems will improve,

and it seems likely that in a few years

we will be able to beat those baselines.
