0:02

Word Alignment Models.

This is a super important subtask in machine translation, because different languages have different word order, and we need to learn that from data. So, we need to build some word alignment models, and this video is exactly about them.

Let us get a bit more formal and see what the notation is for every object that we have here. So, e is the sentence e1, e2, and so on, and f is another sentence. The length of the e sentence is I, and the length of the f sentence is J. Now, I need to learn some alignments between them, which I denote by a.

And importantly, you will see that I say that e is the source and f is the target. Why do I say so? Well, usually we talked about a machine translation system from French to English, or from some foreign language to English. Why do I now say that it is vice versa? This is because we applied Bayes' rule. If you remember, we did this to decouple our model into a language model and a translation model. And now, to build the system that translates from f to e, we need to model the probability of f given e.
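Written out, this is the standard noisy-channel decomposition via Bayes' rule (dropping the denominator, which does not depend on e):

```latex
\hat{e} = \arg\max_{e} p(e \mid f)
        = \arg\max_{e} p(f \mid e)\, p(e)
```

Here p(e) is the language model and p(f | e) is the translation model, which is why the translation model ends up conditioned on e even though we translate from f to e.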

Now, what about word alignments? How do we represent them? The matrix of word alignments is one nice way to do this. You have one sentence and another sentence, and you have zeros or ones in the matrix, so you know which words correspond to each other. Now, how many such matrices are there? Well, actually, it is a huge number. Imagine you have two options for every element of the matrix, and the size of the matrix is I multiplied by J, so the number of possible matrices would be two to the power of the size of the matrix, and that's a lot.
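To get a feeling for the numbers, here is a quick sketch (the sentence lengths are made-up values for illustration):

```python
# Number of possible binary alignment matrices for an I x J sentence pair.
# Each of the I * J cells can independently be 0 or 1.
I, J = 10, 12              # hypothetical source and target sentence lengths
num_matrices = 2 ** (I * J)
print(num_matrices)        # 2^120, an astronomically large number
```

Even for modest sentence lengths this is far too many alignments to enumerate, which motivates the constraint introduced next.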

So, let us introduce some constraints, some simplifications, to deal with this. What we do is we say that every target word is allowed to be aligned to only one source word, okay? Like here, so this is a valid example. Now, what would be the notation here? We will have a1, which represents the number of the source word which is aligned to the first target word. So, this is appetite, and this is the second word. Now, what would be a2? So, a2 will be equal to three, because we have comes matched to [inaudible], which is the third word in our source. Now, we can proceed and do the same for a4, a5, a6. That's it. So, this is our notation.
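Under this restriction, the alignment collapses from a matrix to a vector: one source position per target word. A toy sketch (the German-English sentence pair and positions here are hypothetical, not the example from the slide):

```python
# One-directional alignment: a[j] is the 1-based source position aligned
# to target word j+1. Each target word points to exactly one source word,
# so the alignment is a vector of length J, not a full binary matrix.
source = ["ich", "habe", "das", "Buch", "gelesen"]   # hypothetical source
target = ["i", "have", "read", "the", "book"]        # hypothetical target
a = [1, 2, 5, 3, 4]   # e.g. "read" aligns to source word 5, "gelesen"

for j, i in enumerate(a):
    print(target[j], "->", source[i - 1])
```

Note how the reordering ("gelesen" at the end of the source, "read" in the middle of the target) is captured simply by the values in the vector.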

Please keep it in mind so you don't get lost. Now, let us build the probabilistic model. Actually, this and the next slide will be a sketch of the whole process. So, we are going to build the probabilistic model and figure out how we learn it. After that, we will go into deeper details of this probabilistic model. So, stay with me. We have our sentences, e and f; these are our observable variables. We also have word alignments. We do not see them, but we need to model them somehow, so these are hidden variables. And we have the parameters of the model, and choosing them is actually the most creative step. We need to decide somehow how we parameterize our model to get a meaningful generative story. If we have too many parameters, it will probably be difficult to train. If we have too few parameters, it will probably not be general enough to describe all the data. So, this is the moment that we will discuss in more detail later.

But for now, let's just say that we have some probabilistic model of f and a given e and theta. What do we do next? Well, you should know that in all these situations, we do likelihood maximization. So, we take our data, we write down the probability of seeing our data, and we try to maximize it. Now, one complicated thing here is that we do not see everything that we need to model. We can model the probabilities of f and a, but we do not see a. That's why we need to sum over all possible word alignments.
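Written out, this marginalization over the hidden alignments is:

```latex
p(f \mid e, \theta) = \sum_{a} p(f, a \mid e, \theta)
```

so the log-likelihood of the training data contains a logarithm of a sum over all alignments, which is exactly what makes direct maximization hard.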

On the left-hand side, you have the probability of f given all the rest, which is called the incomplete data likelihood. Likelihood maximization for incomplete data means that there are some hidden variables that you do not see. And this is a very bad situation: when you take the logarithm, you get the logarithm of a sum, and you don't know how to maximize it, how to take derivatives, and how to get your maximum likelihood estimates.

But actually, we have already seen this case two times in our course. One was Hidden Markov Models and the other was topic models. In both those cases, we had some hidden variables and this incomplete data, and in both cases we used the EM algorithm. The EM algorithm, just to recap, is an iterative process that has an E-step and an M-step. The E-step produces estimates for your hidden variables: what are the best alignments that we can produce right now, given our current parameters? And the M-step is vice versa: given our best guess for the word alignments, what would be the updates for the parameters that maximize our likelihood? We will not go into the exact formulas of the EM algorithm here. Instead, let us discuss the generative model, because it is a really creative thing.

Well, let us start with generating the length of the sentence. So, J would be the length of the target sentence. Once we have generated this, let us say that we make independent decisions for the target words. So, we have this product over j, which indexes the words in our target sentence. We do not model each word directly yet: first, we model the alignment for every position, and then we model the exact word given that alignment. So, if you are scared by this formula, you can look at just the green parts. This is the most important thing: you model alignments, and you model words given these alignments. All the other things that you see on the right are just everything that we know so far, which we could condition on. And this is too much to condition on, because we would have way too many parameters. So, we need to make some assumptions; we need to say that not all those things are important in these conditions.
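The full chain-rule factorization being described here can be reconstructed as follows (the conditioning sets match the "everything we know so far" description; the two per-word factors are the "green parts"):

```latex
p(f, a \mid e, \theta)
  = p(J \mid e)\,
    \prod_{j=1}^{J}
      p(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)\;
      p(f_j \mid a_1^{j}, f_1^{j-1}, J, e)
```

The first factor inside the product generates the alignment for position j, and the second generates the target word itself given that alignment; everything else in the conditions is what the simplifying assumptions below will throw away.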

The first IBM model is the first attempt to simplify this generative story. What it says is: let us forget about the priors for word alignments and use just a uniform prior. This prior will know nothing about the positions; it will have just one constant to tune. So, this is awesome. Now, the translation table will also be very simple. We will say that the probability to generate a word, given that it is aligned to some source word, is just the translation probability of that word given the source word. So, how does that look? This is the translation table. Once we know that a word is aligned to some source word, we just take the corresponding probability out of the table.
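To make this concrete, here is a minimal sketch of IBM Model 1 trained with EM on a tiny made-up parallel corpus (the word pairs are hypothetical, just for illustration): the E-step computes expected alignment counts under the current translation table, and the M-step renormalizes them.

```python
from collections import defaultdict

# Toy parallel corpus: (source sentence, target sentence) pairs.
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"],  ["le", "livre"]),
    (["a", "book"],    ["un", "livre"]),
]

src_vocab = {w for e, _ in corpus for w in e}
tgt_vocab = {w for _, f in corpus for w in f}

# Initialize the translation table t(f|e) uniformly.
t = {f: {e: 1.0 / len(tgt_vocab) for e in src_vocab} for f in tgt_vocab}

for _ in range(10):                       # EM iterations
    count = defaultdict(float)            # expected counts c(e, f)
    total = defaultdict(float)            # expected counts c(e)
    # E-step: soft alignment counts under the current t(f|e).
    for e_sent, f_sent in corpus:
        for f_word in f_sent:
            z = sum(t[f_word][e_word] for e_word in e_sent)
            for e_word in e_sent:
                delta = t[f_word][e_word] / z
                count[(e_word, f_word)] += delta
                total[e_word] += delta
    # M-step: re-estimate translation probabilities from the counts.
    for (e_word, f_word), c in count.items():
        t[f_word][e_word] = c / total[e_word]

best = max(tgt_vocab, key=lambda f: t[f]["book"])
print(best)   # -> "livre"
```

Even on three sentence pairs, EM works out that "book" should translate to "livre", because "the" and "a" are better explained by "le" and "un".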

So, this is a very simple model, but it has a very big drawback: it does not take into account the position of a word when producing its alignment. The second IBM model tries to do better, and it says: okay, let us take j, the position of the target word, and use it to produce a_j, the alignment for this target word. Now, how many parameters do we have to learn to build this position-based prior? Well, actually, a lot of parameters. You have I multiplied by J, which is the size of the matrix of probabilities, and that is not all. Apart from this matrix, you will also have different matrices for different I and J. So, you cannot just use one and the same matrix for all kinds of sentences. You share this matrix across all sentences with the given lengths, but for sentences with different lengths, you have a different matrix.
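A rough back-of-the-envelope count shows how quickly this blows up (the maximum lengths are hypothetical):

```python
# IBM Model 2 keeps a separate I x J table of alignment probabilities
# for every (I, J) length pair. With maximum sentence lengths I_max and
# J_max, the total number of alignment parameters is the sum over all
# length pairs of I * J.
I_max, J_max = 50, 50   # hypothetical length limits
num_params = sum(I * J for I in range(1, I_max + 1)
                       for J in range(1, J_max + 1))
print(num_params)       # 1275 * 1275 = 1625625
```

Over one and a half million alignment parameters just for sentences up to length 50, before any translation probabilities are counted.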

So, this is a lot of parameters, and to improve on that, we can say: well, the matrix is usually very close to diagonal, so what if we model it as a diagonal matrix? This is what Chris Dyer proposed in 2013. This model has only one parameter, which says how the probability mass is spread around the diagonal. And this is nice, because it still has some prior about positions, but it does not have too many parameters.
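A sketch of such a diagonal-favouring prior, in the spirit of Dyer et al.'s fast_align parameterization (the exact functional form below is my reconstruction; the single parameter lam plays the role of their tuned constant):

```python
import math

def diagonal_prior(j, I, J, lam=4.0):
    """p(a_j = i | j, I, J), concentrated where i/I is close to j/J.
    One parameter, lam, controls how sharply the probability mass
    peaks around the diagonal of the alignment matrix."""
    scores = [math.exp(-lam * abs(i / I - j / J)) for i in range(1, I + 1)]
    z = sum(scores)
    return [s / z for s in scores]

# For target position 2 of 4, with a 4-word source sentence, the mass
# peaks at source position 2, i.e. exactly on the diagonal.
p = diagonal_prior(2, I=4, J=4)
print(max(range(1, 5), key=lambda i: p[i - 1]))   # -> 2
```

Note that a larger lam makes the prior sharper around the diagonal, while lam = 0 recovers the uniform prior of IBM Model 1.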

Now, I have one last example of an alignment model for you. Actually, you already know how to build this, you just don't remember it. We had Hidden Markov Models in our course, and Hidden Markov Models can help us build transition probabilities. Why do we need that here? Well, imagine a pair of sentences where a phrase at the beginning of one sentence is aligned to a phrase at the end of the other. But inside such a phrase, you often just go word by word, without any gaps. And this is nice: it means you can learn this, and use the information that the previous word was aligned to position five, which may mean that the next word should be aligned to position six. This is what a Hidden Markov Model can do for you: you model the probability of the next alignment given the previous alignment.
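A sketch of such a jump-based transition distribution (the jump weights below are made-up numbers for illustration; in a real HMM alignment model they are learned with EM):

```python
# HMM alignment model: the transition p(a_j = i | a_{j-1} = i') depends
# only on the jump distance i - i'. Hypothetical weights favouring a
# jump of +1, i.e. moving one source word forward.
jump_weight = {-1: 0.5, 0: 1.0, 1: 4.0, 2: 1.0, 3: 0.5}

def transition(i, i_prev, I):
    """p(a_j = i | a_{j-1} = i_prev) for a source sentence of length I."""
    weights = [jump_weight.get(k - i_prev, 0.05) for k in range(1, I + 1)]
    return jump_weight.get(i - i_prev, 0.05) / sum(weights)

# If the previous target word was aligned to source position 5, the most
# likely next alignment is position 6 (a jump of +1).
probs = [transition(i, 5, I=10) for i in range(1, 11)]
print(max(range(1, 11), key=lambda i: probs[i - 1]))   # -> 6
```

Because the weights depend only on the jump distance, the model captures the "go word by word" tendency without needing a separate table for every sentence length.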

So now, let us sum up what we had in this video. We have covered IBM models, which are a nice word-based technique for building machine translation systems. Actually, there are lots of problems with them that we did not cover. There are IBM Models three, four, and five, which try to cope with, for example, the problem of fertility, saying that we need to explicitly model how many output words are produced by each source word, or that we need to explicitly deal with spurious words. These are the words that just appear from nowhere in the target language.
