In this video,

we'll finally see Latent Dirichlet Allocation.

Let me remind you what documents and topics are.

So a document is a distribution over topics.

For example, we can assign a document a distribution like this:

So 80% cats and 20% dogs.

And a topic is a distribution over words.

For example, the topic cats would give the word cat 40% probability and

the word meow 30%, and the other words, like dog and

woof, will have really low probability.

The topic about dogs would give high probability to the words dog and

woof, and low probability to the others.

Let's see how we can generate, for example, the sentence "the cat meowed on the dog".

The first word, cat, is taken from the topic cats.

And with 40% probability we could sample the word cat.

The second word, meow, is also from the topic cats.

And it's sampled with 30% probability from the topic cats.

And finally, the word dog is from the topic dogs, and

with 40% probability we could sample it.
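
The sampling of those three words can be sketched in a few lines. This is only an illustration: the numbers 40% for cat, 30% for meow and 40% for dog come from the example above, while the vocabulary and every other probability are assumptions chosen so that each topic sums to one.

```python
import random

# Each topic is a distribution over words.
# cat=0.4, meow=0.3 (topic cats) and dog=0.4 (topic dogs) are from the lecture;
# all remaining values are assumed for illustration.
topics = {
    "cats": {"cat": 0.4, "meow": 0.3, "dog": 0.05, "woof": 0.05, "the": 0.1, "on": 0.1},
    "dogs": {"dog": 0.4, "woof": 0.3, "cat": 0.05, "meow": 0.05, "the": 0.1, "on": 0.1},
}

def sample_word(topic, rng):
    """Draw one word from the word distribution of the given topic."""
    words = list(topics[topic])
    weights = [topics[topic][w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)
# Topics behind the words "cat", "meowed", "dog" in the example sentence.
topic_per_word = ["cats", "cats", "dogs"]
sampled = [sample_word(t, rng) for t in topic_per_word]
print(sampled)
```

The draws are random, so the output will not necessarily be cat, meow, dog; those are just the most likely words under each topic.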

So here's our model.

We have a distribution over topics for document number d,

we will call it theta d.

Then, for each word in the document,

we assign a topic.

For example, z_d1 would correspond to the topic of the first word in document d.

And finally, for example, the latent variable z_dn

would correspond to the topic of the nth word in document d.

Each latent variable can take the values from 1 to T,

where T is the number of topics that we will try to find in our corpus.

A corpus is a collection of documents.

So now, from the corresponding topics, we can sample the words.

So we'll sample the word, for

example, w_d1 from the topic z_d1.

And the words can take values from 1 to V, where V is the size of the vocabulary.
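
As a minimal sketch of this generative step for one document: first draw a topic for each word from theta_d, then draw each word from its topic. The 80%/20% topic split is the example from earlier; the sizes T, V, N and the word probabilities in phi are assumed values, and indices here start at 0 rather than 1.

```python
import random

rng = random.Random(0)

T, V, N = 2, 4, 5            # topics, vocabulary size, words in the document (assumed)
theta_d = [0.8, 0.2]         # topic distribution of document d (80% cats, 20% dogs)

# phi[t][v] = probability of word v under topic t (assumed values, rows sum to 1)
phi = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]

# z_d[n] is the topic of the nth word, drawn from theta_d.
z_d = rng.choices(range(T), weights=theta_d, k=N)

# w_d[n] is the nth word, drawn from the word distribution of topic z_d[n].
w_d = [rng.choices(range(V), weights=phi[z], k=1)[0] for z in z_d]

print("topics:", z_d)
print("words: ", w_d)
```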

So what I've drawn now is actually a Bayesian network.

We can draw it using plate notation as follows.

So here's our Bayesian network in plate notation.

We have theta.

Those are the topic probabilities for a document, and

we repeat it D times, that is, once for each document.

Theta influences z, the topics of the words, and

finally from the topics we generate the words.

And we repeat it N times, of course, once for each word.

The joint probability of w, z, and theta is written below.

Let's try to interpret each component of it.

The first says that for each document,

we generate topic probabilities from the distribution p(theta_d).

Then for each word in this document we select

a topic with probability p(z_dn | theta_d).

And finally, when we have a topic, we sample a word from

this topic; this is the probability of the word w_dn

given z_dn, p(w_dn | z_dn). And so here's our final model.
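
The formula shown on the slide is not part of this transcript. Putting the three components just described together, it can be reconstructed as follows, where D (the number of documents) and N_d (the number of words in document d) are symbols assumed here:

```latex
p(W, Z, \Theta) \;=\; \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn})
```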

So now we need to define these three probabilities.

The probability of theta, of z given theta, and of w given z.

The probability of theta is modeled as a Dirichlet distribution with some

parameter alpha. This is actually a natural choice, since the components

of theta should sum up to one, and we need some distribution [INAUDIBLE].

And so far we've only seen the Dirichlet distribution.
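
To see why this kind of prior fits, here is a small sketch that draws theta_d from a Dirichlet distribution and checks that the components are non-negative and sum to one. It uses the standard construction of normalizing independent Gamma draws; the value of alpha is an assumption, since no particular alpha is given here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha).

    Normalizing independent Gamma(alpha_t, 1) draws gives a Dirichlet sample.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
alpha = [1.0, 1.0, 1.0]      # assumed parameter, one entry per topic
theta_d = sample_dirichlet(alpha, rng)

print(theta_d)
print(sum(theta_d))          # equals 1 up to floating-point rounding
```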

The probability of a topic given theta would actually

be equal to the corresponding component of the vector theta,

that is, the component of theta_d with index z_dn.

So, this notation is a bit complex, but actually it is quite logical.

So we just take the component of the vector theta_d

corresponding to the current topic.
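
Written out, with t denoting a topic index between 1 and T (the symbol t is introduced here for illustration), this choice reads:

```latex
p(z_{dn} = t \mid \theta_d) \;=\; \theta_{dt}, \qquad t = 1, \dots, T
```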

All right, and finally we need to select the words.

And to select the words,

we need to know the probabilities of the words in the corresponding topic.

That is, we should somehow find the topics.

We will store those probabilities in the matrix Phi, and

the corresponding probability can be found in row number z_dn and

column number w_dn.

And so actually our goal would be to find this matrix.

We have a few constraints on it: first of all, its entries should be non-negative,

since we're modeling probabilities, and also each row should sum up to one.
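
A short sketch of the lookup and the two constraints, assuming rows of the matrix are topics and columns are words as described above. The matrix values and the helper names are hypothetical, and indices start at 0 here.

```python
import math

# phi[t][v] = probability of word v under topic t (assumed example values)
phi = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]

def is_valid_topic_matrix(phi, tol=1e-9):
    """Check that all entries are non-negative and every row sums to one."""
    return all(
        all(p >= 0 for p in row) and math.isclose(sum(row), 1.0, abs_tol=tol)
        for row in phi
    )

def word_prob(phi, z_dn, w_dn):
    """p(w_dn | z_dn): the entry in row z_dn and column w_dn."""
    return phi[z_dn][w_dn]

print(is_valid_topic_matrix(phi))
print(word_prob(phi, 0, 1))
```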

All right, so here are four variables.

We have the data, which is known,

we have the matrix Phi, which is unknown and which we want to find.

And also we have the latent variables z and theta.

We'll also try to find a distribution over them.
