In this video we are going to talk about one of

the classification algorithms called the Naïve Bayes classifier.

Let's take a case to set up the scenario.

This is the task of classifying text search queries.

So suppose you are interested in classifying search queries and you have three classes.

The Entertainment class, the Computer Science class,

and the Zoology class.

The three subjects. And then you

know that most of these search queries are about Entertainment.

So that's what you know coming in.

That if you don't know anything else,

the chances that it's an entertainment related query is pretty high.

Then you get the query.

And the query is "Python".

And the task is still the same.

The task is, can you classify Python as entertainment,

computer science, or zoology?

Is it Python the snake?

If that is the case then it is a zoology query.

But what if it is actually Python the programming

language, as in "applied data mining in Python", right?

Then it belongs to computer science.

But if it is Python as in Monty Python then it is entertainment.

So just the word "Python" could still mean one of these three.

But then you think that for the word "Python" by itself,

the most common class is zoology.

So even though generally the queries are entertainment queries,

when you see the word "Python" and if that is your query then

the chances that it is actually zoology becomes higher.

Then you have another query, and this time it is "Python download".

And then you can say that the most probable class is in fact computer science.

So what just happened there?

You had a model,

in this case a probabilistic model that tells

you the likelihood of a class before you have any information.

And then when you are given new information,

you updated that likelihood of the class.

So you had a model that said "Entertainment is typically what any standard,

typical query would be".

But then given the new information about the word "Python" you said "Oh,

the likelihood of it being zoology is higher".

And it's not likely to be entertainment anymore.

Even though there are cases of Python,

as in Monty Python, where it would be entertainment.

So this change is what is at the crux of the Naïve Bayes classifier.

You have something called prior probability that is the prior knowledge or

the prior belief that the label belongs to entertainment,

or the label is computer science,

or the label is zoology.

This is one of the three options you have.

So you have a prior probability for each of them.

And by basic probability you know that these are the only three options.

So the sum of these three probabilities should be one.

It's definitely one of the three.

It's either entertainment, or computer science, or zoology.

But among the three entertainment is more likely.

And then when you have new information, like the input x being "Python",

your probabilities and likelihoods change.

Your probability of entertainment given the input is Python is suddenly lower.

And the probability of y being computer science given x is Python is higher.

And the probability of y being zoology given x is Python is even higher.

That's why you would say that "Given the input Python,

the label is zoology".

So this is encapsulated in what is called Bayes' Rule or Bayes'

theorem, and that says that "The posterior probability depends on your prior.

But then also depends on the likelihood of that happening or that event happening".

And divided by the evidence.

So in mathematical terms it comes to: probability of y given

X is probability of y, which is the prior probability,

times probability of X given y, which is the likelihood of having the data as X given y.
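The rule the narration spells out can be written compactly, with y the class label and X the query:

```latex
P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)}
```

Here P(y) is the prior, P(X | y) is the likelihood, and P(X) in the denominator is the evidence.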

That means, if you know that the label is zoology,

the probability of seeing "Python" in zoology documents is so and so.

Or if you know that the label is entertainment,

the probability of seeing "Python" in entertainment documents is so and so, and that is lower.

It's significantly lower than it would be if the class were zoology.

So the Naïve Bayes classifier looks at this computation,

looks at what is the probability of

a class like computer science given the input as Python.

And it computes this by asking: what is the chance that it is computer science in general?

What is the likelihood of it being computer science, that is, the prior probability?

And then what is the likelihood of seeing the word

Python in documents that are computer science.

The same thing with zoology.

So the probability of the class being zoology given Python is,

what is the prior probability of the class being zoology without knowing any information?

And then given that it is the zoology class document,

what is the chance that you will see "Python"?

That is, what is the likelihood of seeing "Python"?

And then you will say that if the probability of y being computer science

given "Python" is higher than the probability of y being zoology,

the label being zoology given "Python", then I'm going to call the label computer science.

So in general, you're saying that probability of y given X is computed this way.

But then the Naïve Bayes classification model is

just interested in which of the three labels is more likely.

So you know that y belongs to one of the three classes.

Entertainment, computer science, or zoology.

It's only important to know which among those three is higher.

And so, the true label,

or the predicted label I should say,

y* is the y that maximizes the probability of y given X.

And in that computation, it does not matter what the probability of X itself is.

That is, the probability of seeing a query like "Python".

You can remove that term because it does not matter;

it doesn't change with the label assigned to the query.
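Since the probability of X is the same for every candidate label, dropping it does not change which label wins the comparison:

```latex
y^* = \operatorname*{arg\,max}_{y \in Y} P(y \mid X)
    = \operatorname*{arg\,max}_{y \in Y} P(y)\, P(X \mid y)
```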

This in addition to what is called the Naïve assumption

of Bayesian classification forms the Naïve Bayes classifier.

And the Naïve assumption is that given the class label,

the features themselves are independent of each other.

That is, given the label y,

the probability of capital X given y is just the product of the individual feature probabilities,

probability of x_i given y, multiplied over all of those.

That's a product that goes from the first feature to the last feature.
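With n features, the naïve independence assumption can be written as:

```latex
P(X \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
```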

This is the final formulation of a Naïve Bayes classifier.

So the formula stands like this.

The predicted label y is the y that maximizes,

the argument that maximizes this computation of probability of y given X.

Which is computed using Bayes Rule as probability of y, that is the prior,

times the independent product of the individual feature probabilities given y.

So that's the likelihood, the probability of capital X given y.
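Putting the argmax and the independence assumption together, the final formulation reads:

```latex
y^* = \operatorname*{arg\,max}_{y \in Y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```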

So for example, if the query is Python download,

you're going to say: the predicted y is the y that maximizes the probability of y,

times the probability of Python given y,

times the probability of download given y.

So for example, if it is zoology,

you know that probability of zoology is low.

People don't typically ask zoology queries.

But then probability of Python given zoology is very high.

However, the probability of download given zoology is very low.

However, in the case of computer science,

probability of computer science queries in general is somewhere in the middle.

The probability of Python given computer science is not as high, but it is still significant.

Whereas probability of download given computer science is very significant.

And the product of all three makes computer science

the best predicted label.
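The arithmetic in this "Python download" example can be sketched in a few lines of Python. All the numbers below are hypothetical, chosen only so that "Python" alone favors zoology while "Python download" favors computer science:

```python
# Hypothetical parameters for illustration only; a real model
# would estimate these from labeled training queries.
priors = {"entertainment": 0.70, "computer science": 0.20, "zoology": 0.10}

# P(word | class): likelihood of each query word under each class.
likelihoods = {
    "entertainment":    {"python": 0.02, "download": 0.10},
    "computer science": {"python": 0.20, "download": 0.40},
    "zoology":          {"python": 0.50, "download": 0.01},
}

def naive_bayes_score(words, label):
    """Unnormalized posterior: P(y) times the product of P(x_i | y)."""
    score = priors[label]
    for w in words:
        score *= likelihoods[label][w]
    return score

def classify(words):
    """Return the label with the highest score (the argmax over y)."""
    return max(priors, key=lambda y: naive_bayes_score(words, y))

print(classify(["python"]))              # zoology wins on "Python" alone
print(classify(["python", "download"]))  # "download" tips it to computer science
```

One practical note: real implementations usually sum log-probabilities instead of multiplying raw probabilities, because a long product of small numbers underflows floating-point arithmetic.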

So in Naïve Bayes we saw the model is

just individual probabilities that you multiply together.

So what are the parameters there?

We have the prior probabilities.

The prior probabilities are

these probabilities for each of the classes in your set of classes.

So that's probability of y for all y in capital Y.

And then you have the likelihoods.

That is the probability of seeing a particular feature in documents of class y.

The probability of x_i given y,

which is needed for all features x_i and all labels y in Y.

So it's a combination of every feature in each of these classes that you have.
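A minimal sketch of how these two sets of parameters, the priors and the likelihoods, could be estimated by simple counting. The four training queries below are made up purely for illustration, and a real implementation would also add smoothing for unseen words:

```python
from collections import Counter, defaultdict

# Made-up labeled queries, for illustration only.
training = [
    (["monty", "python"], "entertainment"),
    (["python", "download"], "computer science"),
    (["python", "snake"], "zoology"),
    (["movie", "trailer"], "entertainment"),
]

# Priors: P(y) for every label y, estimated by counting labels.
label_counts = Counter(label for _, label in training)
total = sum(label_counts.values())
priors = {y: n / total for y, n in label_counts.items()}

# Likelihoods: P(x_i | y) for every feature x_i and every label y,
# estimated as the word's share of all word occurrences in class y.
word_counts = defaultdict(Counter)
for words, label in training:
    word_counts[label].update(words)
likelihoods = {
    y: {w: c / sum(counts.values()) for w, c in counts.items()}
    for y, counts in word_counts.items()
}

print(priors["entertainment"])           # 2 of 4 queries -> 0.5
print(likelihoods["zoology"]["python"])  # 1 of 2 zoology words -> 0.5
```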

So let's take an exercise.

If you have three classes, that is, the size of capital Y is three.

And you have a hundred features in your X,

that means X goes from X1,

X2 up to X100.

Can you compute how many parameters there are in the Naïve Bayes model? Give it a try.