
In this video we are going to talk about one of the classification algorithms, called the Naïve Bayes classifier.

Let's take a case to set up the scenario. This is the task of classifying text search queries. So suppose you are interested in classifying search queries and you have three classes: the Entertainment class, the Computer Science class, and the Zoology class. Three subjects. And you know that most of these search queries are about Entertainment. That's what you know coming in: if you don't know anything else, the chance that it's an entertainment-related query is pretty high.

Then you get the query, and the query is "Python". The task is still the same: can you classify "Python" as entertainment, computer science, or zoology? Is it Python the snake? If that is the case, then it is a zoology query. But what if it is actually Python the programming language, as in "apply data mining in Python"? Then it belongs to computer science. And if it is Python as in Monty Python, then it is entertainment. So just the word "Python" could still mean any one of these three. But you would think that for "Python" by itself, the most common class is zoology. So even though queries in general are mostly entertainment queries, when you see the word "Python", and that is your whole query, the chance that it is actually zoology becomes higher.

Then you have another query, and this time it is "Python download". Now you can say that the most probable class is in fact computer science.

So what just happened there?

You had a model, in this case a probabilistic model, that tells you the likelihood of each class before you have any information. And then, when you are given new information, you updated that likelihood. So you had a model that said "Entertainment is typically what any standard, typical query would be". But then, given the new information about the word "Python", you said "Oh, the likelihood of it being zoology is higher", and it's not likely to be entertainment anymore, even though there are cases of Python, as in Monty Python, where it would be entertainment.

This change is at the crux of the Naïve Bayes classifier. You have something called the prior probability: the prior knowledge, or prior belief, that the label is entertainment, or the label is computer science, or the label is zoology. It is one of those three options, so you have a prior probability for each of them. And by basic probability you know that, since these are the only three options, the sum of the three probabilities should be one. It's definitely one of the three: either entertainment, or computer science, or zoology. But among the three, entertainment is the most likely.

Then, when you have new information, such as the input x being "Python", your probabilities change. The probability of entertainment given that the input is "Python" is suddenly lower; the probability of y being computer science given that x is "Python" is higher; and the probability of y being zoology given "Python" is even higher. That's why you would say, "Given the input Python, the label is zoology".

This is encapsulated in what is called Bayes' rule, or Bayes' theorem, which says that the posterior probability depends on your prior, but also on the likelihood of that event happening, divided by the evidence. In mathematical terms: the probability of y given X equals the probability of y, which is the prior, times the probability of X given y, which is the likelihood of seeing the data X given the label y, divided by the probability of X, the evidence. That means, if you know that the label is zoology, the probability of seeing "Python" in zoology documents is so and so. Or if you know that the label is entertainment, the probability of seeing "Python" in entertainment documents is so and so, and that is significantly lower than if the class was zoology.
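This computation can be sketched in a few lines of Python. Note that every probability value below is a made-up, illustrative number, not an estimate from real query data:

```python
# Hypothetical prior probabilities P(y): entertainment dominates overall.
priors = {"entertainment": 0.70, "computer science": 0.20, "zoology": 0.10}

# Hypothetical likelihoods P(x="python" | y): "python" is rare in
# entertainment queries but common in zoology queries.
likelihoods = {"entertainment": 0.001, "computer science": 0.01, "zoology": 0.05}

# Unnormalized posteriors: P(y | x) is proportional to P(y) * P(x | y).
scores = {y: priors[y] * likelihoods[y] for y in priors}

# Divide by the evidence P(x) = sum over y of P(y) * P(x | y).
evidence = sum(scores.values())
posteriors = {y: s / evidence for y, s in scores.items()}

best = max(posteriors, key=posteriors.get)  # zoology, for these numbers
```

Even though the entertainment prior is by far the largest, the much higher likelihood of "python" under zoology flips the posterior, just as described above.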

So the Naïve Bayes classifier looks at this computation. It looks at the probability of a class, like computer science, given the input "Python". And it computes it by asking: what is the chance that a query is about computer science in general, that is, the prior probability of computer science? And then, what is the likelihood of seeing the word "Python" in documents that are computer science? The same thing with zoology: the probability of the class being zoology given "Python" is the prior probability of the class being zoology, without knowing any other information, times the likelihood of seeing "Python" given that it is a zoology document. And then you would say: if the probability of y being computer science given "Python" is higher than the probability of y being zoology given "Python", then I'm going to call the label computer science.

So in general, you're saying that the probability of y given X is computed this way. But the Naïve Bayes classification model is only interested in which of the three labels is more likely. You know that y belongs to one of the three classes: entertainment, computer science, or zoology. It's only important to know which among those three is the highest. And so the true label, or the predicted label I should say, y*, is the y that maximizes the probability of y given X. And in that computation, it does not matter what the probability of X itself is, that is, the probability of seeing a query like "Python". You can remove it, because it does not change with the label assigned to the query.

This, in addition to what is called the naive assumption of Bayesian classification, forms the Naïve Bayes classifier. The naive assumption is that, given the class label, the features themselves are independent of each other. That is, given that the label is y, the probability of capital X given y is just the product of the individual feature probabilities: the product of the probability of x_i given y, taken over all features, from the first feature to the last.

This is the final formulation of a Naïve Bayes classifier, so the formula stands like this. The predicted label y* is the argument that maximizes this computation of the probability of y given X, which is computed using Bayes' rule as the probability of y, that is, the prior, times the product of the individual feature probabilities given y. That product is the likelihood, the probability of capital X given y.

So for example, if the query is "Python download", you're going to say: the predicted y is the y that maximizes the probability of y, times the probability of "Python" given y, times the probability of "download" given y. If y is zoology, you know that the probability of zoology is low; people don't typically ask zoology queries. But then the probability of "Python" given zoology is very high. However, the probability of "download" given zoology is very low. In the case of computer science, the probability of computer science queries in general is somewhere in the middle; the probability of "Python" given computer science is not as high, but still significant; and the probability of "download" given computer science is very significant. The product of all three makes computer science the best predicted label.
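This prediction rule can be sketched as follows; again, every probability is a hypothetical illustrative value chosen to mirror the story above, not learned from data:

```python
# Hypothetical priors and per-word likelihoods P(word | class).
priors = {"entertainment": 0.70, "computer science": 0.20, "zoology": 0.10}
likelihoods = {
    "entertainment":    {"python": 0.001, "download": 0.01},
    "computer science": {"python": 0.01,  "download": 0.05},
    "zoology":          {"python": 0.05,  "download": 0.0001},
}

def predict(words):
    """Return the class y maximizing P(y) * product of P(word | y),
    using the naive independence assumption between words."""
    scores = {}
    for y in priors:
        score = priors[y]
        for w in words:
            score *= likelihoods[y][w]
        scores[y] = score
    return max(scores, key=scores.get)

label = predict(["python", "download"])
```

With these numbers, "python" alone would be classified as zoology, but adding "download" tips the product toward computer science, exactly the behavior the example walks through.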

So in Naïve Bayes, we saw that the model is just individual probabilities that you multiply together. So what are the parameters there? You have the prior probabilities: one probability for each of the classes in your set of classes, that is, the probability of y for all y in capital Y. And then you have the likelihoods: the probability of seeing a particular feature in documents of class y. The probability of x_i given y is needed for all features x_i and all labels y in Y, so it's a combination of every feature with each of the classes you have.

So let's take an exercise. Suppose you have three classes, that is, the size of capital Y is three, and you have a hundred features in your X, meaning X goes from x_1, x_2, up to x_100. Can you compute how many parameters there are in the Naïve Bayes model? Give it a try.
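Once you have your own answer, you can check it against this counting sketch, which just applies the two parameter groups described above:

```python
# Parameter count for a Naive Bayes model: one prior per class,
# plus one likelihood P(x_i | y) per feature-class pair.
num_classes = 3     # |Y|
num_features = 100  # x_1 through x_100

num_priors = num_classes
num_likelihoods = num_classes * num_features
total = num_priors + num_likelihoods
```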


Now, once you know what the parameters are that you learn in a Naïve Bayes model, let's see how you actually learn them. You have the prior probabilities, the probability of y for all y in the set of labels. How do you identify those? How do you know that the most common class of search queries is entertainment? Well, you have the training data: a set of documents, all the queries, each labeled with the class it belongs to. So you have a query such as "Ashton Kutcher", and that query would be entertainment; or "Lady Gaga", which is also entertainment; or "tiger", in the context of zoology as an animal, and that would be zoology; or something like "C++", and that is computer science; and so on. So you can just count the number of instances you have in your training data in each of these classes. In the example I gave, there were two examples of entertainment and one each of computer science and zoology. So your class distribution, with just those four instances, would be one half for entertainment, one quarter for zoology, and one quarter for computer science. In general, if there are capital N instances in total, and small n of those belong to a particular class y, then your probability of class y is n over N.
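Estimating the priors by counting can be sketched directly on the four-example dataset from the transcript:

```python
# Estimate P(y) = n / N by counting labels in the training data.
from collections import Counter

training_data = [
    ("Ashton Kutcher", "entertainment"),
    ("Lady Gaga",      "entertainment"),
    ("tiger",          "zoology"),
    ("C++",            "computer science"),
]

label_counts = Counter(label for _, label in training_data)
total = len(training_data)  # N
priors = {y: count / total for y, count in label_counts.items()}
```

With these four instances, the estimated priors come out to one half for entertainment and one quarter each for zoology and computer science, matching the counts above.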

The next set of parameters you have is the likelihoods. Recall that the likelihood is the probability of x_i given y, for all features x_i and for all labels y, so it's a combination across many, many features. How do you count those? The way you do that is to count the number of times feature x_i appears in instances labeled as class y. So you focus only on the instances that were labeled class y, and among those you see how many times the feature x_i occurs. For example, you would look only at documents that belong to the computer science class, and among those count how many times "Python" appears. Or you would look at all the queries that were labeled entertainment, look for a word like "actor", and see how many times the word "actor" occurs in those queries. That would give you the probability of the word "actor" given the class entertainment. So if there are p instances of class y, and the feature x_i appears in k of them, then the probability of x_i given y would be k / p. When you do this counting, there is a slight problem.
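Before getting to that problem, here is a minimal sketch of the k / p counting just described, on a tiny made-up dataset:

```python
# Estimate P(x_i | y) = k / p: among the p training queries labeled y,
# count the k that contain feature x_i. The dataset is illustrative only.

training_data = [
    ("python download",       "computer science"),
    ("learn c++",             "computer science"),
    ("python snake habitat",  "zoology"),
    ("monty python sketches", "entertainment"),
]

def likelihood(feature, label):
    """P(feature | label) = k / p, counted from the training data."""
    labeled = [text for text, y in training_data if y == label]
    p = len(labeled)
    k = sum(1 for text in labeled if feature in text.split())
    return k / p

prob = likelihood("python", "computer science")  # 1 of 2 CS queries -> 0.5
```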

And that problem is: what happens if a probability becomes zero? For example, suppose you have never seen the word "actor" in queries that were labeled entertainment. Or, let's say, you never saw the word "python" in the queries that were labeled entertainment, because for some reason in your dataset you never saw Monty Python. Then the probability, the way you compute it, would be zero. But that is a problem, because if one probability is zero and it is multiplied with all the other probabilities, the overall probability will end up being zero. So when a feature x_i never occurs in documents labeled y, you are saying it can never be the case that, when you see the feature x_i, the label is y. You don't want to be that strict. You don't want the posterior probability of y given x_i to be zero, because that would make any query containing "python" never be labeled entertainment, right?

So what you want to do instead is to smooth the parameters. One option is to just add a dummy count. You say: it's never the case that a count is zero, because I add a count of one to every word in every class. This does not change the probabilities significantly. The way we computed these probabilities is called "maximum likelihood estimation"; such estimates are typically good when the counts are large, but not so good when the counts are very small. So by adding a count of one, you don't really change the larger counts significantly. What you do now is say: the probability of x_i given y, instead of k / p, is going to be k + 1 over p + n, because in all I have added n dummy words, one for each of the n features. This way you will never have a probability of zero, because even if k is zero, the probability would still be one over p + n. That is how we have avoided this problem of zero counts.
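This add-one smoothing can be sketched as a one-line change to the estimator; the counts used below are made-up examples:

```python
# Add-one (Laplace) smoothing: instead of k / p, use (k + 1) / (p + n),
# where n is the total number of features (one dummy count per feature).

def smoothed_likelihood(k, p, n):
    """k: times the feature appears in class-y instances,
    p: number of class-y instances, n: number of features."""
    return (k + 1) / (p + n)

# A feature never seen in a class no longer gets probability zero:
zero_case = smoothed_likelihood(k=0, p=50, n=100)   # 1 / 150, not 0
seen_case = smoothed_likelihood(k=10, p=50, n=100)  # 11 / 150
```

Note how the smoothed value for a frequent feature (11/150) stays close to its unsmoothed estimate (10/50 is larger, but the ordering among features is preserved), while the zero-count case moves from 0 to a small positive value.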

Now, you might wonder whether you would want to do something similar for the prior probabilities. Think about it; we'll talk about it in one of the in-video questions soon.

So, to bring it all together. The big take-home messages from this video are that Naïve Bayes is a probabilistic model, and it is called naive because it assumes that features are independent of each other given the class label. It is naive because that assumption is actually not necessarily true, even for text. For example, you have "White House" as two words that together have a significantly different meaning. If you see the word "White" with a capital W, the chance that it is followed by "House" is very high, so the two features "White" and "House" are not really independent of each other. But Naïve Bayes models ignore that. They say: given the class label, I'm assuming all features are independent, because then the math of how you compute the probabilities becomes easy.

Even so, for text classification problems, Naïve Bayes models actually typically provide very strong baselines. Traditionally, it is the first model you should try, because it gives you a benchmark, a baseline, that is pretty strong, and then you can look at how well the other models you will see soon improve over Naïve Bayes. It's also a very simple model to learn: the parameters are very easy to understand and to learn, because they are just counts, counted in different ways.

So the big message is: Naïve Bayes is a very important classifier for text classification, and it should be the first one you try.
