0:07

Okay. So now that we have a theoretical understanding of text classification, let's see how to build one in Python.

There are quite a few toolkits available for supervised text classification. Scikit-learn is one of them. For those of you who have gone through course three of the specialization, you have seen scikit-learn before. The other toolkit is one we have seen in this course: NLTK. And in fact, NLTK interfaces with scikit-learn, and also has interfaces for other machine learning toolkits like Weka. We will not be covering Weka in this course, but I would encourage you to check it out as well.

Let's first start with scikit-learn. Scikit-learn is an open-source machine learning library. It started as a Google Summer of Code project in 2007, and it has a very strong programmatic interface, especially when you compare it against Weka, which is driven primarily through a graphical user interface. Scikit-learn is used extensively as a machine learning library in Python.

Scikit-learn has predefined classifiers; the algorithms are already there for you to use. So you could use the Naive Bayes classifier if you want to learn that. Let's go through the steps: what functions and what calls you'd use when you're using the Naive Bayes classifier.

First, you import naive_bayes from sklearn. Then, you call clfr = naive_bayes.MultinomialNB(), and that is your Naive Bayes classifier. Now, we have seen earlier that there are two main ways in which Naive Bayes models can be trained: one is the multinomial model, the other is the Bernoulli model. And you have a Bernoulli model here too; you use naive_bayes.BernoulliNB() if you want that model.

Once you have defined a Naive Bayes classifier, you can train it on the training data. You would use clfr.fit and pass the data and the labels as two parameters. If you are completely comfortable with this, you could actually merge the two together: you could say naive_bayes.MultinomialNB().fit(train_data, train_labels).

Once you have trained the model, you predict the label for a new dataset using the predict function. So you call clfr.predict and pass the test data, from which you have already extracted the features, and when you call predict, you get back the labels.

And once you have the labels, you can see how well you have done in your classification. In particular, if you have labeled test data, you would use metrics.f1_score, which is one of the measures we use. You give it the gold set, that is, the actual labels, and the predicted labels, and you define what kind of averaging you want to do: micro averaging or macro averaging. You have covered some of these concepts already in course three, so I'll not repeat them here. But I would point you to some of the reading material around what measures you could use instead of f1_score, what F1 means, and what kinds of averaging you could do, like micro averaging and macro averaging, and so on.
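Putting those calls together, here is a minimal sketch of the whole flow; the toy word-count features, labels, and variable names are made up purely for illustration:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Toy count features: six tiny "documents" (rows) over a 3-word
# vocabulary (columns); class 0 is heavy on the first word,
# class 1 on the last two. Values are made up for illustration.
train_data = [[3, 0, 1], [2, 0, 0], [3, 1, 0],
              [0, 2, 3], [0, 3, 1], [1, 2, 2]]
train_labels = [0, 0, 0, 1, 1, 1]

clfr = MultinomialNB()              # the multinomial Naive Bayes model
clfr.fit(train_data, train_labels)  # train on data and labels

test_data = [[2, 0, 1], [0, 3, 2]]
test_labels = [0, 1]                # gold labels for the labeled test set
predicted_labels = clfr.predict(test_data)

# micro-averaged F1 over the labeled test set
score = f1_score(test_labels, predicted_labels, average='micro')
print(predicted_labels, score)
```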

Sklearn, that's scikit-learn, also has an SVM classifier. So how do you train a support vector machine, or SVM? Well, the calls are very similar. In this case, you import svm from sklearn and call svm.SVC as the classifier; SVC stands for support vector classifier. As we see, you need to pass some parameters. Typically, for text classification models, you're going to focus on linear classifiers, so you set the kernel parameter to linear. And then you can specify the C parameter; we have talked about that in one of the earlier videos, and it is the parameter for the soft margin. The default value for kernel is 'rbf', a radial basis function kernel, and the default value for C is one, where you are neither too hard nor too soft on the margin.

Once you have defined this SVM classifier, you can fit it, or train it, the same way as you trained the Naive Bayes one. So you call clfr.fit and pass the data and labels as two separate parameters. And then you can predict the same way as last time: you call clfr.predict on the test data.
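As a sketch, the same flow with an SVM; the toy count features are made up for illustration, and C=0.1 is just an illustrative choice:

```python
from sklearn.svm import SVC

# Toy word-count features (made up): class 0 is heavy on the first
# word, class 1 on the last two.
train_data = [[3, 0, 1], [2, 0, 0], [3, 1, 0],
              [0, 2, 3], [0, 3, 1], [1, 2, 2]]
train_labels = [0, 0, 0, 1, 1, 1]

# linear kernel, as is typical for text; C controls the soft margin
clf = SVC(kernel='linear', C=0.1)
clf.fit(train_data, train_labels)   # same fit call as for Naive Bayes

predicted = clf.predict([[2, 0, 1], [0, 3, 2]])
print(predicted)
```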

Now, we need to talk briefly about model selection. You'd recall that there are multiple phases in a supervised learning task; we talked about it earlier. There's a training phase and an inference phase. You have data that is already labeled, in this case green and red, and you split that labeled data into the training data set and the held-out validation data set. And then you have the test data, which could also be labeled and would be used to say how well you have performed on unseen data. But typically, the test data set is not labeled. So you're going to train something on a labeled set and then apply it on an unlabeled test set.

So you need to use some of the labeled data to see how good these models are, especially if you're comparing between models, or if you are tuning a model. If you have some parameters, for example the C parameter in SVM, you need to know what a good value of C is. So how would you do it? That problem is called the model selection problem, and while you're training, you need to make sure that you have ways to do that.

There are two ways you could do model selection. One is keeping some part of the labeled data set separate as the held-out data. The other option is cross-validation.

For the first one, if you're doing that in scikit-learn, you're going to say from sklearn import model_selection; that will give you the options available to you. First, we'll see how you could use the train-test split. So you say model_selection.train_test_split, give it the training data and the training labels, and then specify how big your test size should be.

So for example, suppose you have these data points; in this case I have 15 of them, and I say I want to do a two-thirds, one-third split. So my test size is one third, or 0.333. That would mean 10 of them would be the train set and five of them would be the test set. Now, you would shuffle the labeled data first, so that you have a roughly uniform distribution over the positive and negative classes. And then you could say, I want to keep, let's say, 66 percent in train and 33 percent in test, or go 80-20 if you want to: four out of five parts go in my train set and one out of five, the fifth part, as test.

When you do it this way, you are losing a significant portion of your training data to the test split. Remember that you cannot see that test data when you're training the model; it is used exclusively to tune the parameters. So your training data effectively reduces to 66 percent in this case.
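A minimal sketch of that split, with 15 made-up data points as in the illustration; the variable names and seed are illustrative only:

```python
from sklearn.model_selection import train_test_split

# 15 labeled data points, as in the slide (values are made up)
data = list(range(15))
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# hold out one third as test; shuffle with a fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.333, random_state=42)

print(len(X_train), len(X_test))   # 10 in train, 5 held out
```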

The other way to do model selection would be cross-validation. Five-fold cross-validation would be something like this, where you split the data into five parts; these are the five folds. And then you train five times, basically. Every time, four parts are in the train set and one part is in the test set. So you're going to train five models. Let's see: first, you train on parts one to four and test on part five. The next time, you train on parts two to five and test on part one, and so on. So you have one iteration where fold five is the test, a second iteration where fold one is the test, a third iteration where fold two is the test, and so on. When you do it this way, you get five ways of splitting the data, and every data point is in the test set exactly once across these five folds. And then you average the five results you get over the whole test set to see how well the model performs on unseen data.

The number of cross-validation folds is a parameter; in this explanation, I took it as five. It's fairly common to use 10-fold cross-validation, especially when you have a large data set: you keep 90 percent for training and 10 percent as the held-out validation data set, but because you're doing it 10 times, you're also averaging over multiple runs. And in fact, it's fairly common to run cross-validation multiple times, so that you reduce the variance in your results. Both these methods, the train-test split and cross-validation, are fairly commonly used and critical when you're doing any model selection. Okay.
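In scikit-learn, this fold-by-fold loop is wrapped up in cross_val_score; here is a minimal sketch on made-up count data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 20 made-up count-feature documents: class 0 is heavy on word one,
# class 1 on word two
X = np.array([[3, 0]] * 10 + [[0, 3]] * 10)
y = np.array([0] * 10 + [1] * 10)

# 5-fold cross-validation: train on four folds, test on the fifth,
# five times over; cross_val_score returns one score per fold
scores = cross_val_score(MultinomialNB(), X, y, cv=5)
print(scores, scores.mean())
```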

Now let's move to NLTK. How do you do supervised text classification in the Natural Language Toolkit that we have seen in fair detail in this course? NLTK has some text classification algorithms of its own. For example, it has a Naive Bayes classifier; it also has decision trees, conditional exponential models, maximum entropy models, and so on. But the really interesting thing is that it has something called the WekaClassifier and the SklearnClassifier, which give users of NLTK a way to call the underlying scikit-learn classifier or the underlying Weka classifier through their code in Python.

Specifically, if you are using the Naive Bayes classifier that is available in NLTK, you're going to say from nltk.classify import NaiveBayesClassifier, and then the classifier is NaiveBayesClassifier.train. So you directly train on the train set; note that there are not two separate calls, as in scikit-learn, where you have a base model and then a training function. Here, you say NaiveBayesClassifier.train(train_set), and that trains the model, and then you classify with the classify function: classifier.classify(unlabeled_instance). If it's one instance, you use the classify function; if there are many, you say classify_many and give it a set of unlabeled instances.

You can also get the accuracy of this classifier using the nltk.classify.util module: you call the accuracy function there, giving it the classifier and the test set, and that tells you the accuracy of the classifier that you have trained. You can also use other utility functions, like labels: classifier.labels() tells you all the labels that this classifier has been trained on. And you can use something like show_most_informative_features, which gives you the top few features; you specify how many you want, say the top five or top 10 features that are most important, or informative, for the classification task. This is especially useful in Naive Bayes classifiers, when you want to know which features carry the most information, which ones are most informative for this classifier.
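A minimal sketch of these NLTK calls; the feature dictionaries, feature names, and labels below are made up for illustration:

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# NLTK classifiers take (feature-dict, label) pairs; these tiny
# "reviews" are invented purely to show the call pattern
train_set = [({'contains(good)': True,  'contains(bad)': False}, 'pos'),
             ({'contains(good)': True,  'contains(bad)': False}, 'pos'),
             ({'contains(good)': False, 'contains(bad)': True},  'neg'),
             ({'contains(good)': False, 'contains(bad)': True},  'neg')]

classifier = NaiveBayesClassifier.train(train_set)   # train in one call

# classify a single unlabeled instance
label = classifier.classify({'contains(good)': True,
                             'contains(bad)': False})

test_set = train_set                # reused here only for illustration
acc = accuracy(classifier, test_set)

print(label, acc, sorted(classifier.labels()))
classifier.show_most_informative_features(2)   # top two features
```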

For support vector machines, there is no native NLTK implementation. But as I said, you can use the scikit-learn SVM through NLTK. So here, you say from nltk.classify import SklearnClassifier. And then you can use both Naive Bayes models from scikit-learn, using from sklearn.naive_bayes import MultinomialNB or BernoulliNB, and you can use the SVM model that you have seen earlier: from sklearn.svm import SVC.

You call the function in a very similar way to how you do it in scikit-learn. So you have SklearnClassifier, you give it the classifier there, the name of the classifier, then .train, and then give it the train set. Now for MultinomialNB, there were no parameters that you needed to pass, and that's okay. But for the support vector machine, there is one, right? You need to specify the kernel, for example. So you can specify that inside this SklearnClassifier call: you say that I'm going to call SVC, and I pass the parameters, where it's a linear kernel, and the C parameter, for example, can also be specified here. And then you call .train(train_set). The rest is very similar to how you would do it in sklearn: you classify as before, and so on.
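A minimal sketch of the wrapper; the feature dictionaries are made up for illustration, and C=0.1 is just an example value:

```python
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC

# Same made-up (feature-dict, label) pairs as NLTK expects
train_set = [({'contains(good)': True,  'contains(bad)': False}, 'pos'),
             ({'contains(good)': True,  'contains(bad)': False}, 'pos'),
             ({'contains(good)': False, 'contains(bad)': True},  'neg'),
             ({'contains(good)': False, 'contains(bad)': True},  'neg')]

# wrap the scikit-learn SVC, passing its parameters, then train in
# the usual NLTK one-call style
clf = SklearnClassifier(SVC(kernel='linear', C=0.1)).train(train_set)

label = clf.classify({'contains(good)': True, 'contains(bad)': False})
print(label)
```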

So, the take-home messages here are that scikit-learn is the most commonly used machine learning toolkit in Python, but NLTK has its own implementation of Naive Bayes, and it also has this way to interface with scikit-learn and other machine learning toolkits like Weka, by which you can call those implementations through NLTK.
