0:00
A lot of the action in machine learning has focused on what
 algorithms are the best algorithms for
 extracting information and using it to predict.
 But it's important to step back and look at the entire prediction problem.
This is a little diagram that I made to illustrate
 some of the key issues in building a predictor.
 So you start off with, suppose I want to
 predict for these dots whether they're red or blue.
 Well, what you might do is have a big group of dots that you
 want to predict about, and then you use
 probability and sampling to pick a training set.
 The training set will consist of some red dots and some blue
 dots, and you'll measure a whole bunch of characteristics of those dots.
 Then you'll use those characteristics to build what's called a
 prediction function, and the prediction function will take a new dot,
 whose color you don't know, but using those characteristics that
 you measured will predict whether it's red or whether it's blue.
 Then you can go off and try to
 evaluate whether that prediction function works well or not.
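The steps just described can be sketched in a small, purely illustrative Python example (the dots, their single measured characteristic, and the cutoff rule are all made up, not from the lecture): sample a training set, build a prediction function from a measured characteristic, then evaluate it on held-out dots.

```python
import random

random.seed(42)

# Made-up "dots": one measured characteristic (x) and a color.
# In this invented data, red dots tend to have larger x than blue dots.
dots = [(random.gauss(2.0, 1.0), "red") for _ in range(100)] + \
       [(random.gauss(0.0, 1.0), "blue") for _ in range(100)]

# Use probability and sampling to pick a training set; the rest is held out.
random.shuffle(dots)
train, test = dots[:150], dots[150:]

# Build a prediction function from the measured characteristic:
# call a new dot "red" if x is closer to the red mean than the blue mean.
red_xs = [x for x, color in train if color == "red"]
blue_xs = [x for x, color in train if color == "blue"]
cutoff = (sum(red_xs) / len(red_xs) + sum(blue_xs) / len(blue_xs)) / 2

def predict(x):
    return "red" if x > cutoff else "blue"

# Evaluate the prediction function on the held-out dots.
accuracy = sum(predict(x) == color for x, color in test) / len(test)
print(round(accuracy, 2))
```

The important point is that the sampling step at the top happens before any "machine learning" does.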
 1:03
A required component of building every machine learning algorithm
 is deciding which samples you're going to use to build that algorithm.
 But sometimes it's overlooked, because all of the action that you hear about for
 machine learning happens down here when you're
 building the actual machine learning function itself.
 1:19
One very high-profile example of the ways that this
 can cause problems is the recent discussion about Google Flu Trends.
 Google Flu Trends tried to use the terms that people were typing into
 Google, terms like "I have a cough", to predict how often people would get the flu.
 In other words, what was the rate of flu that was going
 on in a particular part of the United States at a particular time?
 1:41
And they compared their algorithm to the approach taken
 by the United States government, where they went out
 and they actually measured how many people were
 getting the flu, in different places in the US.
 And they found in their original paper that the
Google Flu Trends algorithm was able to very accurately represent
 the number of flu cases that would appear in
 various different places in the US at any given time.
 But it was quite a bit faster and quite a
 bit less expensive to measure using search terms at Google.
 The problem that they didn't realize at the time, was that
 the search terms that people would use would change over time.
 They might use different terms when they were
 searching, and so that would affect the algorithm's performance.
 And also, the way that those terms were actually
 being used in the algorithm wasn't very well understood.
And so when the function of a particular search
 term changed in their algorithm, it could cause problems.
 And this led to highly inaccurate results for the Google
 Flu Trends algorithm over time, as people's internet usage changed.
So this gives you an idea that choosing
 the right dataset and knowing what the specific
 question is are again paramount, just like they have
 been in other classes of the Data Science Specialization.
 So here are the components of a predictor.
You need to start off, as always in any data science
 problem, with a very specific and well-defined question.
 What are you trying to predict and what are you trying to predict it with?
 2:56
Then you go out and you collect the best
 input data that you can to be able to predict.
And from that data you might either use measured
 characteristics that you have, or you might use computations
 to build features that you think might be
 useful for predicting the outcome that you care about.
 At this stage then you can actually start to use the machine learning
 algorithms you may have read about, such as Random Forest or Decision Trees.
 And then what you can do is estimate the
 parameters of those algorithms, and use those parameters to
 apply the algorithm to a new data set and
 then finally evaluate that algorithm on that new data.
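As a toy sketch of those last three steps (estimate parameters, apply the algorithm to new data, evaluate it there), here is a hedged Python example where the only parameter is a cutoff on a single feature; the feature values and labels are invented purely for illustration.

```python
import random

random.seed(1)

# Invented feature values (think: frequency of some keyword) and labels.
# These numbers are illustrative only, not real email data.
train = [(random.uniform(0.3, 2.0), "spam") for _ in range(60)] + \
        [(random.uniform(0.0, 0.6), "ham") for _ in range(60)]
test  = [(random.uniform(0.3, 2.0), "spam") for _ in range(20)] + \
        [(random.uniform(0.0, 0.6), "ham") for _ in range(20)]

def accuracy(data, cutoff):
    """Fraction classified correctly by the rule: spam if feature > cutoff."""
    return sum(("spam" if f > cutoff else "ham") == label
               for f, label in data) / len(data)

# Estimate the algorithm's parameter (the cutoff) on the training data...
cutoffs = [i / 20 for i in range(21)]  # candidate cutoffs 0.00, 0.05, ..., 1.00
best = max(cutoffs, key=lambda c: accuracy(train, c))

# ...then apply the fitted rule to the new data set and evaluate it there.
print(best, round(accuracy(test, best), 2))
```

Note that the parameter is chosen using only the training data, and the evaluation is done on data the rule has never seen.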
So I'm going to just show you one quick little
 example, to show you how this process works.
 This is obviously a trivialized version of what would happen in a
 real machine learning algorithm, but it gives you a flavor of what's going on.
So you start off with the question.
 In general, people usually start with a quite general question.
 So here is, can I automatically detect emails
 that are SPAM from those that are not?
So SPAM emails are emails that come from companies,
 that get sent out to thousands of people at the same time,
 and that you might not be interested in.
 4:02
So you might want to make your question a little bit more concrete.
 You often need to when doing machine learning.
So, the question might be: can I use
 quantitative characteristics of those emails to classify them as
 SPAM, or what we're going to call HAM, which
 is the email that people would like to receive?
 4:19
So once you have your question, then you need to find input data.
 In this case, there's actually a bunch of data
 that's available and already pre-processed for us in R.
So it's actually in the kernlab
 package, K-E-R-N-L-A-B, and it's the spam dataset.
 So we can actually load that data set into R directly, and it has some information
 that's been collected about SPAM and HAM emails already available to us.
Now we might want to keep in mind that this might
 not necessarily be the perfect data; in fact, we don't have all
 of the emails that have been collected over time, and we
 don't have all the emails that are being sent to you personally.
 So we need to be aware of the potential limitations of this
 data, when we're using it to build an algorithm, a prediction algorithm.
 4:58
Then we want to calculate something about features.
 So, imagine that you have a bunch of emails.
 And here's an example email that's been sent to me.
 Dear Jeff, can you send me the address, so I can send you the invitation.
 Thanks, Ben.
 If we want to build a prediction algorithm,
 we need to calculate some characteristics of
 these emails that we can use to be able to build a predictive algorithm.
 And so one example might be, we can
 calculate the frequency with which a particular word appears.
So here, we're looking for the frequency with which the word "you" appears.
 And in this case, it appears twice in this email, so 2 out
 of 17 words, or about 12% of the words in this email, are "you".
 We could calculate that same percentage for every single email that we have and
now we have a quantitative characteristic that we can try to use to predict.
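That word-frequency feature is easy to compute. Here is a minimal Python sketch (the `word_frequency` helper is hypothetical, not from the lecture), applied to the example email above.

```python
# The example email from the slide above.
email = ("Dear Jeff, can you send me the address, "
         "so I can send you the invitation. Thanks, Ben.")

def word_frequency(text, word):
    """Fraction of the words in `text` equal to `word`, ignoring case and punctuation."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return words.count(word.lower()) / len(words)

freq = word_frequency(email, "you")
print(round(freq, 3))  # prints 0.118: "you" is 2 of the 17 words
```

Running the same function over every email, for a list of candidate words, gives exactly the kind of feature matrix the spam dataset contains.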
 5:43
So the data in the kernlab package that I've shown here are actually
 information just like that: for every email we
 have the frequency with which certain words appear.
 And so, for example if credit appears very often in the email or money appears
 very often in the email, you might imagine that that email might be a SPAM email.
So, as one example of that, we looked at the frequency
 of the word "your", and how often it appears in the email.
 And so, I've got a plot here that's a density plot of that data.
 On the x-axis is the frequency
 with which "your" appeared in the email.
 And on the y-axis is the density, or the
 number of times that frequency appears amongst the emails.
And so what you can see is that most of the emails that are SPAM, those are the
 ones that are in red, tend to have more appearances of the word "your".
 Whereas all of the emails that are HAM, the
 ones that we actually want to receive, have a much higher peak
 right down near 0, so there are very few
 HAM emails that have a large number of "your"s.
 6:49
So, we can build an algorithm; in this case let's build a very, very simple algorithm.
 We can estimate an algorithm where we just want to find a cutoff, a constant C, where
 if the frequency of "your" is above C then
 we predict SPAM, and otherwise we predict that it's HAM.
 7:05
So going back to our data, we can try to figure out what
 that best cutoff is. Here's an example of a cutoff that you could
 choose: if the frequency is above 0.5 then we
 say that it's SPAM, and if it's below 0.5 we say that it's HAM.
 And we think this might work because you can see that
 the large spike of blue HAM messages is below that cutoff,
 whereas one of the big spikes of the SPAM messages is above that cutoff.
 So you might imagine that this will catch quite a bit of that SPAM.
 So then what we do is we evaluate that.
So what we would do is calculate, for
 example, predictions for each of the different emails.
We take a prediction rule that says, if the frequency of "your" is
 above 0.5, then you're SPAM, and if it's below, then you're nonspam.
And then we make a table of those predictions and divide
 it by the number of all the observations that we have.
And so what we can say is that, when you're nonspam,
 about 46% of the time we get you right.
 When you're SPAM, about 29% of the time we get you right.
 So in total we get it right about 46% plus 29%, or about 75% of the time.
 So our prediction algorithm is about 75% accurate in this particular case.
 So that's how we would evaluate the algorithm.
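The cutoff rule and its evaluation table can be mimicked in a short Python sketch. The frequencies and labels below are made up (the lecture's actual computation uses the spam data in R), but the mechanics are the same.

```python
# Made-up (frequency of "your", true label) pairs standing in for the
# real spam data; the lecture's actual tabulation is done in R with
# table(prediction, spam$type)/length(spam$type).
emails = [(0.0, "nonspam"), (0.2, "nonspam"), (0.4, "nonspam"), (0.9, "nonspam"),
          (0.1, "spam"),    (0.7, "spam"),    (1.3, "spam"),    (2.1, "spam")]

CUTOFF = 0.5

# Predict "spam" whenever the frequency of "your" is above the cutoff.
predictions = ["spam" if freq > CUTOFF else "nonspam" for freq, _ in emails]

# Tabulate predictions against the truth, as fractions of all emails.
table = {}
for pred, (_, truth) in zip(predictions, emails):
    table[(pred, truth)] = table.get((pred, truth), 0) + 1 / len(emails)

# Accuracy is the sum of the two "correct" cells of the table.
accuracy = table.get(("nonspam", "nonspam"), 0) + table.get(("spam", "spam"), 0)
print(round(accuracy, 3))  # prints 0.75: 6 of the 8 toy emails are classified correctly
```

This toy data happens to land on the same 75% figure quoted above; the diagonal cells of the table are the correct classifications, and the off-diagonal cells are the two kinds of mistakes.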
This is, of course, on the same dataset where we actually calculated
 the prediction function, and as we will see in later lectures,
 this will be an optimistic estimate of the overall error rate.
 So that's an overview of, the basic steps in building a predictive algorithm.