
Hello, and welcome to the lesson on introduction to logistic regression. Logistic regression, despite its name, is actually applied to classification tasks. Basically, this technique works by performing linear regression on continuous features to make a prediction on a discrete feature. So many of the same things that we've discussed with linear regression will apply to logistic regression, the difference being that this is for classification tasks. Specifically, at the end of this lesson, I want you to understand the basic concepts underneath logistic regression, to be able to explain the benefits of using logistic regression and why you may or may not want to use it for a particular classification task, and to be able to apply logistic regression by using Python and the scikit-learn library.

Now, all of the content for this lesson is contained in the introduction to logistic regression notebook. Traditionally, when you do regression, you're solving for a continuous predicted value. Logistic regression, however, is for a classification task, where you're predicting a discrete value. So we're going to follow along with the ideas of linear regression, but introduce them for logistic regression. In this case, we're going to have to turn a prediction into a probability of success or failure, which is bounded by the range zero to one.

To do this, we're going to need a transformation. The most popular transformation uses the logit function; you can also use the probit function. This whole task of using the logit function is known as logistic regression, and that's because the inverse of the logit function, which is what we actually use, is called the logistic function.

Now in this notebook, we're going to introduce logistic regression and demonstrate how it can be used for binary prediction, and we're also going to show some of the other tasks that are important in logistic regression, including things such as marginal effects and odds ratios. So, first we will start with our standard setup code, before moving into the formalism of what actually is going on with logistic regression.

Imagine we have a situation where we can have a binary outcome that would be a success or failure. For instance, flipping a coin: heads is a success, tails is a failure. We can define the odds of success as the probability of success, P, over the probability of failure, 1 minus P. Now, if we take the idea of linear regression, but want to map its output into this range of zero to one, we can do that by using the logit function.

So let's talk about the logit function. The logit function simply takes the logarithm of this odds ratio: it takes the probability, P, in this range, and gives us a continuous value out. If we invert it, we can turn a continuous value, alpha, into that probability.
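As a quick sketch of these two functions (pure NumPy; the function names here are my own, not the notebook's):

```python
import numpy as np

def logit(p):
    """Log of the odds ratio: maps a probability in (0, 1) to (-inf, inf)."""
    return np.log(p / (1 - p))

def logistic(alpha):
    """Inverse of the logit: maps any continuous alpha to a probability."""
    return 1 / (1 + np.exp(-alpha))

# Round trip: a probability of 0.75 has log-odds log(3); inverting recovers it.
alpha = logit(0.75)
p = logistic(alpha)
```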

There's also the probit function, short for probability unit, which can also be used for this kind of regression, in that case sometimes known as probit regression.

So, let me show you an example of the logistic function. Here's our alpha value. It's continuous: it goes from minus infinity to infinity, and it maps into a space of zero to one. This dashed line shows you a threshold. As we move that up or down, we determine failure below, success above. That threshold can actually be a tunable hyperparameter.
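A minimal sketch of how such a threshold works, assuming we threshold the logistic output directly:

```python
import numpy as np

def logistic(alpha):
    return 1 / (1 + np.exp(-alpha))

alphas = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # continuous inputs
probs = logistic(alphas)                          # mapped into (0, 1)

threshold = 0.5                                   # the tunable hyperparameter
labels = (probs >= threshold).astype(int)         # success above, failure below
```

Raising the threshold makes the classifier more conservative about predicting success; lowering it does the opposite.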

Now, in order to figure out the optimal model parameters, we need to minimize the cost function, just like with linear regression. So if we had a linear model like this, we could solve it by putting in our Y, our predicted value, and getting our logistic function out, which then maps into our probability. So we have to have a cost function. The typical way we solve this is with gradient descent, or specifically, stochastic gradient descent, which introduces a randomized component to gradient descent.

Gradient descent is a simple concept: it computes the derivative, or finds the slope of the tangent line to the cost function, at a particular point. So let me show you a plot. Here we've got an example; the code just makes this plot. Say this blue curve is our cost function and we want to find the minimum. To do that, we start some place out, in this case X equals two, and we compute the gradient. That's this green dashed line. And it tells us that in order to reduce the cost, we have to move to the left. How far we move to the left, we don't know, so we move a little bit. What we do then is compute the gradient again and ask, "Which direction do we have to move?" In this case, we keep moving to the left until we've reached the minimum.
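That loop can be sketched in a few lines; this toy example uses a one-dimensional cost of x squared, not the notebook's actual plot:

```python
def cost(x):
    return x ** 2              # simple convex cost, minimum at x = 0

def gradient(x):
    return 2 * x               # derivative of the cost

x = 2.0                        # starting point, as in the plot
learning_rate = 0.1            # how far we move on each step
for _ in range(200):
    x -= learning_rate * gradient(x)   # step opposite the slope
# x has now moved from 2 down very close to the minimum at 0
```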

If this cost function were a more complex surface, for instance if it were multidimensional, you might have multiple local minima while wanting to find the global minimum, and this process can be more complex. And yet, this lies at the heart of all machine learning algorithms. We have some sort of cost function that we need to minimize to determine which is the best model for our data. The standard approach is gradient descent or some variant of that technique, but there are other algorithms as well, and you'll see that as you become more proficient at machine learning.

So, before jumping into logistic regression, I want to introduce logistic modeling, where we can model a data set with a logistic function. The data we're going to use for this is the Challenger O-ring disaster. The idea is you have temperature, and you know whether an O-ring failed at a certain temperature.

Before I do that, I want to mention these two code cells; they are important. First, we define the file name for our data locally; that's the first code cell. The second code cell asks, does that data file exist? If it does, we do nothing. If it doesn't, we use the wget command to pull it from a remote archive. So in this case, you can see the file already exists, so we didn't download it. But if we had never run this notebook, it would pull that data down. We're going to use this pattern repeatedly in this course as we introduce new data sets into the analysis with machine learning.
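The pattern in those two cells looks roughly like this (the function name, file name, and URL here are placeholders, not the notebook's actual values):

```python
import os

def fetch_if_missing(data_file, data_url):
    """Download the data file with wget only if it is not already present."""
    if os.path.exists(data_file):
        return False                                  # file exists: do nothing
    os.system(f'wget -O {data_file} {data_url}')      # otherwise pull it down
    return True
```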

So we process this data, we sample a few rows, and you can see there are a few features: how many thermal distresses, or failures, do we have? What's the temperature? What's the pressure? What's the order? We're going to focus on the temperature and whether there was a thermal distress.

So looking at this, the first thing we can see is there's one problem: we want to map this between zero and one, where one means there was a distress and zero means there wasn't. So we have to take care of that. The way we're going to do it is simply to change those values to one, and we talk about this in the notebook here. Once we do that, we can apply our modeling: we model this data and get a result out.

We can then make a plot of this data. So here are the actual measured values, whether there was a failure or not, and you can see at high temperatures there tends not to be a failure, while at low temperatures there is; and here's our model. We can then use the model to make predictions, and in this case it shows that as the temperature gets very low, we approach a 100 percent chance of failure. This actually is the demonstration: the Challenger launch disaster was at 36 degrees, so the engineers should have expected a failure. Of course, hindsight is perfect, and so we always have to be cognizant of that when looking back.

Now we're going to introduce logistic regression. We talk about some of the important hyperparameters, then we introduce logistic regression; in this case, we're using the C parameter, setting it to a large value. The reason we do this is because we don't want any kind of regularization, which is designed to prevent over-fitting; we'll talk about that in a future module. I'm just introducing this value here so the model doesn't do any regularization. This is strict logistic regression, no regularization.
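A sketch of that setup with scikit-learn, on toy data of my own rather than the O-ring data: a very large C makes the regularization penalty negligible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: low temperatures fail (1), high temperatures do not (0).
X = np.array([[50.0], [55.0], [60.0], [70.0], [75.0], [80.0]])
y = np.array([1, 1, 1, 0, 0, 0])

# C is the inverse regularization strength; a huge value effectively turns
# regularization off, leaving plain logistic regression.
model = LogisticRegression(C=1e9)
model.fit(X, y)
```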

In the rest of the notebook, we now start talking about the test-train split. We split our data, and we want to make sure we follow a stratified split. That means that when we split the data into training and testing, we maintain the class proportions so that we don't get imbalanced data sets, which would give us a model that doesn't predict very well.
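A sketch of a stratified split with scikit-learn (the toy labels here are mine):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: eight zeros and four ones.
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

# stratify=y preserves the 2:1 class ratio in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=23)
```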

Here's our fit to that Challenger O-ring data, and then we can make predictions based on the temperature. We can also make a classification report and a confusion matrix, following the same evaluation workflow we used in the linear regression lesson. We could also change the data set if we want.

First, I want to talk a little bit about performance metrics. This is effectively the confusion matrix that we saw before. We can compute things such as the true negatives, false positives, true positives, and false negatives, and then we can use those to compute different performance metrics; the table here shows how to use those values to create them. We can also talk about type one errors and type two errors; these are very important concepts. Typically, they're used in hypothesis testing, which we'll talk about in a future course, but we can also calculate many of these quantities, and so we do. Here are the precision, accuracy, recall, and F1 score, and then we can compute the same things using the built-in functions from scikit-learn, and you can see they give the same values.
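A sketch of computing metrics both ways, by hand from the confusion matrix counts and with the scikit-learn built-ins, on labels I made up:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# ravel() unpacks the 2x2 confusion matrix into its four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many we found

# The hand-computed values agree with the built-in functions.
```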

We could also do more complex fitting. Here we can compute the coefficients of our fit, then predict on our data and make a plot. So this is the same data you saw before. Here's that logistic regression model, and here are its computed predictions. Notice how, with our model's predicted data, our test data, we go from a probability of zero straight up to a probability of one.

I introduced the SGD classifier here simply because it can implement logistic regression with no regularization, which makes it easy to do. We don't have to use that C parameter; we simply have to say, use the log loss, and that makes it logistic regression. And we can compute the exact same things with that.

Then the notebook switches over to the tips data set. In this case, rather than making a prediction for the total bill like we did in the linear regression notebook, we can try to make a prediction on one of the categorical features. In this case, we're going to ask: can we predict whether somebody is a smoker based on the total bill, the tip, and the size of the party? And so we go through this notebook doing the same things we've been doing before: we split our data into test and train, we take our model, we fit our model, we make predictions from our model, and you can see the results.

We're going to apply other classification algorithms to the same problem in future lessons and see how this changes. We can get our confusion matrix, and you can see there's a big change. This doesn't look as good as previous confusion matrices, but that might be okay, because sometimes you're more worried about a specific performance goal, such as minimizing false negatives or minimizing false positives, and thus will accept certain types of errors in order to get the results you want.

We could also try other feature sets; in this case, we're going to use categorical features in addition to total bill, tip, and size, to see if that improves the results. And going through, you can see that it does improve the results, though not necessarily for this particular value here. We see that in the confusion matrix: this was 20, remember, but now it's only gone down to 13, so it's a little better, though we might be able to do better still with some other technique.

So the next thing I wanted to get into was showing you how to do this with a formula-based approach. We first need to get our data frame, and then we can use the statsmodels API interface to do logistic regression. So here we're saying our label, whether somebody is a smoker or not, is related to total bill, tip, and size, and we can compute our function and get the results out. And so you can see the parameters that multiply these particular features, the error on those, and the confidence intervals. We can look at other things like confusion matrices if we want.
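A sketch of the formula-based approach with statsmodels, using synthetic stand-in data rather than the actual tips data frame (only the column names mirror the lesson's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the tips data.
rng = np.random.default_rng(23)
n = 200
df = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, n),
    'tip': rng.uniform(1, 10, n),
    'size': rng.integers(1, 6, n),
})
# Make the label loosely depend on the bill so the fit has signal.
df['smoker'] = (df['total_bill'] + rng.normal(0, 10, n) > 27).astype(int)

# The formula reads: smoker is modeled from total_bill, tip, and size.
result = smf.logit('smoker ~ total_bill + tip + size', data=df).fit(disp=0)
# result.summary() lists each coefficient, its standard error,
# and its confidence interval.
```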

Lastly, I wanted to go down and look at the last two concepts introduced in this notebook. The first is marginal effects: what are the relationships between the different features and the prediction? We can compute those very easily with statsmodels, and this particular code cell and its output show the relationship between the different features and their predictive power. Then I wanted to show the odds ratio, which is important when you're looking at whether a feature contributes, or not, in a particular way to the prediction. And again, we can get this output very easily with the statsmodels interface.
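Both quantities come almost for free from a fitted statsmodels result; a minimal sketch with one synthetic feature of my own:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One synthetic feature whose increase raises the chance of label = 1.
rng = np.random.default_rng(23)
x = rng.normal(0, 1, 300)
df = pd.DataFrame({'x': x,
                   'label': (x + rng.normal(0, 1, 300) > 0).astype(int)})

result = smf.logit('label ~ x', data=df).fit(disp=0)

# Marginal effects: the change in predicted probability per unit change in x.
margeff = result.get_margeff()

# Odds ratio: exponentiating a coefficient gives the multiplicative change
# in the odds of success per unit increase in that feature.
odds_ratios = np.exp(result.params)
```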

So I've gone through a lot here in this particular notebook. It will take you some time to work through, but hopefully you'll get a good feel for both the classification challenge in general and the use of logistic regression for classification tasks. If you have any questions, let us know, and good luck.
