
Hello and welcome to the introduction to machine learning lesson. This lesson will start by introducing four types of data analytics that are commonly used in business communities. Next, we'll get into the basic tasks of machine learning. This includes things such as data cleaning and pre-processing. While they aren't the most exciting aspects of analytics or machine learning, they're incredibly important, because you can't make sense of data if the data isn't clean and ready to be analyzed.

Finally, we'll introduce the four main categories of machine learning: classification, regression, clustering, and dimensionality reduction. We'll also talk about how to persist machine learning models. This last aspect is important because if you spend a lot of time building a model, you want to be able to save it so that you can reuse it or deploy it on different computational hardware.

So the key things you should be able to do by the end of this lesson are: explain the difference between supervised and unsupervised learning, explain the difference between regression and classification, understand and articulate the basic concepts of clustering and dimensionality reduction, and be able to use the scikit-learn library to perform these basic tasks.

Now, a key point here is that I don't expect you to be experts after this lesson. You're going to have many modules to learn these things in more detail. This is simply to give you a high-level overview of these concepts so that you'll be able to understand them, talk about them, and be ready to dig into them more deeply in future lessons. There's going to be a reading on the four types of data analytics that are typically encountered in data science, as well as a notebook.

So first, let me go to the readings, and I'm going to expand this so it's a little easier to see. You've probably seen this example in other places; it's not unique to this particular website. The idea is that there are really four things you can do with data. The first is descriptive analytics, which asks what's happening right now: you have some data and you're trying to make sense of what's going on. The second is diagnostic analytics, which asks why the things I'm seeing are actually happening; the idea is to go beyond the merely descriptive and try to understand the causes. The third is predictive analytics, which is a little more forward looking: I think I understand what's going on right now, so what's likely to happen in the future? And lastly there's prescriptive analytics: what do I need to do to capitalize on the things I'm seeing as potential future outcomes? This article goes through these in more detail. I encourage you to read it; it's an important idea and there's a lot of good information in here.

The main part of the lesson, though, is going to be the introduction to machine learning notebook. In this notebook, we're going to introduce the scikit-learn library, which is often shortened to sklearn. We're going to show how to do some basic steps in the machine learning process, including data pre-processing and data scaling, and then give specific examples of classification, regression, dimensionality reduction, and clustering before ending with a demonstration of model persistence.

Now, in the interest of time, I'm not going to step through every single part of this notebook; I simply want to highlight a few things. First, you're going to see this particular code cell, or something very similar to it, at the start of every notebook. This of course is our standard setup, where we import most of the modules we're going to use, suppress some warnings (in this case, pandas-specific warnings that aren't important), and finally set some options for the visualizations.

The first step is data exploration, and there actually is a standard process for this called the Cross Industry Standard Process for Data Mining, or CRISP-DM. Its first phases are business understanding, data understanding, and data preparation. So before you try to apply any fancy machine learning technique, you need to make sure you understand what it is you're actually trying to do. This may seem obvious, but often people get excited and want to run in and try some algorithm and make some fancy prediction without understanding the exact details of the problem they're trying to solve, and you don't want to solve the wrong problem. If you think about it in terms of a homework assignment, you always want to solve what you're being asked for, not something else.

So in this notebook, we're going to use the standard Iris data set. It's not a fancy new data set, but it allows us to show how the algorithms work without having to constantly change data sets. We will use several different data sets throughout the machine learning portion of this course, but the Iris one is one we'll use frequently because it easily demonstrates many of the concepts we want to cover.

So first, we load the data set, and then we use the sample method. I like sample because it just picks random rows; in this case we've said five, so that's why there are five, and it's useful because every time you run it you get a slightly different sample. We can also group by the target label, in this case species, and count how many we have of each. You can see we have 50 of each. So in this particular case there are 150 rows, or instances, in our data set, there are four features, or columns, and the classes are balanced. The other nice thing is we're not missing any data here; if we were, we might for instance see a 49 or a 48 here. We can also compute some descriptive statistics with the describe function, and here again we see that there are 150 instances in all columns, which is good.
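As a rough sketch of these exploration steps (assuming we build the DataFrame from scikit-learn's bundled copy of Iris; the actual notebook may load it differently):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris into a DataFrame with a 'species' column for the target.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

print(df.sample(5))                  # five random rows, different each run
print(df.groupby("species").size())  # 50 instances of each species
print(df.describe())                 # count, mean, std, etc. per column
```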

We can then make some simple visualizations. In this case, we're going to use the pair plot, or pair grid, which quickly produces scatter plots between different pairs of features, with a histogram of each individual feature along the diagonal. We've color coded the points by species. Looking at this, you can see some things very quickly. First of all, there's a really strong relationship between petal length and petal width; that's interesting and something we're going to want to come back to. Secondly, if we look at sepal width versus petal width, there's a natural clustering of points: a natural cluster of green here and red here, with just a slight bit of overlap. These are important observations, and one reason I like to make these sorts of pair plots is that they let you quickly visualize the relationships between different features. The visualizations are very good at giving you that insight very quickly.

Next, we do a few things. First, we pull the feature columns out of our DataFrame into an array that we can apply machine learning to. Secondly, we create a new array called labels, which is numerical with the values zero, one, and two. The ordering is that setosa is zero, versicolor is one, and virginica is two. This here is simply integer division: we take i // 50 for i in range(150), so we get an array of 150 elements. The first 50 elements are all zero because of the integer division; from index 50 up to 99 the result is one; and from index 100 on, it's two. So this is just a quick way of building an array with an integer that corresponds to each class.
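That integer-division trick can be sketched as follows. It relies on the Iris rows being sorted by species in blocks of 50, as they are in the standard data set:

```python
import numpy as np

# Build class labels for the 150 Iris rows:
# 0 = setosa, 1 = versicolor, 2 = virginica.
# Integer division by 50 turns row indices into class numbers.
labels = np.array([i // 50 for i in range(150)])

print(labels[:3])    # [0 0 0]
print(labels[50])    # 1
print(labels[149])   # 2
```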

We can make a quick plot showing our data; this now shows the clustering we saw before. And then we can start getting into machine learning. There are three key ideas here I want to highlight. First is the difference between supervised and unsupervised learning: supervised techniques use training data with labels to make predictions, while unsupervised techniques don't need labels. The second is dimensionality reduction: when you have lots of features, you often want to reduce them to a smaller subset that ideally still encodes most of the information, and that's a useful technique. And lastly there's clustering, which is a really powerful technique that lets you find data that are grouped together somehow and then treat each group as a single entity. A good business use of clustering is customer segmentation, where you're trying to say, "Here's a bunch of similar customers who are all using our particular product." You may want to find, for instance, high-income customers, or customers you need to worry about losing to competitors, and this might be a way to do that.

The rest of the notebook then steps through these. We talk about a few concepts like parameters and hyperparameters; it's important to understand these, and we'll be doing a lot with hyperparameters. Hyperparameters are values that can't be learned from the data ahead of time; instead, you need to run the model multiple times with different values and see which works best.

And then we're going to step into scikit-learn. In particular, we're going to see things such as splitting our data set into a training set and a testing set. If you train your model on data and then report the accuracy on that same data, you're going to be susceptible to problems. You always want a set of data the model has not seen to use for measuring prediction accuracy; that will be more realistic. So we do this with a train/test split. We often also use a random state here. The reason I do this is that it makes our notebook reproducible. If I didn't set it, the split would be different every run and the results would constantly change, which would make it hard to write a notebook and to convince others of what you're doing. Sometimes that's okay, but in general in these notebooks I will specify a random state.
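The split itself is a single call in scikit-learn. This sketch uses the Iris arrays from earlier; the particular test_size and random_state values are illustrative, not the notebook's exact choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data, labels = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; fixing random_state makes the
# split (and therefore the notebook's results) reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=23)

print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)
```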

We also might want to scale the data. This is important for certain algorithms: if you have one feature that goes from zero to one and another that goes from zero to 1000, it's hard for a machine learning algorithm to treat them the same way, because one has a much larger range. You can scale data in different ways, and we show some of them here. You can standardize the features to have zero mean and unit variance, or rescale them to a certain range. And you typically want to apply the scaling to both the training and testing data (fitting the scaler on the training data only), and so we mention that here. Here, we're using the StandardScaler, which gives us a zero mean and a standard deviation of one. So here's the original data and here's the scaled data, and you can see how it's changed.
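A sketch of that scaling step, fitting on the training data and then transforming both splits (split parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data, labels = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=23)

# Fit the scaler on the training data only, then apply the same
# transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(2))  # ~[0. 0. 0. 0.]
print(X_train_s.std(axis=0).round(2))   # ~[1. 1. 1. 1.]
```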

Next, we step into classification. We simply make a model and ask how accurate we are. The first thing I want you to notice is how simple this was to do. With our data already taken care of, we import our particular estimator, say we want five neighbors, fit our model, and then get our accuracy on the test data. You can see the accuracy is pretty high, which is pretty impressive. This is the beauty of scikit-learn as a library: it's very simple to apply things. In only a few lines of code, we create a model, train it, and score it.
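A sketch of that classification step with a k-nearest neighbors estimator (the five-neighbors setting comes from the lecture; the split parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, labels = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=23)

# Create, train, and score the model in three lines.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"Accuracy = {knn.score(X_test, y_test):.2f}")
```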

For regression, we use a different model, in this case a decision tree. Don't worry about the details of these models; we're going to have entire lessons devoted to them, so you'll have plenty of time to dig into that. Here again, we create our training and test data. We then apply a regression model, where we're trying to predict a continuous value. Classification was putting things into bins; regression is trying to predict a continuous value. So, for instance, if you're trying to predict future income for a company, you would want a regression, not a classification.
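As a sketch of that idea with a decision tree regressor. Predicting petal width from the other three Iris features is my illustrative choice here, not necessarily the notebook's exact target:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

data, _ = load_iris(return_X_y=True)
X = data[:, :3]     # sepal length, sepal width, petal length
y = data[:, 3]      # petal width: the continuous value to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=23)

# Fit the tree and report the R^2 score on the held-out data.
tree = DecisionTreeRegressor(random_state=23)
tree.fit(X_train, y_train)
print(f"R^2 = {tree.score(X_test, y_test):.2f}")
```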

Then we look at dimensionality reduction, which tries to reduce the number of features. We apply this, in this case, to the Iris data and ask how many features we actually need. It turns out we don't need many, and that makes sense if you think about the pair plots: there were a few combinations of features where the data was already highly separated.
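A sketch of that question using principal component analysis. PCA is a standard choice for this, but the lecture doesn't name the notebook's exact technique, so treat it as an assumption:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data, _ = load_iris(return_X_y=True)

# Project the four Iris features onto principal components and see how
# much of the total variance each component explains.
pca = PCA(n_components=4).fit(data)
print(pca.explained_variance_ratio_.round(3))
# The first two components carry nearly all of the variance, so two
# derived features are essentially enough.
```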

And then lastly, we have our clustering. Here we apply clustering to the data and get nice cluster centers, which is what this last visualization shows. What are our computed clusters? The purple stars show them: here's one cluster center, here's another, and here's a third.
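Those centers can be found with k-means clustering, for example (k-means is my assumption here; the lecture only says "clustering"). Note that no labels are used, which is what makes this unsupervised:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data, _ = load_iris(return_X_y=True)

# Ask for three clusters; fit uses only the features, never the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=23)
kmeans.fit(data)

print(kmeans.cluster_centers_.shape)   # (3, 4): three centers in feature space
```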

And then the last thing was model persistence, which shows how to actually save a model. We can do this using Python's pickling capability, in particular with the joblib library. It's very simple: these few lines save the model, we then show the saved model file right there, and then we can load the model back into our notebook and apply it. You can see that we still get the same results.
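A sketch of that persistence step (the model choice and filename here are illustrative assumptions):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

data, labels = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(data, labels)

# Save the fitted model to disk, then load it back and reuse it.
joblib.dump(model, "knn_model.joblib")
restored = joblib.load("knn_model.joblib")

# The restored model gives the same predictions as the original.
print((restored.predict(data) == model.predict(data)).all())
```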

So with that, I'm going to go ahead and end. I realize this was a very long video, but there were a lot of things I wanted to get through. Hopefully I've given you a taste of the importance of machine learning, the different types of data analytics, and in particular supervised and unsupervised learning, classification, regression, dimensionality reduction, and clustering. We'll be going through all of these in much more detail in future lessons. If you have any questions, let us know, and good luck.
