Hello. This lesson introduces scatter plots as a technique for visually exploring two-dimensional data. So far, we've been focusing on analyzing a one-dimensional dataset, which is effectively one column in a data frame. Now we're going to start comparing two dimensions, or two columns. We do this to see whether they share a positive, a negative, or even no correlation, and to identify outliers in any sort of relationship. All of these concepts are important because when we go to learn from data, we need to know whether there are relationships inherent in the dataset, and how to find outliers: data points that do not follow the trend that the rest of the data follow.

This lesson follows the Introduction to Scatter Plots notebook. Effectively, this notebook builds on the previous visualization notebooks that we've used in this course. We, of course, start by setting up our notebook to display all of the visualizations inline, as well as doing our standard imports and setting the warning filter to ignore specific warnings.

Now, scatter plots are quite simple. We're going to display one one-dimensional vector, x, against another one-dimensional vector, y. Just as we used the plot method before, we're going to use the scatter method in this case, because that will actually generate a scatter plot. Plot connects the points with lines, and we generally don't want that because it obscures the underlying data.

So how do we do it? Well, in this case, we're going to generate some data: x is linearly spaced between 0 and 100, and y is x plus some random noise. We then call scatter on these points to make the scatter plot, and label our plot. And what we end up with is this. You can see there is a positive correlation: as x increases, y also increases.
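A minimal sketch of the steps just described might look like this. The seed, the number of points, the noise scale, and the label text are all assumptions made for illustration, not values taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs as a plain script
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(23)  # hypothetical seed, for repeatability

# x is linearly spaced between 0 and 100; y is x plus random noise.
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 10, size=x.size)  # noise scale of 10 is an assumption

fig, ax = plt.subplots()
ax.scatter(x, y)  # markers only; ax.plot(x, y) would join the points with lines
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_title("Positive correlation")

# Pearson correlation coefficient: close to +1 for a strong positive trend.
r = np.corrcoef(x, y)[0, 1]
```

Printing r is a quick numerical check on what the plot shows by eye.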

Now, the main thing here that you may not have seen before is this method, despine. We set trim equal to True, which means to trim away the excess parts of the plot so that the axes here don't meet. We also offset them so that we can more easily see the relationship. You should try playing around with these values to see how that affects the plot. We've also set our x and y tick marks. Remember, the range will start at zero, end before 120, and go in strides of 20. So we will have 0, 20, 40, 60, 80, and 100 on both the x and y axes.
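The despine and tick settings could be sketched as follows. The offset of 5 is an assumed value, as is the sample data; only trim=True and the tick range of 0 up to (but not including) 120 in steps of 20 come from the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs as a plain script
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(23)  # hypothetical seed
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 10, size=x.size)

fig, ax = plt.subplots()
ax.scatter(x, y)

# np.arange(0, 120, 20) starts at 0, stops before 120, and steps by 20.
ticks = np.arange(0, 120, 20)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

# despine removes the top and right spines by default; offset moves the
# remaining spines away from the data, and trim cuts them at the end ticks.
sns.despine(ax=ax, offset=5, trim=True)  # offset=5 is an assumed value
```

Changing offset, or switching trim to False, is an easy way to see what each argument contributes.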

Now, we could also make a dataset that is negatively correlated and display that. In a negatively correlated dataset, as one variable increases, the second one decreases. The next case is a null correlation, where there is no correlation, and this would be a good example of that: as x increases, y doesn't show any distinct trend.
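To make the two cases concrete, here is a small sketch that builds one negatively correlated dataset and one with no correlation, and plots them side by side. The way the data are generated (100 minus x plus noise, and uniform random values) is an assumption; the notebook may construct them differently.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs as a plain script
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(23)  # hypothetical seed
x = np.linspace(0, 100, 50)

# Negative correlation: y falls as x rises, plus some noise.
y_neg = 100 - x + rng.normal(0, 10, size=x.size)
# Null correlation: y is drawn independently of x.
y_null = rng.uniform(0, 100, size=x.size)

fig, (ax_neg, ax_null) = plt.subplots(1, 2, figsize=(10, 4))
ax_neg.scatter(x, y_neg)
ax_neg.set_title("Negative correlation")
ax_null.scatter(x, y_null)
ax_null.set_title("No correlation")

# Pearson coefficients: near -1 for the first, near 0 for the second.
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_null = np.corrcoef(x, y_null)[0, 1]
```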

Now, one other important point about a scatter plot is that we can not only find correlations, we can also see data points that lie away from the main trend. This particular code cell demonstrates that. It makes a positively correlated dataset: as x increases, y increases. But we also have these two data points over here that are clearly outliers from the trend. If we were to do some sort of analysis, we might determine a model for this relationship between x and y, while also trying to understand why these outliers are present. Is it because a machine was incorrectly reporting values? Is it because a person entered the wrong data accidentally? Or is it a potential case of fraud, because somebody intentionally massaged the data to reflect better on themselves? So looking at data visually makes it easy to see a trend or to spot outliers, and that's one of the clear benefits of using scatter plots.
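One way to sketch this idea in code is to inject two points far from a positive trend, fit a straight line, and flag points with unusually large residuals. The injected positions and values are made up, and the three-standard-deviation rule is an illustrative choice rather than anything the notebook does; the lesson itself only spots the outliers by eye.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs as a plain script
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(23)  # hypothetical seed
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 5, size=x.size)

# Overwrite two points at large x with small y values (made-up outliers).
injected = [40, 45]
y[injected] = [5.0, 10.0]

# Fit a straight line and measure how far each point sits from it.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Illustrative rule: flag residuals beyond three standard deviations.
is_outlier = np.abs(residuals) > 3 * residuals.std()
flagged = np.flatnonzero(is_outlier).tolist()

fig, ax = plt.subplots()
ax.scatter(x[~is_outlier], y[~is_outlier], label="Trend")
ax.scatter(x[is_outlier], y[is_outlier], color="red", label="Flagged outliers")
ax.legend()
```

Note that the outliers themselves pull on the fitted line, so a rule like this is only a first pass; deciding why a point is an outlier still takes the kind of questions raised above.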

Now, in the previous examples, we looked at just one relationship, in this case between x and y. But we can also look at multiple relationships. So first, we're going to load in the Iris dataset and compare the sepal length versus the petal length. We can also compare multiple sets of measurements, in this case sepal versus petal. And what we've done here, if we come up and look at our scatter plot, is compare in red the sepal length versus the petal length, and in blue the sepal width versus the petal width. So these are two different relationships shown on the same plot, and we've distinguished them by color coding. We could add a legend here to indicate those differences. We could do that easily if we simply added a label argument to this scatter call saying 'length comparison', another label to this one saying 'width comparison', and then called legend. We'll see examples of this in later notebooks.
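Here is a hedged sketch of that labeled version. The small data frame is a handful of rows typed in by hand to stand in for the Iris data frame, and the column names and axis labels are assumptions; in the notebook, the full dataset would be loaded instead.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs as a plain script
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rows shaped like the Iris data; column names are assumed.
iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
        "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
        "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
        "petal_width": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    }
)

fig, ax = plt.subplots()
# Two relationships on one set of axes, distinguished by color and label.
ax.scatter(iris["sepal_length"], iris["petal_length"],
           color="red", label="Length comparison")
ax.scatter(iris["sepal_width"], iris["petal_width"],
           color="blue", label="Width comparison")
ax.set_xlabel("Sepal size")
ax.set_ylabel("Petal size")

# legend() picks up the label given to each scatter call.
legend = ax.legend()
```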

We can also compare datasets to trends, and we can compare multiple scatter plots. Here is an example similar to the rug plot that we saw earlier. But in this case, it's a scatter plot where we're trying to see whatever correlations might exist.

But perhaps most importantly, when we try to do this, there is a built-in function in Seaborn that creates what's called a pair plot. I like to think of it as a spreadsheet plot, in that we are plotting different columns, or features, against other features. So you can see the first column is sepal length, the second column is sepal width, et cetera, and the first row is sepal length. Now, a diagonal element of this array of plots is a feature against itself, such as sepal length against sepal length. So the way we represent this is by drawing a histogram instead of a scatter plot. The off-diagonal elements, then, are the scatter plots. And so you can see that it's symmetric: any plot that's down here is reflected on the other side of the diagonal. And here we are color coding each plot by the three different Iris species that are present. So this shows you it's a really simple approach: if we go up here and look, it was one line of code, once we've read in the Iris data frame, to make this plot. And it quickly shows the clustering that's present in the data.

The setosa is off by itself, and the versicolor and the virginica are somewhat separated here. We also have nice positive correlations between these variables. We also have a nice positive correlation between these two species in this particular plot. And this one's different. We can also see some outliers in specific examples. So you can see very quickly that this pair plot makes a very powerful visualization when you're just starting to explore your dataset, in terms of giving you clues to relationships that you might want to explore in more detail. A good example is petal length versus petal width: we can clearly see that linear, positive relationship.

I hope this has given you a nice introduction to the power of scatter plots and the ability to use these two-dimensional visualizations to better understand what's going on in your data. If you have any questions, be sure to let us know in the class forum. Good luck.