0:29

There are two activities in this lesson. The first is to read about techniques to improve a simple visualization by removing chart junk, and the second is to go through the introduction to data visualization notebook.

So first, let's take a look at this article. This is a really interesting article that talks about Edward Tufte's concept of data-ink, and how we want to reduce the amount of non-data ink in a visualization to better convey information. It has a really neat animated GIF that starts with a dataset presented in one way and goes through removing non-data ink, step by step, to more clearly highlight the important concepts. You can see that it starts with the original graphic here, and eventually ends up with something like that.

1:19

Now what about making these visualizations in Python? We're going to look at three techniques for visualizing data, along with several alternatives. The first is a rugplot, which is a simple way to look at the distribution of a one-dimensional dataset. We're going to use the tips dataset that we've used before; we can simply call Seaborn's rugplot and we get this plot here.

Now we could do a better job. We could make this look better by making a few additions using Matplotlib techniques. For instance, we can make it a true one-dimensional plot, we can label the x-axis, we can add a title, and we can change the color and thickness of the lines as necessary. So this shows you the distribution of the data. While the average might be out here at, say, 25, you can see there's a lot more of the data down here, so it's a somewhat skewed distribution.

2:11

We can also compare two datasets directly by placing their rugplots next to each other, and we're going to do this by making two axes on the same Matplotlib figure. To do this, we simply call plt.subplots and say that, in this case, we want two rows and one column, and moreover that the two rows share the same x-axis. The reason we do this is that, in order to compare two datasets over the same range, we want that x-axis to be common to both. Having done that, here are our two rows and our one column, and here's our figure size.

Now, notice that I've made lists with two colors and two titles, so that we can iterate through our two datasets. I've pulled out the total bill column, in particular, for lunch and for dinner, each as a NumPy array, and appended them to this list. So our list now has two NumPy arrays: the first is the total bill for lunch, and the second is the total bill for dinner. We're going to iterate through this list and make a rugplot for each one, cleaning each rugplot up as we did before.

These are now displayed together, and you can see that lunch is more clearly skewed to lower total bills, while dinner has a much wider spread. Intuitively that makes sense, but this visualization really shows it. If we wanted to, we could add vertical lines where the average dinnertime and the average lunchtime total bills are, and clearly see those differences. That shows the power of a visualization to convey information.

3:52

The second plot we want to make is called a boxplot. A boxplot takes the idea of a rugplot, showing that one-dimensional dataset, but it actually provides the quantile information, as well as outliers, directly in the plot. This is probably easier shown than discussed, so here we go. Here's our boxplot for the total bill, not separated out at all. The notch shows the median, the 50% point; the box spans the middle 50% of the data; and these whiskers show effectively the min and max of the data once outliers are set aside. Now, the algorithm has a way of identifying outliers, and it shows those outliers as dots. So here we can see that there's some data with a really high total bill that was flagged as outliers and plotted beyond the whisker.
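To make the pieces of the boxplot concrete, here is a hedged sketch that draws one and then recomputes the same quantities by hand. The data is randomly generated, not the tips dataset, and notch=True is my guess at what "the notch" refers to. The outlier rule shown (points more than 1.5 times the interquartile range beyond the box) is the default in Seaborn and Matplotlib; the hand calculation here uses NumPy's default percentile interpolation, so its numbers may differ very slightly from what the plot draws.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Invented, right-skewed stand-in for the total bill column.
rng = np.random.default_rng(0)
bills = rng.gamma(shape=4.0, scale=5.0, size=200)

fig, ax = plt.subplots(figsize=(8, 2))
sns.boxplot(x=bills, notch=True, ax=ax)
ax.set_xlabel("Total bill (dollars)")

# The box runs from the first to the third quartile, with the median inside it.
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Whiskers stop at the last data point inside these fences;
# anything outside the fences is drawn as an individual dot.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = bills[(bills < low_fence) | (bills > high_fence)]

print(f"median={median:.2f}, box=({q1:.2f}, {q3:.2f}), outliers={len(outliers)}")
```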

4:42

So you can see the boxplot gives the same information as the rugplot, but very simply shows where the data is concentrated: most of the total bills fall between roughly $12.50 and about $24. The distribution is skewed to the right, if you will; the data extends further to the right of the median than to the left. That's pretty impressive for a simple visualization. Now we can do the same thing, but split the total bill by the time, and this code here shows how to do that. Here we go. Now we have the lunch and we have the dinner, and you can see that the median dinner bill is higher than the median lunch bill. The dinner bills also have a much wider spread than the lunch bills. This very quickly allows us to compare these two datasets.
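A minimal sketch of splitting the boxplot by time might look like this. The data is an invented stand-in for the tips dataset, and the horizontal orientation (total bill on the x-axis) is an assumption on my part.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset (invented values).
tips = pd.DataFrame({
    "total_bill": [9.6, 11.2, 12.5, 13.4, 14.8, 16.0,
                   10.1, 15.3, 20.6, 25.9, 34.3, 48.2],
    "time": ["Lunch"] * 6 + ["Dinner"] * 6,
})

fig, ax = plt.subplots(figsize=(8, 3))

# One box per value of the time column, sharing the total bill axis.
sns.boxplot(data=tips, x="total_bill", y="time", ax=ax)

# The same comparison as numbers: the median bill for each meal.
medians = tips.groupby("time")["total_bill"].median()
print(medians)
```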

5:25

Now, sometimes you want to compare not just two datasets but more, and we can do that by turning the boxplot on its side. Here, we are now breaking out the total bill by day of the week. So we have Thursday, Friday, Saturday, and Sunday, and you can see that the weekend has a wider range than the weekdays, and higher total amounts as well: the medians are higher and the maximums are higher. So again, this conveys information, and it was fairly straightforward to make. We simply had to call boxplot with our data, the tips dataset, and say that the x-axis is the day column and the y-axis is the total bill.
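That call could be sketched as below. The x and y choices follow the narration; the data values are invented, and the explicit order list and axis labels are my additions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset (invented values).
tips = pd.DataFrame({
    "total_bill": [9.6, 12.5, 14.8, 16.0, 11.2, 13.4, 18.9,
                   10.1, 20.6, 25.9, 34.3, 15.3, 22.4, 29.8, 48.2],
    "day": ["Thur"] * 4 + ["Fri"] * 3 + ["Sat"] * 4 + ["Sun"] * 4,
})

fig, ax = plt.subplots(figsize=(8, 4))

# One vertical box per day; order fixes the left-to-right sequence of days.
day_order = ["Thur", "Fri", "Sat", "Sun"]
sns.boxplot(data=tips, x="day", y="total_bill", order=day_order, ax=ax)

ax.set_xlabel("Day")
ax.set_ylabel("Total bill (dollars)")
```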

6:08

Now, sometimes you want to see the actual data points, and the way to do this is with a Seaborn swarmplot. Let me just show you. This is exactly the same as the previous plot, but rather than the boxes, we see the actual data. What Seaborn does is spread the points out, much like adding jitter: it moves them a little bit in the x direction so that they don't overlap and each one can be seen. If we didn't spread them out, we would just have a vertical line of points at each of these columns, and that might be confusing; we wouldn't see how the data is distributed. So this is a powerful technique to see the actual data. You can see and compare: there are clumps down here, this is pretty clumpy right here, and so on. So sometimes swarmplots are a good way to see the intrinsic structure in a dataset.
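As a sketch, the swarmplot version only changes the function being called. The data is invented, as before. One note from me rather than from the lecture: swarmplot places points so they don't overlap, whereas Seaborn's stripplot is the function that applies random jitter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset (invented values).
tips = pd.DataFrame({
    "total_bill": [9.6, 12.5, 14.8, 16.0, 11.2, 13.4, 18.9,
                   10.1, 20.6, 25.9, 34.3, 15.3, 22.4, 29.8, 48.2],
    "day": ["Thur"] * 4 + ["Fri"] * 3 + ["Sat"] * 4 + ["Sun"] * 4,
})

fig, ax = plt.subplots(figsize=(8, 4))

# Same x and y as the boxplot by day, but every observation is drawn as a point.
sns.swarmplot(data=tips, x="day", y="total_bill",
              order=["Thur", "Fri", "Sat", "Sun"], ax=ax)

# Count the plotted points to confirm nothing was dropped.
n_points = sum(len(c.get_offsets()) for c in ax.collections)
```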

Now the last thing we want to look at is the histogram. Histograms are something you've probably seen before; they're somewhat like a bar chart. To make a histogram, we simply call the hist method. Here we're passing in the total bill column, and we get a histogram out. Now, one thing you should notice is that I've passed in this alpha parameter. In case you haven't seen it before, what it does is affect the transparency of the color. You should try running these notebooks and changing this value to see how the plot changes. If alpha is one, the color is fully opaque: it's darker, more bold. With a lower alpha it's a little more transparent, a little softer. We also specify the font size, which changes the size of the text labels on our plot.
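Here is a small sketch of such a call using Matplotlib's Axes.hist. The alpha value, font size, and label text are placeholders of mine, and the data is invented.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Invented stand-in for the total bill column.
bills = np.array([9.6, 12.5, 14.8, 16.0, 11.2, 13.4, 18.9,
                  10.1, 20.6, 25.9, 34.3, 15.3, 22.4, 29.8, 48.2])

fig, ax = plt.subplots()

# alpha runs from 0 (fully transparent) to 1 (fully opaque).
counts, edges, patches = ax.hist(bills, alpha=0.5)

# fontsize controls the size of the text labels.
ax.set_xlabel("Total bill (dollars)", fontsize=14)
ax.set_ylabel("Count", fontsize=14)
```

Changing alpha to 1.0 and rerunning is a quick way to see the difference the narration describes.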

The rest of this notebook talks about changing things with histograms, like the binning and the range the histogram covers, as well as various other techniques for interpreting histograms, like comparing multiple histograms.
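I can't reproduce the notebook's later cells here, but a rough sketch of controlling the bins and range, and of overlaying two histograms, might look like this. The lunch and dinner values are invented, and the bin edges (ten bins from 0 to 50) are an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Invented stand-ins for the lunch and dinner total bills.
lunch = np.array([9.6, 11.2, 12.5, 13.4, 14.8, 16.0])
dinner = np.array([10.1, 15.3, 20.6, 25.9, 34.3, 48.2])

# Explicit bin edges fix both the bin width and the range covered,
# so the two histograms line up bin for bin.
bins = np.linspace(0, 50, 11)  # ten bins, each 5 dollars wide

fig, ax = plt.subplots()
lunch_counts, lunch_edges, _ = ax.hist(lunch, bins=bins, alpha=0.5, label="Lunch")
dinner_counts, dinner_edges, _ = ax.hist(dinner, bins=bins, alpha=0.5, label="Dinner")

ax.set_xlabel("Total bill (dollars)")
ax.set_ylabel("Count")
ax.legend()
```

The transparency from alpha is what lets both histograms stay visible where they overlap.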

I encourage you to go through and look at these, and try to understand how to make your own histograms, as well as boxplots, swarmplots, and rugplots. If you have any questions about histograms, or about the other techniques you've seen for visualizing a one-dimensional dataset, please let us know in the course forum. Good luck.
