0:10

So in this lecture we're gonna have kind of a very different sort of lecture than what we've been doing for most of the class, and we're just gonna talk about plotting your data. And the reason I held off until now on talking about plotting your data, which might well be the most important aspect of data analysis, is because I wanted to demonstrate that some of these plots are estimators. They're multivariate estimators, but they estimate things like densities. And we couldn't really introduce plotting to estimate densities unless everyone knew what a density was. So, let's get right to it.

0:47

One of the most well-known forms of plot is just the simple histogram. Histograms just display a sample estimate of the density or mass function, and they're basically just bar graphs of the frequency or proportion of times that a variable takes specific values, or bins of values for continuous data. So it's probably easier to explain this with examples than with words. The data set islands, which comes with R, contains the areas of the world's major land masses in thousands of square miles, and you can load it into R by just typing data(islands).

1:26

And you can view the data by typing islands, and it'll just show you the list of numbers. And we can create a histogram with the command hist(islands), and if you just do ?hist, it'll give you the options for the command hist, which include things like the bin lengths and how you break up the histogram. In the picture we see that we had 41 islands that fell in this first range of area. You can see this is a very crummy histogram. It doesn't give you much information. And the reason is that most of the islands are really small.

2:23

And there's only a handful of big ones. And so maybe there's something more informative that we can do, but let's talk about the pros and cons of histograms first. So histograms are useful. They're easy. They make sense. They work on discrete and even unordered data. You can make a histogram of M&M colors or hair colors, or whatever. They're just bar plots of frequency. There are some problems with them: they use a lot of ink and space to display not so much information, and you can replace them with a table of the frequencies pretty easily. And it's a little bit difficult to compare several at a time.

And then, as I pointed out before, the specific use of the histogram for this data set isn't very good. You should maybe log the data to spread things out a little bit. In this case, I did log base ten, so it gives you orders of magnitude. And then if you look, the histogram's much better. The numbers now on the horizontal axis are log base ten of the areas, so you're looking at orders of magnitude, which is probably a lot better to look at.
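As a rough sketch of the two histograms just described, something like the following should reproduce them. The plot titles, axis labels, and the choice of breaks = 20 are my own assumptions for illustration, not what was on the slide.

```r
# Areas of the world's major land masses, in thousands of square miles.
data(islands)

# Raw scale: a few huge land masses squash everything else into one bin.
h_raw <- hist(islands, main = "Raw areas",
              xlab = "Area (1000s of sq. miles)")
h_raw$counts[1]   # number of land masses falling in the first bin

# Log base 10 scale: the horizontal axis now reads in orders of magnitude.
h_log <- hist(log10(islands), main = "log10 areas",
              xlab = "log10(area)")

# breaks controls the binning (see ?hist); 20 is an arbitrary choice here.
h_fine <- hist(log10(islands), breaks = 20,
               main = "log10 areas, finer bins", xlab = "log10(area)")
```

hist returns the bin edges and counts invisibly, so you can inspect them without reading numbers off the picture.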

3:25

Stem and leaf plots are just another way of doing this really quickly, on the fly. If you ever need to create a histogram and all you have is a pen and a piece of paper, then a stem and leaf plot is definitely what you want to do. It was created by John Tukey, who was a famous statistician responsible for lots of inventions. He was one of the co-inventors of the fast Fourier transform and many, many other statistical and signal processing techniques. For a stem and leaf plot, basically what you do is you pick a digit that you're going to kind of break the data on.

4:03

And then you take the digits to the right of that and you stack them up. It's probably easier to just show this than to actually describe it, so I type in stem(log10(islands)), and the decimal point is at the vertical line here, which in this case is a bunch of pipe characters. So when we look at the one on the left and the one on the right, that means 1.1. One island had land area ten to the 1.1, because remember we took log 10. And then we can count, let's see, how many are there? There's one, two, three, four, five, six ones. So there were six islands that had log area 1.1, and then because there were so many in that bin, they broke the 1 bin up into those below five and those five and above.

At any rate, you can see how you can do this really quickly. You just pick the decimal place, and then you take the rounded digit immediately after the decimal place and you just start stacking them up. They wouldn't be in order at first, so then maybe you'd do it again, where you reshuffle the numbers on the right so that they're in order. It makes for a very convenient plot. It gives you a quick histogram. It's a quick density estimate, and you can do it on the fly really easily.
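The stem and leaf display described above comes from a single call. A minimal sketch follows; the second call with scale = 2 is my own addition, just to show that the stems can be split further.

```r
data(islands)

# Stems are the digit before the decimal point of log10(area);
# leaves are the first digit after it, rounded.
stem(log10(islands))

# scale stretches the display, splitting each stem into more rows.
stem(log10(islands), scale = 2)
```

The output is plain text printed to the console, which is why it works just as well with pen and paper.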

5:14

Another useful plot is a dotchart. Dotcharts just display the entire data set, one dot per point. And you'll say, well, that seems like a pretty uninformative plot, but it's usually quite informative, especially if you can order the dots well. So the ordering of the dots and the labeling of the axes can display a lot of information. And because dotcharts show the entire data set, you could in principle reconstruct the data set from a dotchart. So it has really high data density, but there are problems with them. They may be difficult to construct and difficult to interpret for data sets with lots of points. You may get over-plotting and stuff like that.

So if you look at this dot chart, where I did the dot chart for the log10 area, that's log10 of thousands of square miles, you see it just plots all of the data. It's the same thing as before, but it gives you a good idea of the density of the points. And you could really improve on this chart, for example, by maybe grouping land masses together. Pacific islands, maybe put all of those together, for example, and so on, so you could get a lot more information out of it. But at any rate, that's a dot chart there. They're good to make.

So yeah, just on the next slide I mention that I ordered everything alphabetically, which is how the islands data are stored and so what dotchart shows by default, but you can play around with it. Playing around with plots is the key thing you want to do. You want to just keep doing them until you get something informative.
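Here is a small sketch of the dot chart for the log10 areas, followed by one way of playing around with the ordering. Sorting by size is my own choice of reordering, and the cex value and axis label are assumptions made only so the labels fit.

```r
data(islands)
log_area <- log10(islands)

# One dot per land mass, in the order the data are stored (alphabetical).
dotchart(log_area, cex = 0.6,
         xlab = "log10 area (1000s of sq. miles)")

# Same data, reordered by size, which makes the distribution easier to read.
dotchart(sort(log_area), cex = 0.6,
         xlab = "log10 area (1000s of sq. miles)")
```

Because the vector is named, dotchart picks up the land mass names as labels automatically.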

7:18

And you can obtain this data set with the command data(InsectSprays), and then I give you the code for actually creating this plot there. Maybe some of the fancier parts of the plot I omitted here, but this is the basic gist of how I created the plot. If you can generate better code to do this, I'd love to see it. But anyway, then you look at the plot on the next page, and the nice thing about these plots is that they display every single data point. You get a very good sense of what's going on. You can see that sprays C, D, and E appear to be different than sprays A, B, and F.

You have the confidence intervals for each of the groups and you have the mean for each of the groups, and they display a lot of information. For example, for group F, you can see this kind of bimodality of the distribution. You can see this sort of outlier in group D, or maybe it's not an outlier, but nonetheless, you can visualize these things quite easily with a dot chart. And because every group only has a handful of points, it would be a shame to aggregate them into something else that obscured the data.
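The slide code isn't reproduced in this transcript, so what follows is only one possible way to draw a plot like the one described: every point by group, with group means and confidence intervals added. Using stripchart, t-based 95% intervals, and the small horizontal offset are all my assumptions, not necessarily what the original code did.

```r
data(InsectSprays)

# Every observation, one column of jittered dots per spray.
stripchart(count ~ spray, data = InsectSprays, vertical = TRUE,
           method = "jitter", jitter = 0.1, pch = 1,
           xlab = "Spray", ylab = "Insect count")

# Group summaries: mean, standard deviation, and sample size.
grp_mean <- tapply(InsectSprays$count, InsectSprays$spray, mean)
grp_sd   <- tapply(InsectSprays$count, InsectSprays$spray, sd)
grp_n    <- tapply(InsectSprays$count, InsectSprays$spray, length)

# Half-width of a t-based 95% confidence interval for each group mean.
half_width <- qt(0.975, df = grp_n - 1) * grp_sd / sqrt(grp_n)

# Draw means and intervals slightly to the right of the raw points.
x_pos <- seq_along(grp_mean) + 0.25
points(x_pos, grp_mean, pch = 19)
segments(x_pos, grp_mean - half_width, x_pos, grp_mean + half_width)
```

Keeping the raw points next to the summaries is the point of the design: the intervals tell you about the means, and the dots show you things like the two clusters in group F.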

8:23

But you might ask yourself, what if instead of having ten or 15 or 20 points per group, I had 10,000 points per group? What should I do? Well, then maybe don't do a dot chart anymore, because you're gonna get so much over-plotting that you're not gonna be able to see anything meaningful. Maybe do something like a box plot. Box plots were also invented by Tukey, and they basically just show the distribution in terms of quantiles. So the center line of the box in a box plot represents the median, while the box edges correspond to the quartiles. So the boxes give you some information about the density, represented there by only three numbers. And then the whiskers extend out to a constant times the interquartile range, which is the difference between the 75th and 25th percentiles of the data, or they get capped off at the max or min value. And that constant, Tukey chose it by kind of relating it to a standard normal. And then sometimes outliers are denoted by points beyond the so-called whiskers. And you can see skewness in the data by the center line being near one edge of the box or the other.

So if we just take the same insect spray data and do a box plot, in this data set you probably don't wanna do that, but you do get the same relative picture. You've lost some detail. You've lost information about that bimodality of the density for insect spray F, but you do notice maybe something's going on there, that the distribution's very skewed towards higher values, so you might investigate it more and discover that kind of little two-group clustering.
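A minimal sketch of the box plot for the same data follows. In base R the whisker constant is the range argument of boxplot, which defaults to 1.5 times the interquartile range; the axis labels are my own.

```r
data(InsectSprays)

# One box per spray; range = 1.5 (the default) sets how far the whiskers
# may extend, as a multiple of the interquartile range.
bp <- boxplot(count ~ spray, data = InsectSprays, range = 1.5,
              xlab = "Spray", ylab = "Insect count")

# Five numbers per group: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
bp$stats

# Points beyond the whiskers, and which group each one belongs to.
bp$out
bp$group
```

Looking at bp$stats makes it concrete how little of the data survives into the picture: five numbers per group, plus any flagged points.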

10:03

It does kinda catch these outliers for C and D, but sometimes you have so many data points that the outliers are just a constant mash of outliers, and then there's no point in plotting them. But at any rate, this is a box plot. I think for this particular data set you're better off doing the dot chart, but if you had lots and lots of observations, a box plot seems like a pretty reasonable thing to do.

There have been improvements on box plots. People say, why do it as a box when I could do it as something that uses a lot less ink, and what's this constant times the IQR business? That's maybe a little bit difficult to interpret. And so there have been lots of refinements, but these are your plots, ultimately, when you create them, so you can use them to investigate the things you really wanna look at. But it's a very reasonable idea to create, in this case, vertical summaries of grouped data that are based on distributional properties, like means, medians, quartiles and so on. If your boxes get too squished, try logging your data if it's positive. Or maybe you could do a cube root if it's positive and negative.

11:07

For data with lots and lots of observations, you often want to omit the outliers, because you get this big mash of black from over-plotting where you can't see individual points. And there is no point in calling them outliers anymore if there are hundreds of them. Here's an example of a bad box plot. I just give you some R code to generate a bad box plot, right. It's all squished together. There are too many outliers being displayed, and the fact that there are so many outliers means that all the interesting aspects of the kind of meat of the data are obscured by the outliers. So that's box plots.
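The code for the bad box plot isn't in this transcript, so here is a hypothetical stand-in that has the same problem. The simulated Cauchy data, the four groups, and the group size are all my assumptions. The last two calls sketch the fixes mentioned above: dropping the outlier points, and a cube root transform for data that is both positive and negative.

```r
set.seed(1)

# Hypothetical heavy-tailed data: 4 groups of 10,000 Cauchy draws each.
n <- 10000
g <- gl(4, n, labels = paste0("group", 1:4))
y <- rcauchy(4 * n)

# Bad: the boxes are squished flat and the outliers form a solid streak.
bp_bad <- boxplot(y ~ g, main = "Too many outliers")
length(bp_bad$out)   # how many points were drawn individually

# Option 1: keep the boxes and whiskers, leave the outlier points out.
boxplot(y ~ g, outline = FALSE, main = "Outliers omitted")

# Option 2: signed cube root, which works for positive and negative values.
cube_root <- sign(y) * abs(y)^(1 / 3)
boxplot(cube_root ~ g, main = "Signed cube root")
```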

So box plots kind of give you a density estimate by grabbing a bunch of quantiles from the data and using them. Kernel density estimates, on the other hand, are direct density estimates, in the same way histograms are direct density estimates. But kernel density estimates maybe are a little bit better. The idea is that you're weighting observations according to a kernel, and in most cases that kernel is a Gaussian density, and then you have to pick a parameter that determines how smooth or jiggly your density estimate is going to be, called the bandwidth. And your density estimate is itself a statistical estimate. It has variability that you should probably investigate as well. And you should investigate how the bandwidth impacts that variability and the estimate itself.

But it's not like this is something that's unique to kernel density estimates. For example, if you take a histogram, the width and construction of the bins in a histogram play the same role as the bandwidth, so you still have that tuning parameter you have to work with. And again, the width and the number of bins in a histogram can impact what it looks like. But in addition, a histogram's also an estimate with noise. So kernel density estimates and histograms and so on, they all are statistical estimates that have variation, and it's maybe unfortunate, and I'm gonna do this as well, that when you plot these things, you don't explicitly acknowledge the uncertainty in the density estimation. So that is kind of a problem. But maybe the solutions to it are a little bit above the discussion for this class.
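Written out, the weighting idea looks like this. This is the standard textbook form of a kernel density estimate with a Gaussian kernel; the symbols n, h, K and x_i are notation I'm introducing here, not something from the slides.

```latex
% Kernel density estimate from observations x_1, ..., x_n with bandwidth h > 0
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) = \frac{1}{\sqrt{2\pi}} \, e^{-u^2 / 2}.
```

A small h puts a narrow bump on each observation and gives a jiggly estimate; a large h spreads each bump out and gives a smooth one.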

So anyway, the R function density can be used to create a density estimate. So here are the waiting times between eruptions and the eruption durations, in minutes, for the Old Faithful geyser in Yellowstone National Park. You can grab this data by just doing data(faithful), and d is the density estimate, and this bandwidth parameter here gives a specific rule for selecting the bandwidth of the density estimate, and then plot creates the plot. So there's our density estimate, and it actually gives you the specific bandwidth it used. And you can see there's an incredibly obvious feature in this data set, with a peak at 4.5 minutes, let's say: the eruptions seem to occur in two time periods. But you also get a sense of the variation around those eruption times as well. So anyways, kernel density estimates are a nice way to estimate a density, and maybe are an improvement over a histogram, I think, by smoothing out the data.
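A sketch of the density estimate just described follows. I'm assuming the slide used the eruption durations and the "SJ" bandwidth selector, since that matches the peak near 4.5 minutes, but the transcript doesn't say which rule was used. The adjust values in the second half are arbitrary, chosen only to show what the bandwidth does.

```r
data(faithful)

# Kernel density estimate of eruption duration (minutes).
# bw = "SJ" picks the bandwidth by the Sheather-Jones rule (an assumption).
d <- density(faithful$eruptions, bw = "SJ")
d          # printing reports the bandwidth that was used
plot(d, main = "Old Faithful eruption durations")

# Same data with the bandwidth scaled down and up to see its effect.
d_rough  <- density(faithful$eruptions, bw = "SJ", adjust = 0.25)
d_smooth <- density(faithful$eruptions, bw = "SJ", adjust = 4)
lines(d_rough,  lty = 2)   # jiggly
lines(d_smooth, lty = 3)   # oversmoothed, the two humps start to merge
```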

14:13

Here's another example. I took an MRI. I took a single axial slice and then I disregarded the spatial locations. So I just have a bunch of intensities that are on a gray scale. So here's the image. You can see this is an axial slice. Here are the ventricles. Here is gray matter. Here is white matter. Here's the skull. And if you were to just take this collection of numbers, the intensities, and disregard where in the image they are and just treat them as a list of numbers, and put that into a kernel density estimate, you might get something like this, where you can see kind of specifically where you have lots of background voxels, you have kind of a hump for gray matter voxels, a hump for the white matter voxels, and so on.

14:58

And this is a pretty common technique. In fact, it's so common that it's built into your camera, right? Your digital camera, if you have one, often will have a histogram estimate built into it. You can actually look at the image histogram. And if you don't do it on a digital camera, certainly whatever image processing software you have will give you a histogram of the intensity values from the image. And this is exactly what they're doing. If they do a kernel density estimate, they're smoothing out that histogram. If they do boxes, then they're just discretizing it.

Quantile plots are extremely useful for comparing a distribution to a theoretical distribution. So a great example of this is if you want to suggest that your data is normally distributed, you might want to compare the empirical quantiles from your data to the theoretical quantiles of a normal distribution. If there is a significant departure from a line, then that's going to tell you that the quantiles of your empirical data don't look like the quantiles from a theoretical normal distribution. So it's a useful diagnostic tool. You could instead do a histogram of your data and then overlay it, say, on a normal density. But the reason QQ-plots are good is that they kind of focus in exactly on the comparison between the two distributions, quantile by quantile. And they really tend to highlight the differences much more effectively than, say, overlaying two histograms, where it's kind of hard to tell.

But here's why you want to check for whether or not it's a line. So let's let x sub p be the pth quantile from a normal mu, sigma squared distribution, so a nonstandard normal. Then by definition, the probability of X being less than or equal to x sub p is p.

17:05

Now subtract mu and divide by sigma on both sides of that inequality, and you get that the probability of Z being less than or equal to x sub p minus mu, over sigma, is p, so that quantity is the standard normal quantile z sub p. So we've just basically converted the X random variable to a Z random variable. So then you can go back and forth between the quantiles: the x quantile, x sub p, is mu plus the z quantile, z sub p, times sigma. So, again, as I put here, this should not be news to you, that you can convert between nonstandard quantiles and standard quantiles by either standardizing the nonstandard quantiles or by doing mu plus sigma times the standard quantiles.
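Spelled out as equations, with X distributed normal(mu, sigma squared) and Z = (X - mu) / sigma standard normal, the argument above reads as follows. The notation is reconstructed from the spoken description, so treat the exact layout as mine rather than the slide's.

```latex
\begin{align*}
P(X \le x_p) &= p
  && \text{definition of the $p$th quantile } x_p \\
P\!\left(\frac{X - \mu}{\sigma} \le \frac{x_p - \mu}{\sigma}\right) &= p
  && \text{subtract } \mu \text{, divide by } \sigma > 0 \\
P\!\left(Z \le \frac{x_p - \mu}{\sigma}\right) &= p
  && \text{so } \frac{x_p - \mu}{\sigma} = z_p \\
x_p &= \mu + \sigma z_p
  && \text{a line in } z_p \text{ with intercept } \mu \text{ and slope } \sigma
\end{align*}
```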

17:36

So at any rate, the result is that quantiles from any nonstandard normal distribution should be linearly related to standard normal quantiles. So what a normal QQ-plot does, for example, is it plots the empirical quantiles of your data versus theoretical standard normal quantiles. In R, qqnorm does a normal QQ-plot, and qqplot basically plots your empirical quantiles versus quantiles from any other distribution that you supply.

And here's an example of a normal QQ-plot. In this plot it's basically saying that at the high end your sample quantiles are too large, and at the low end your sample quantiles are too negative. So in this case it means your data is heavier tailed than a normal. It has excessively large upper quantiles and excessively negative lower quantiles.

In this next example your upper quantiles are too large, right, and your lower quantiles are all smooshed up at zero, and that would be indicative of an instance where your data follows some right-skewed distribution rather than a normal distribution. I think in this case, to generate this plot, I used a gamma and compared it to a standard normal. And then here's an example where I generated data from a normal distribution and plotted the quantile-quantile plot versus the actual normal distribution, and of course, it looks pretty good.

Again, with the QQ-plot, the theoretical quantiles are, of course, exactly right, but the sample quantiles are measured with noise. So the normal QQ-plot again doesn't account for the uncertainty in estimating those quantiles, and so really you should have for these QQ-plots maybe some grey lines to indicate the uncertainty around the plot itself.
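The three QQ-plots described above can be imitated with simulated data. This is only a sketch: the gamma comes from the lecture, but the t distribution for the heavy-tailed case, the sample size, the parameter values, and the seed are all my assumptions.

```r
set.seed(42)
n <- 1000

heavy  <- rt(n, df = 3)                  # heavier tails than a normal
skewed <- rgamma(n, shape = 2)           # right-skewed, bounded below by 0
normal <- rnorm(n, mean = 10, sd = 2)    # nonstandard normal

op <- par(mfrow = c(1, 3))
qqnorm(heavy,  main = "Heavy tails");  qqline(heavy)
qqnorm(skewed, main = "Right skew");   qqline(skewed)
qqnorm(normal, main = "Normal");       qqline(normal)
par(op)

# For the normal sample the points should fall near the line
# x_p = mu + sigma * z_p, so intercept and slope estimate mu and sigma.
qq  <- qqnorm(normal, plot.it = FALSE)
fit <- coef(lm(qq$y ~ qq$x))
fit

# qqplot compares against any quantiles you supply, here a gamma's.
qqplot(qgamma(ppoints(n), shape = 2), skewed,
       xlab = "Theoretical gamma quantiles", ylab = "Sample quantiles")
abline(0, 1)
```

qqline draws a reference line through the first and third quartiles, which is a convenience for the eye rather than a statement of uncertainty.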

20:01

Say you wanted some plot of the bivariate distribution for two discrete random variables. Well, here's Fisher's data on hair and eye color, right? So we want to talk about the distribution of hair and eye color when Fisher was looking at people from a particular area. And here's the contingency table down here, and you see the different hair and eye colors. And one plot that you could do to look at this is the so-called mosaic plot.

20:31

A mosaic plot just sort of breaks everything up into squares and rectangles, where the size of the rectangles represents the size of the counts, and so it gives you a pretty immediate way to look at the bivariate distribution of the two variables. It's sort of like a two-dimensional bar chart. It's a quick display. Another thing you could do, perhaps, is some sort of 3D bar chart, but what's nice about this is that it doesn't require extending things out to a third dimension, and it gives you a quick way to look. You can see the low counts for red hair, for example, across all eye colors. And you can see really quickly and obviously, by and large, green being very consistent across eye colors, for example, but that fair hair seems to change with eye color quite a bit. So you can pick out these patterns really quickly. And so I'm going to say mosaic plots are a nice, if maybe a little underused, technique in plotting.
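A sketch of a mosaic plot for a hair and eye color table follows. I'm assuming the table on the slide is the caith data set from the MASS package (eye color by hair color for people in Caithness, Scotland), which fits the description but isn't named in the transcript. The dimension names and title are mine.

```r
library(MASS)   # recommended package that ships with most R installations
data(caith)

# caith is stored as a data frame; convert it to a matrix of counts.
tab <- as.matrix(caith)
names(dimnames(tab)) <- c("Eye", "Hair")
tab

# Rectangle areas are proportional to the cell counts.
mosaicplot(tab, main = "Eye color by hair color", color = TRUE)
```

Reading the plot is a matter of comparing how the vertical splits change from one column to the next: if hair and eye color were independent, the splits would line up.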

21:35

So that was a whirlwind tour of some basic plotting techniques that you can use, and hopefully it will get you up to speed and running really quickly with plotting. I wanted to mention at the end, though, that you should really not constrain yourself when you're plotting your data. Beyond these techniques, anything's fair game. Plotting, exploratory data analysis, is an essential component of applied statistics, and you can't really do any of the probability modeling that I'm suggesting unless you dive into the data a little bit first. It will give you a sense of whether your probability modeling is ridiculous before you even start. So it's an essential part.

And today we just gave you a handful of techniques, but when confronted with a problem, you should attack the data with as many plots as you can think of, and they tend to be very informative. I think Tukey called it interocular content, in other words, that the conclusions sort of hit you right between the eyes. And that's what plots can do for you that probability models can't do anywhere near as well.