0:02

This lecture is about some basic principles for building analytic graphics. The goal is to provide some general rules that one can follow when building analytic graphics from data and trying to tell a story about what's happening in the data. I find these rules quite useful when thinking through building data graphics, and they apply to many different situations.

Â 0:23

I've cribbed these rules from a book by Edward Tufte. I'll give the reference at the end of the lecture. They come from his book Beautiful Evidence, where he goes through a number of principles, and I'm going to talk through some of these ideas and show them in context. The first principle that Tufte talks about is to show comparisons. This is a very basic idea in all of science: evidence for a hypothesis or an idea about the world is always relative to another hypothesis. So evidence is always relative.

Â 0:59

If you're considering Hypothesis A, there has to be some alternative hypothesis that you're going to compare it to. So whenever you hear a statement or a summary of evidence based on data, you should always be asking the question, "Compared to what?"

Here's a quick example. We have a boxplot that looks at the effect of an air cleaner on the asthma symptoms of children in a set of homes. An air cleaner was introduced into a child's home to reduce indoor air pollution levels, and we want to see whether the child's asthma symptoms improve. The outcome is what's called symptom-free days, so a higher number is better. Here we can see that the group of children who got the air cleaner in their home had an increase in their symptom-free days. The median increase was about one symptom-free day over 2 weeks. That's a positive outcome, and we might be inclined to think that the air cleaner works.

But of course, the real question is: compared to what? What do we compare the air cleaner to? In this case, we compare it to nothing. The control set of houses have no air cleaner, and the families just live their daily lives. This was a randomized controlled trial that compared installing an air cleaner in a child's home with installing nothing. You can see now that in the control homes, the average change in symptom-free days was about 0, so there's really no change, while the average change in the air cleaner homes was about 1 symptom-free day per 2 weeks. Now you can say that, relative to doing nothing, the air cleaner is actually a little bit better at improving the child's symptoms. So it's always important to show a comparison in a plot.
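
The "compared to what?" question can be made concrete with a few lines of code. What follows is a minimal sketch in plain Python, not the code behind the boxplot on the slide. The per-child numbers and the name `median_change_by_group` are invented for illustration; they are chosen only so that the medians roughly match the values quoted in the lecture.

```python
from statistics import median

# Hypothetical per-child changes in symptom-free days over 2 weeks
# (follow-up minus baseline). Invented for illustration; NOT the trial's data.
changes = {
    "control": [-2, -1, 0, 0, 0, 1, 2],
    "air cleaner": [-1, 0, 1, 1, 1, 2, 3],
}


def median_change_by_group(data):
    """Return the median change for every group, keyed by group name."""
    return {group: median(values) for group, values in data.items()}


summary = median_change_by_group(changes)
for group, value in summary.items():
    print(f"{group:>12}: median change = {value:+.1f} symptom-free days")

# The evidence is the difference between the groups, not either number alone.
difference = summary["air cleaner"] - summary["control"]
print(f"air cleaner vs. control: {difference:+.1f} symptom-free days")
```

Reporting only the air cleaner group's number would leave the reader with nothing to compare it to; putting the control group next to it is what turns the number into evidence.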

Â 2:54

The second basic principle is to show causality or mechanism: to make an explanation and show some systematic structure in what's going on. The use of the word causality here is not supposed to be formal. Rather, the point is to show what you believe about how the world works. You need to be able to show how you believe the system is operating, so to speak. What's the causal framework for thinking about the question that you're interested in?

If I might extend the example from the previous slide: we saw that if you install an air cleaner in a child's home, then on average the child experiences an increase of one symptom-free day, so a better outcome in their asthma symptoms. Now we might ask, why does that occur? Why is it that installing an air cleaner in a child's home improves their symptoms? Well, we hypothesize that the air cleaner is cleaning the air. It's removing particulate matter from the air, so that particulate matter is no longer going into the child's lungs and no longer triggering their asthma symptoms. That's how we believe things work. So we can show a plot that might corroborate that hypothesis: another plot, which has the symptom-free days on the left-hand side and the particulate matter on the right-hand side. Now we can see the effect of the air cleaner on the particulate matter levels inside the child's home. You can see that for the control group, there was basically no change in the particulate matter levels, maybe a slight increase.

Â 4:24

But there was a pretty substantial decrease in particulate matter levels for the group of homes that got the air cleaner. So now we can see that, on average, not only did a child's symptom-free days increase when they got the air cleaner, but their indoor particulate matter levels also decreased. So we can think about the explanation for why the air cleaner seems to improve symptoms in children, and we can show, using the data that we observed, that it does seem to decrease their particulate matter levels. Now, of course, to really confirm the hypothesis that the air cleaner operates by reducing PM, we'd have to do a little more investigation and perhaps more experimentation. But this graphic suggests a possible explanation.

Â 5:14

The third principle that Tufte talks about is to show multivariate data. This rule can be boiled down to: show as much data on a single plot as you can. The reason is that the world is inherently multivariate; there are lots of things going on all the time. If you plot just 2 variables, or maybe even 3, it's not going to show the real picture of what's happening in the world. If you can put a lot of data on a plot, then you'll be able to tell a much richer story.

Here is another air pollution example, but now the data come from outdoor air pollution. On the x axis we have the day-to-day concentrations of PM10, particulate matter less than 10 microns in aerodynamic diameter. Every circle on this plot represents a daily concentration. On the y axis, we have the daily mortality in New York City. This is for the time period 1987 to 2000. You can see that overall there seems to be a slightly negative association between PM10 levels and mortality, because the regression line that I put through there is slightly downward sloping. That seems interesting, because one might hypothesize that higher air pollution levels would be associated with higher mortality, not lower mortality.

Â 6:33

But you can look at other variables. It's not just air pollution and mortality that are of interest here. There are other variables that may be part of the system and may confound this relationship. One of the things we can look at is how this relationship changes across different seasons. If you look at particulate matter and mortality in winter, spring, summer, and fall, what does that look like? We can make a different plot to show that relationship. This plot shows the relationship between PM10 and mortality. It's the same plot that we showed on the previous slide, but split across the seasons, so we see it separately for winter, spring, summer, and fall. And we can see that in each plot, the relationship is now slightly positive.

Â 7:22

So within each season, the relationship between PM and mortality is positive, but if you look overall, the relationship appears to be negative. This is an example of Simpson's paradox, if you may have heard of it. The basic idea is that the season in which we look at the relationship is confounding the relationship between PM10 and mortality. When we look at the relationship within each season, rather than across all seasons, it changes. So it's important to show as many variables as is reasonable at a given time, so that you can get a clear picture of the relationships in your data.
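
The sign flip between the overall slope and the within-season slopes can be reproduced with a small numerical sketch. This is plain Python with a hand-written least-squares slope, not the regression from the slides. The (PM10, deaths) pairs and the helper name `ols_slope` are invented for illustration: each season is constructed to have a slightly positive slope, while the seasons with higher PM10 have lower mortality overall.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (PM10 concentrations, daily deaths) per season.
# Invented for illustration; NOT the New York City data.
seasons = {
    "winter": ([10, 20, 30], [200, 202, 204]),
    "spring": ([30, 40, 50], [185, 187, 189]),
    "summer": ([50, 60, 70], [170, 172, 174]),
    "fall": ([25, 35, 45], [188, 190, 192]),
}

# Slope within each season, ignoring the other seasons.
season_slopes = {
    name: ols_slope(pm, deaths) for name, (pm, deaths) in seasons.items()
}

# Slope when all days are pooled and season is ignored.
all_pm = [x for pm, _ in seasons.values() for x in pm]
all_deaths = [y for _, deaths in seasons.values() for y in deaths]
overall_slope = ols_slope(all_pm, all_deaths)

for name, value in season_slopes.items():
    print(f"{name:>7}: slope = {value:+.2f}")
print(f"overall: slope = {overall_slope:+.2f}")
```

In this toy data, every within-season slope is positive and the pooled slope is negative, which is the same pattern as in the two plots: season is related to both PM10 and mortality, so leaving it out reverses the apparent association.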

Â 7:59

The fourth principle is to integrate the evidence that you have. The basic idea here is that you want to use as many different modes of displaying evidence as you can. If you have a tool that makes a plot, there's no reason to show only a plot, and if you only have the ability to make a table, there's no reason to show only a table. You should be able to combine different modes of evidence into a single presentation, to make your graphic, or whatever display you're making, as information rich as possible. The idea is not to let the tools that you use drive the kinds of plots that you make. You should make the plot that you want to make, and not just let the tools do the thinking. That's one of the nice advantages of a system like R: the tools are very flexible, and you can make all kinds of customized plots to show the data and to integrate different modes of evidence. This is just one very quick example from a published paper.

Â 8:56

It appeared in the Journal of the American Medical Association and looks at the relationship between coarse particulate matter and hospitalizations in the elderly. The details of this plot are not particularly important, but I just wanted to show that there are point estimates here, which are the solid circles, and confidence intervals, indicated by the lines going through the solid circles. But then on the right-hand side you see this label, posterior probability that the relative risk is greater than 0. This is a measure of the strength of the evidence

Â 9:27

that the association between coarse particulate matter and hospitalizations is, in fact, different from 0. So here we integrate the point estimates and intervals as dots and lines, and we also have text on the right that shows another piece of evidence: the strength of that evidence, as encoded by the posterior probability. You can use these kinds of tools to put lots of information on your plots, and not have to resort to putting information in different places where it may be difficult to track down.
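
The idea of putting dots, lines, and text in one display can be sketched even without a plotting system. What follows is a rough plain-text illustration in Python, not a reproduction of the published figure. The function names `to_column` and `forest_row`, the axis limits, and every number in `rows` are assumptions made up for this example, and the sketch assumes all values lie within the axis limits.

```python
def to_column(value, axis_min, axis_max, width):
    """Map a value in [axis_min, axis_max] to a character column 0..width-1."""
    fraction = (value - axis_min) / (axis_max - axis_min)
    return round(fraction * (width - 1))


def forest_row(label, estimate, lower, upper, probability,
               axis_min=-1.0, axis_max=2.0, width=31):
    """Render one row: label, interval with point estimate, then a text column."""
    cells = [" "] * width
    # Interval drawn as a line of dashes.
    first = to_column(lower, axis_min, axis_max, width)
    last = to_column(upper, axis_min, axis_max, width)
    for col in range(first, last + 1):
        cells[col] = "-"
    # Reference line at 0, drawn over the interval so a crossing is visible.
    cells[to_column(0.0, axis_min, axis_max, width)] = "|"
    # Point estimate drawn last so it is never hidden.
    cells[to_column(estimate, axis_min, axis_max, width)] = "o"
    return f"{label:<10}{''.join(cells)}  {probability:.2f}"


# Hypothetical rows: (label, estimate, lower, upper, posterior probability).
# Invented for illustration; NOT values from the published paper.
rows = [
    ("outcome A", 0.5, 0.0, 1.0, 0.97),
    ("outcome B", 0.2, -0.6, 1.0, 0.70),
]

for row in rows:
    print(forest_row(*row))
```

Each printed row carries three kinds of evidence at once: the estimate, its interval, and the probability as text, so the reader does not have to look up the probability in a separate table.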

Â 10:02

The fifth principle is to describe and document the evidence that you present, using labels, sources, and so on. In particular, if you're going to be making a plot with a system like R, it's important to preserve the computer code that made the plot. The idea is that you want to lend some credibility to the evidence that you present. The source of the data is very important, and how you made the plot is also important. That's a very basic principle, and it's important for your credibility.

The very last principle, of course, is that content is king. If you don't have an interesting story to tell, then there's no amount of presentation that will make it interesting. So when you're making plots, figures, and graphs, first think about the content that you're trying to present. What's the story you're trying to tell? What data do you have? Then think about the best way to present that. How am I going to present it, and what is it going to look like? Because if you don't have very good content, then there's really not much you're going to be able to do beyond that.

Just to quickly summarize, the 6 basic principles are as follows. First, show comparisons: always show something relative to something else. Second, show causality or mechanism, or at least try to explain how the system is working, how the world is working, according to your ideas. The third is to show multivariate data: always try to show more than 2 variables, because the world is complex and involves many variables.

Â 11:53

And you can read about it at his website, which I point to here. It's an excellent book, and I highly recommend it. So these are some basic principles for building analytic graphics, and in future lectures we'll talk about how to do that using the various plotting systems in R.
