A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation


From the lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay. Now let's take a look at some exercises intended to review some of the salient points of lectures A through D.

So what I'm going to have you do here: I'm going to show you some information and then give you some questions, and I'll suggest you pause this lecture and work the questions out on your own. Then hit play again, and I'll go through my answers, or my take on them, so you can compare.

So let's go back to this Philadelphia data that we looked at before. What I want to look at here is the distribution of the daily particulate matter, measured in micrograms per meter cubed, for all days between 1974 and 1988. This histogram shows the distribution of the TSP data. The mean of this sample of multiple days is 67.3 micrograms per meter cubed.

The standard deviation is 26.9 and the median is 63. Here's another representation in the box plot format of the same data.

Now let's also look at the death counts per day over this period from 1974 to 1988. So this is a histogram of the distribution of the number of deaths across these days, and the mean is 46.7 deaths. The standard deviation is 8.4 deaths, and the median is 46 deaths. And here is a box plot representation of these death data.

So, some questions I'd like you to think about, and then come back and see how your answers compare to what I've thought of. How would you characterize the distributions of the daily TSP (total suspended particulate) readings and the daily death counts from Philadelphia for 1974 to 1988? Based on the information at hand, can you give an estimate of the 25th and 75th percentiles for TSP?

Suppose you want to measure the association between death and TSP using these data. To start, perhaps you wish to create four categories of TSP based on the original continuous values. You want these four categories to have similar numbers of observations. Can you suggest a way to do this?

And finally, suppose you were only allowed to use 300 randomly selected TSP measurements from the total sample of values for the days from 1974 to 1988. How would you expect the histogram of these 300 values to compare to the original histogram presented, which contains over 4,000 values?

Okay, let's take a look at the questions I posed and my suggested takes on the answers. So I asked you, how would you characterize the distributions of the daily TSP values and the daily death counts for Philadelphia from 1974 to 1988? Well, let's look at TSP, total suspended particulate levels. First of all, we can see from these data, just numerical-summary-wise, that the sample mean of 67.3 micrograms per meter cubed is larger than the sample median of 63 micrograms per meter cubed. Furthermore, if we look at the histogram presentation or the box plot, it's pretty clear, in my opinion, that there is evidence of a right or positive skew in both pictures, but especially in the box plot. You can see there's a fair amount, or what seems to be a fair amount, of outlying values, and they're all larger than the rest of the data, which would be indicative of a positive or right skew. So I would say that, all things considered, this daily TSP distribution is pretty clearly a right-skewed or positively skewed distribution.

How 'bout for deaths? Well, this is a little more subtle, and we may have different opinions on this, and there's not exactly one right answer. If you look at a comparison of the sample mean and the median, the sample mean is larger than the median, albeit slightly: 46.7 deaths is the mean, versus 46 deaths, which is the median.

If we look at the histogram, there are probably several different opinions across your classmates, myself, and you. [LAUGH] Some people may say this is a relatively symmetric distribution.

If you look carefully, and it is very hard to see, especially because of the coloring and sizing, there is a bit of a right tail, but it's not nearly as evident visually as it was with the TSP data. So the histogram is a little bit of a mixed message; it depends on how much you can see and what your interpretation is. So answers anywhere from skewed to symmetric are reasonable here. I think in this case the box plot is a little more informative in terms of characterizing whether there's any skewness, because we can see that the middle 50%, the 25th to 75th percentiles, is relatively symmetric about the median. And the non-outlying values, largest and smallest, are relatively symmetric around the sides of the box. We do have some positive outliers, which makes it appear a little more right-skewed than symmetric. But certainly there are a bevy of opinions on this, and it's not as clear-cut as it was, in my opinion, with the TSP.

So then I asked you to give an estimate of the 25th and 75th percentiles of the TSP distribution. The only way to do that from what I've given you in these slides was to look at the box plot, and of course your estimates are going to be approximations based on the visual cue here, and it's very hard to see, given the size and detail of this. But we do know that on the box plot, the lower-valued side of the box in the middle, if you will, corresponds to the 25th percentile, and the upper side, or higher-valued side, of the box corresponds to the 75th percentile. And so if you're eyeballing this, and I'm not very good at doing that on this scale, it looks like the 25th percentile is about 50 micrograms per meter cubed and the 75th is on the order of maybe 80. If we actually wanted to answer this question unequivocally, we could appeal to the computer. If you actually did pull this from the computer, the 25th percentile for these data is slightly lower than what I had eyeballed: it's 47.5 micrograms per meter cubed, and the 75th percentile is 82 micrograms per meter cubed.

Suppose you want to measure the association between death and TSP using these data. To start, perhaps you wish to create four categories of TSP based on the original continuous values. You want these four categories to have similar numbers of observations. Can you suggest a way to do this?

Well, one possibility would be to take these continuous measures and break them into categories based on their percentiles. And if we wanted roughly equal numbers in the four categories, we would break this into four equal-sized percentile groups. So one way to do this would be to put these into what are called quartiles, categorizing them based on their position relative to the 25th, 50th, and 75th percentiles in these data. So, going back to what I talked about before, we know that the 25th percentile is 47.5, the median, or 50th percentile, is 63, and the 75th percentile is 82. So what we could do is categorize each of the individual TSP measurements by its membership in one of these four quartiles. So, for example, for days in which the value was less than or equal to 47.5 micrograms per meter cubed, we'd put them in category 1, the first quartile.

For days in which the values were greater than 47.5 micrograms per meter cubed but less than 63, we'd put them in the second quartile, et cetera. And so roughly 25% of the observations, or a quarter, would be in this first quartile; another 25%, those from the 25th to the 50th percentile, would be in the second quartile; et cetera.
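As a rough sketch of that categorization, here is one way it might look in Python. The three cut points are the percentiles quoted above; the function name `tsp_quartile`, the sample readings, and the choice to put a value that lands exactly on a cut point into the lower category are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Cut points quoted in the lecture: 25th, 50th, and 75th percentiles of TSP.
CUT_POINTS = [47.5, 63.0, 82.0]

def tsp_quartile(tsp):
    """Return 1-4, the quartile category for one daily TSP value.

    Assumption: a value exactly equal to a cut point goes in the lower category.
    """
    return bisect_left(CUT_POINTS, tsp) + 1

# Hypothetical daily readings, made up for illustration (not the Philadelphia data).
readings = [31.0, 47.5, 52.3, 63.0, 70.1, 82.0, 95.4, 140.2]
categories = [tsp_quartile(x) for x in readings]
counts = Counter(categories)

print(categories)  # one category label per day
print(counts)      # number of days in each category
```

With real data the cut points would come from the data themselves, so each category would hold roughly a quarter of the days.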

So let me ask you this. Let's go back and think of the TSP data, for example. Why does the mean tend to be larger than the median for samples of right-skewed data? Well, let's think about this. Let's think about a situation. Suppose we started with a distribution. Suppose we measured TSP incorrectly, and we graphed the values we had for these days, and it was roughly symmetric. I'm going to even make it bell-shaped here, and it looked very symmetric around its sample mean. And then suppose somebody came in and said, well, John, actually, these measurements over here are wrong.

You've actually gotten lesser values than you should've. I'm going to put in the proper values. And it stretched our tail out to here such that now we had a right-skewed distribution.

This value we had highlighted before corresponded to the mean; it also corresponded to the median when the data were symmetric. What happened to the relative position of the median when we redrew this to include the right tail? Well, the middle value stays the middle value, right? So it's hanging out there. However, the mean, which I actually haven't shown visually, is going to be affected by this increase in right-tail values. So the mean is actually going to be pulled further to the right, while the median remains unaffected. So another way to say this, without going to my visual, is that the mean tends to be larger than the median for samples of right-skewed data because the mean is more heavily influenced by the larger, positive values that occur in that sample.
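To see this with numbers, here is a tiny sketch using five made-up values (hypothetical, not the TSP data). The two largest values are replaced with larger ones, which stretches the right tail while leaving the middle value where it was.

```python
import statistics

# Hypothetical symmetric sample: mean and median coincide.
symmetric = [40, 50, 60, 70, 80]

# Same sample after "correcting" the two largest values upward,
# which stretches the right tail but leaves the middle value alone.
right_skewed = [40, 50, 60, 90, 160]

mean_before = statistics.mean(symmetric)       # 60
median_before = statistics.median(symmetric)   # 60
mean_after = statistics.mean(right_skewed)     # 400 / 5 = 80
median_after = statistics.median(right_skewed) # still 60

print(mean_before, median_before)
print(mean_after, median_after)
```

The median does not move because the middle observation is unchanged, while the mean is pulled toward the stretched tail.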

Now, finally, suppose you have categorized the TSP values into four quartiles, like I talked about, where category one has the values less than or equal to the 25th percentile, et cetera, like we detailed two slides back. You now want to compare the distributions of deaths across these different TSP categories. How could you do this visually, and how could you do this numerically?

This is hard to do, especially with this much data, but here are some possibilities. Stacked histograms: what I have here is the histograms of the total daily deaths by TSP quartile. I've done this through the computer. This histogram here on top is the distribution of death counts for days in the lowest TSP quartile. This second graph here is the distribution of death counts for days whose TSP values were in the second quartile, between the 25th and 50th percentile, et cetera. And it's hard to tell what's going on in these pictures. These histograms look relatively similar, but there's a lot of data in each and the scaling is small.

So another way to compare these head-on, that maybe gives a little more insight, is by looking at box plots of these death values side by side across the four TSP quartiles. So here is a box plot showing the distribution of death counts on the lowest TSP days, those in the first quartile. Here's a box plot showing the distribution for the second quartile, those between the 25th and 50th percentile of TSP. And these are the box plots of the distributions of deaths for the third quartile and the fourth.

So what do you see in these pictures? Well, of course pictures are subject to interpretation, but at least to some degree, you probably noticed the increased variability in the upper or largest values of deaths as we increase across the TSP quartiles. You can see a slight increase visually, perhaps, in the medians as well, but that's more difficult to tell.

Numerically, well, again, you'd certainly need a computer to do this, but some possibilities include comparing the medians of deaths across the TSP quartile groups, or comparing other percentiles, like the 95th percentile or the 15th percentile or something of that nature.

The one that is used so often in the literature, and that we're going to spend more time on in this course, is actually comparing the means.

So for example, I'm going to report the mean number of deaths in each of those four samples of death distributions, one for each TSP quartile category. In the first category of TSP, the lowest TSP days, those with values between the lowest value and the 25th percentile, the average number of deaths was 46.2. In the next quartile, the 25th to 50th percentile, the average number of deaths on those days was 45.9. In the third quartile, those with TSP values between the 50th and the 75th percentile, the average number of deaths was 46.6. And then finally, on the highest particulate matter days, those with values between the 75th percentile and the largest value in the data set, the average death count was 48.

So what we might do to present these comparisons and quantify the difference in values is choose one of these four groups as the reference, and then compute the difference between the mean for each of the other groups and that same reference, so that these differences are comparable. What we're driving towards here is how we might present these in a publication, how we might put uncertainty values on this, which we'll get to next, and how we might present the comparison after adjusting for other factors that differ between these days other than TSP. So it's just setting up as we go along in the course.

So one way to do this might be to declare the lowest TSP days as the reference, and then compute mean differences in the number of deaths between each of the other TSP day categories and this same reference. So difference number one might be the mean deaths in TSP category two, those days between the 25th and 50th percentile of TSP, minus the average deaths in the lowest category of TSP. That would be 45.9 minus 46.2, which is a mean difference of negative 0.3 deaths. So on average, those days between the 25th and 50th percentile of TSP had slightly fewer deaths, by about 0.3 deaths. We then do this for the third category compared to the same reference.

For some reason I have trouble writing S on that second round here. It would be 46.6 minus that same 46.2, that reference: 0.4 deaths. So it suggests that, on average, those days with TSP levels between the 50th and 75th percentile have 0.4 more deaths than days in the lowest TSP quartile. And then finally, if we compared TSP 4, the highest TSP levels, to the lowest, I won't write out the entire thing here, but the difference is 1.8: those highest particulate matter days had 1.8 more deaths on average than the lowest. These differences are all filtered through the same reference group, the lowest TSP, so these differences are comparable to each other as well. So for example, using these differences only, I could get the estimated difference in average number of deaths for the TSP 4 days, the highest TSP, compared to the third quartile, by taking that 1.8, which is the difference between TSP 4 and TSP 1, and subtracting that 0.4, which is the difference between TSP 3 and TSP 1. And what would fall out of that is the difference between the 4th quartile and the 3rd: 1.8 minus 0.4, or 1.4 deaths.
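The arithmetic here can be checked with a short sketch. The four group means are the ones quoted in the lecture; the variable names are just illustrative choices.

```python
# Mean daily deaths by TSP quartile, as quoted in the lecture.
mean_deaths = {1: 46.2, 2: 45.9, 3: 46.6, 4: 48.0}

REFERENCE = 1  # lowest TSP quartile is the reference group

# Mean difference for each non-reference group versus the reference.
diffs = {q: m - mean_deaths[REFERENCE]
         for q, m in mean_deaths.items() if q != REFERENCE}

for q, d in diffs.items():
    print(f"TSP {q} vs TSP {REFERENCE}: {d:+.1f} deaths")

# Because every difference uses the same reference, subtracting two of them
# recovers the direct comparison between those two groups.
q4_vs_q3_via_reference = diffs[4] - diffs[3]
q4_vs_q3_direct = mean_deaths[4] - mean_deaths[3]
print(f"TSP 4 vs TSP 3: {q4_vs_q3_via_reference:+.1f} deaths")
```

The values are rounded to one decimal place only for display, since floating-point subtraction can leave tiny trailing errors.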

So one last thing I wanted to ask you and review was this, and it's a question that comes up often, and you may have already asked it or heard somebody else ask it, and it's a reasonable question. Both the formula for the sample mean and the formula for the sample standard deviation include the sample size n in the denominator. Given that this is the case, why do neither the mean nor the standard deviation systematically decrease with increasing sample size?

So in other words, we know that the formula for the mean is the sum of all the values in our sample divided by the sample size. And you might say, well, John, the denominator here is increasing as our sample size gets larger, so why is this quotient not going down in value? Well, you have to remember something that's easy to forget: as the sample size increases, we're also increasing the number of things we add up in the numerator, so that's increasing as well with increased sample size. So this ratio isn't necessarily decreasing systematically, because both parts are going up. And the same logic applies to the estimated sample standard deviation.
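For reference, the two formulas under discussion can be written in standard notation as follows. The symbols are the usual textbook ones, and the n − 1 in the standard deviation is the common convention, assumed here rather than taken from this transcript.

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}
```

In both, the numerator is a sum of n terms, so it grows along with the denominator as n increases.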

I'll just write out the formula here, and I refer you to the better-looking typed versions in previous lectures. But if I do this, yes, as our sample size increases, the denominator of the ratio under the square root is increasing. But again, we're also increasing the number of squared differences we add in the numerator. So this ratio, in terms of sample size, is being kept in somewhat of a steady state, and it won't necessarily go down just because the denominator is getting larger, 'cause the numerator is increasing as well. Okay, well, hopefully you found these helpful and some things to think about as we move on in our quest for statistical domination.
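A quick simulation can illustrate the point. This is only a sketch: it draws samples of increasing size from a hypothetical normal population whose mean and standard deviation are set to the TSP summary values (67.3 and 26.9). The normal shape is an assumption made for simplicity, since the real TSP data are right-skewed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters, borrowed from the TSP summary.
POP_MEAN, POP_SD = 67.3, 26.9

def summarize(n):
    """Draw n values and return (sample mean, sample standard deviation)."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    return statistics.mean(sample), statistics.stdev(sample)

# Sample sizes chosen for illustration only.
results = {n: summarize(n) for n in (100, 1000, 4000)}

for n, (m, s) in results.items():
    print(f"n={n:5d}  mean={m:6.1f}  sd={s:6.1f}")
```

Both summaries should hover around the population values at every sample size, rather than shrinking as n grows.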
