A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

138 ratings


From the lesson

Module 2A: Summarization and Measurement

Module 2A consists of two lecture sets that cover measurement and summarization of continuous data outcomes for both single samples, and the comparison of two or more samples. Please see the posted learning objectives for these two lecture sets for more detail.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings, and welcome back! This is going to be our third lecture in the Statistical Reasoning One series, and today we're going to talk about a famous, some might say infamous, distribution called the normal distribution. Many of you have heard of the normal distribution. You may even be familiar with some of its key characteristics. It's bell shaped, it's symmetric around its center, and the tails die off quickly. In other words, most of the observations that are described by a normal distribution fall close to the center of the distribution. Now we're going to spend a little time trying to understand its properties. You might say, but why are we doing that? Is it because most data that we'll see in public health and medicine is normally distributed? And the answer is no, not necessarily. We'll see that for some types of continuous data, the normal distribution is a reasonable working model, and we can use its properties to better flesh out the distribution of the data in the population from which our sample is taken. But in other situations, these properties that are specific to the normal curve aren't going to get us much ground. However, when we focus on our next unit, statistical estimation, confidence regions, and inference, the normal distribution is going to prove invaluable. So far, and up to this point, including in this lecture set, we take our estimates from samples as is. That is, we look at a sample mean and we say this is our best estimate of some underlying population truth, and we know it may not be exactly equal to that unknown underlying truth. Well, in the next set of lectures after this, we're going to get into the idea of whether we can put uncertainty bounds on this estimate to get a range of possibilities for this unknown truth. And that's where the normal distribution is going to be invaluable.

So, there are three didactic sections to this lecture and then one set of practice problems. What we're going to do first is define the properties of the normal distribution, and show how we can define it perfectly just by knowing its center, or its mean, and the spread of the values under the distribution, the standard deviation. And there are some general rules about it. In the second section, section B, we're going to look at some data examples where the normal distribution is a reasonable model for the individual observations in the population from which our samples are taken, and show how we can exploit the properties to better understand the underlying population distribution. In section C we're going to see that, in many cases, the data we get in public health and medicine is not well described by this perfectly symmetric, theoretical distribution. And we'll see that if we actually apply the properties of the normal distribution to these data, we're going to end up with useless results. It's just to remind you that some things are only applicable under certain conditions. So in this first section, section A, we're going to actually define some characteristics of the normal curve, and hopefully upon completion of the lecture, you'll be able to describe the basic properties of the normal curve.

Describe how any normal distribution is completely defined by its mean and standard deviation. Recite what I'll call the 68-95-99.7% rule for the normal distribution, with regard to standard deviations.

And hopefully feel comfortable, or be on your way to feeling comfortable, working with standard normal tables. Now I'll be honest, I'm not going to require a lot of that in this course, and it's not a big focus, but it gives you some appreciation for how quickly observations fall in their likelihood under a normal curve the further you get away from the center.

Okay. So let's get started. The normal distribution is a theoretical probability distribution that is perfectly symmetric about its mean, and because it's perfectly symmetric about its mean, the mean, median, and mode are the same, and it has a bell-like shape. So frequently you also hear it referred to as the bell curve.

Where does this distribution come from? Who invented it? Well, the normal distribution is also called the Gaussian distribution in honor of its inventor, Carl Friedrich Gauss, and here's a picture of the man himself. For those of you who either lived in Germany or were in Germany pre-euro, you may recognize this picture, because Carl Gauss was featured on the Deutsche Mark. And so in the US, we're hoping to, you know, make the presidents move over and get some statisticians on dollar bills, and maybe you'll see me on the ten someday.

Normal distributions are uniquely defined by two quantities. If we know data comes from a normal distribution and we want to completely characterize the distribution of that data, all we need to know is its mean and standard deviation. I'll generically represent these with the symbols mu for the mean and sigma for the standard deviation, to imply population-level values. There are literally an infinite number of possible normal curves, one for every possible combination of the mean and standard deviation. So here I'm showing some pictures of curves that have different means and different standard deviations. You could keep adding to this ad infinitum, making them wider, skinnier, at different centers. But you'll notice that these three different examples I have here, and any other examples of a normal curve, would all have the same proportional structure. That is, they're centered about their mean, and evenly distributed about it.

Okay, so for this next slide, I'm just showing you this, not to scare you off of math. Many of you like math and are comfortable with it. But if you're not comfortable with it, don't worry about this. I just want to show you sort of the beauty of mathematics, and I get to have the opportunity to have it over my shoulder, which is always a nice perk. But I want to show you that, for any given value under a normal curve, the height of the curve at that value, that is, how likely we are to observe values near that number, is described by this function here. And this function is sort of a math major's dream in some sense. It's got all kinds of symbols and notation in it. Two of the symbols I want to point out are the pi symbol, which actually represents a constant, a number roughly 3.14, and also the e, which also represents a constant, the base of the natural logarithm, roughly 2.718.

So, once we deal with those constants, the only other two symbols in here are the mu and sigma, and the only reason I'm showing this equation to you is to make you appreciate that this curve is completely specified. We can figure out where a particular value falls under the curve only by knowing that value and the mean and standard deviation of the distribution it comes from.
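The slide itself is not reproduced in this transcript. For reference, the function being described is presumably the standard normal density, written here with the lecture's mu and sigma (the letters x and f are notation assumed for this note):

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
```

Apart from the value x and the constants pi and e, the only quantities that appear are mu and sigma, which is the point being made about the curve being completely specified.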

So again, all normal distributions, regardless of the mean and standard deviation values, have the same structural properties: mean equals median equals mode, the values are symmetrically distributed about the mean, and values closer to the mean are more frequent, or likely, than values further from the mean. Again, I've said this before, but the entire distribution of values described by a normal distribution can be completely specified by knowing just the mean and standard deviation. Since all normal distributions have the same structural properties, we can use a reference distribution called the standard normal distribution to elaborate on some of these properties. We'll define the standard normal distribution in a minute, and in section B we'll show that any normal distribution, with any mean and standard deviation, can easily be scaled to this reference distribution.

So, here's the first one. This is just something you are going to have to memorize. These are the only characteristics of the normal curve that I want you to take to heart, and you can always look them up in a table, but hopefully you will be able to internalize them pretty quickly. So, I'm just telling you, and I'll show you where this comes from, but if I'm dealing with a normal distribution, regardless of the mean and the standard deviation, if I'm standing at the mean, in the center, and I go one standard deviation in either direction from that center, I encapsulate 68% of the observations under that curve. So this shaded red area here is 68% of the entire curve.

There are several ways to actually state this. We could say that for data whose distribution is approximately normal, 68% of the observations fall within one standard deviation of the mean. We could also say the same thing, just rephrased in terms of a probability: the probability that any randomly selected value is within one standard deviation of the mean is 0.68, or 68%. Those are two ways of saying the same thing.

Let's get to the second part of this rule. This is one you may be familiar with: 95% of the observations under a normal curve fall within two standard deviations of the mean. Truthfully, it's 1.96. Computers will use that number, and you can look it up in a table, but for quick and dirty, back-of-the-envelope computations, it is absolutely fine to use two. So if we're starting at the mean of a normal curve, and we go two standard deviations above and two standard deviations below, we'll capture 95% of the curve.

And if we actually go three standard deviations from that center, we'll capture 99.7% of the observations. So almost all the values that follow a normal distribution fall within three standard deviations of the center of that distribution.
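As a quick check on the three numbers in the rule, here is a minimal Python sketch. It relies on the fact that, for any normal distribution, the proportion within k standard deviations of the mean equals erf(k / sqrt(2)); the function name `prob_within` is made up for this illustration.

```python
import math


def prob_within(k: float) -> float:
    """Proportion of any normal distribution within k SDs of its mean."""
    # P(|X - mu| < k * sigma) = erf(k / sqrt(2)), whatever mu and sigma are.
    return math.erf(k / math.sqrt(2))


# 1, 2 and 3 are the rule-of-thumb cutoffs; 1.96 is the exact 95% cutoff.
for k in (1, 1.96, 2, 3):
    print(f"within {k} SD of the mean: {100 * prob_within(k):.2f}%")
```

Running it shows why two is a reasonable stand-in for 1.96 in quick calculations: the two cutoffs capture nearly the same share of the curve.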

Okay, so let's just consider this for a moment. If we say that 95% of the observations fall within two standard deviations of the mean (again, it's really 1.96, but we'll work with two), what would that mean about the proportion of observations that are more than two standard deviations above the mean? Let's think about this. Can we use the logic of the normal curve and its symmetry? We're encapsulating 95% in the middle, this red area. The entire area of the curve would be 100%, so that means what we haven't covered in this middle territory encapsulates a total of 5% of the observations. Right? And because the curve is symmetric, that 5% that we have not captured in the middle 95% should distribute itself equally on both sides. So, the proportion of observations that are more than two standard deviations above the mean is half of 5%, or 2.5%. Similarly, the proportion of observations under a normal curve that fall more than two standard deviations below the mean is also 2.5%.

So just to recap what we've done with the number two standard deviations in reference to the normal curve, we've said that the middle 95% of values that follow a normal distribution fall within two standard deviations of the mean. They fall within the interval from the mean minus two standard deviations to the mean plus two standard deviations. If we were randomly sampling data points from data that followed a normal distribution, the probability of getting a value in this interval would be 95%. So what does this mean in terms of percentiles? Well, let's look at this. The lower end point here is the point that 2.5% of the values under the distribution are smaller than, and hence 97.5% are greater than. So, the 2.5th percentile of the normal curve is equal to the mean minus two standard deviations. Conversely, 97.5% of the values are smaller than, and 2.5% are greater than, this upper end point of mu plus two standard deviations. So, this upper end point is the 97.5th percentile of the normal curve.
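The symmetry argument can be written out as a short piece of arithmetic. X here stands for a value drawn from a normal distribution with mean mu and standard deviation sigma (the letter X is notation assumed for this note), and two is used in place of 1.96 as in the lecture:

```latex
\begin{aligned}
P(\mu - 2\sigma < X < \mu + 2\sigma) &\approx 0.95 \\
P(X > \mu + 2\sigma) = P(X < \mu - 2\sigma) &\approx \frac{1 - 0.95}{2} = 0.025 \\
P(X < \mu + 2\sigma) &\approx 0.95 + 0.025 = 0.975
\end{aligned}
```

The last two lines are the percentile statements: mu minus two sigma is roughly the 2.5th percentile, and mu plus two sigma is roughly the 97.5th.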

If we were to actually line up, in order from smallest to largest, all the values that follow a normal distribution and pick off, empirically, the 2.5th and 97.5th percentiles, they would closely correspond to these estimates based only on the mean and standard deviation.
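The comparison between empirical and rule-based percentiles can be tried on simulated data. The sketch below uses Python's standard `statistics` module; the mean of 120 and standard deviation of 15 are hypothetical values chosen only for illustration, not numbers from the lecture.

```python
from statistics import NormalDist, quantiles

# Hypothetical population values, chosen only for illustration.
mu, sigma = 120.0, 15.0
dist = NormalDist(mu, sigma)

# Rule-of-thumb percentiles from the lecture: mean -/+ 2 standard deviations.
approx_low, approx_high = mu - 2 * sigma, mu + 2 * sigma

# Theoretical 2.5th and 97.5th percentiles (these correspond to 1.96, not 2).
exact_low, exact_high = dist.inv_cdf(0.025), dist.inv_cdf(0.975)

# Empirical percentiles: simulate values, then pick off the cut points.
sample = dist.samples(100_000, seed=1)
cuts = quantiles(sample, n=40)  # 39 cut points at 2.5%, 5%, ..., 97.5%
emp_low, emp_high = cuts[0], cuts[-1]

print(f"rule of thumb: {approx_low:.1f} to {approx_high:.1f}")
print(f"theoretical:   {exact_low:.1f} to {exact_high:.1f}")
print(f"empirical:     {emp_low:.1f} to {emp_high:.1f}")
```

The three intervals should land close to one another, with the rule-of-thumb interval slightly wider because it uses two rather than 1.96.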

Let's again look at the 68% part for a minute. We know that 68% of the observations in a normal distribution are in the interval within one standard deviation of the mean, 68% here. Okay. So let's just think about this for a minute. What percentage of the observations following a normal distribution are more than one standard deviation above the mean? This can also be phrased as, what is the probability that an individual observation from a normal distribution is more than one standard deviation above the mean? Well, what are we talking about here? We're talking about this area here. Let's just see if we can figure that out using the logic of the normal curve. Well, we know from the rule I've given you that 68% fall in that red area within one standard deviation. So the total outside of that red area, on either side, is 100% minus 68%, which is 32%. But of course, by symmetry of the normal curve, those two tails here, which in total contain 32% of the distribution, will contain it equally. So, in each of these tails, roughly 32% divided by 2, or 16% of the observations, fall. So 16% of the observations that follow a normal distribution are more than one standard deviation above the mean. We could also ask what percentage of observations in a normal distribution fall more than one standard deviation away from the mean in either direction, either above the mean or below the mean. We've sort of already answered that: the percentage beyond one standard deviation in either direction is that 100% minus 68%, or 32%. So where did this rule come from? Did I just make this up?

No. In other words, how did I know these relationships? Okay, well, it turns out that there are tables that exist for this. Actually figuring this out would be difficult to do with that formula I presented you before, because it would require integrating over ranges of the data. So it's nice that people have come up with tables for us to look at. And you might say, well, that's great, John, this rule is useful, but what about other percentages under the curve, for other standard deviation distances from the mean, you know, not just one, two, or three? Well, all the information I quoted and much more can be found in what's called the standard normal table. So here is an example of the standard normal table. This is just maybe the greatest hits of a table, just to get you thinking about it. The tables present themselves in different ways, depending on where you find one, and we'll speak more to this in a minute. But you'll notice I've got three columns here. Most tables will only include one of these descriptions, but we've shown already that by the logic of the normal curve, its symmetry, et cetera, if we're given one piece of information about a number of standard deviations and an area under the curve, we can figure out the other bits by employing that logic. So this table here actually has three columns. Most tables won't be so ornate. This first column just shows you what percentage falls within Z standard deviations of the mean, where Z is this column here. So for example, if we're looking at one standard deviation, we can see that 68%, I mean I rounded it in my lecture, but it's really 68.3%, fall within one standard deviation of the mean. Another way of saying the same thing is, if we were to go to one standard deviation above the mean and look at the percentage of observations that are greater than that, it would be that 16% that we showed before.
And if we were to actually look at the percentage that are outside of the middle one standard deviation range, it would be that 32% we showed before. Similarly, you could check this for different numbers. You could see that for two standard deviations, well, truthfully it's 95.5% that fall within two standard deviations, and 1.96 is the cutoff for 95%. But when we work back-of-the-envelope calculations, you could think of two as the number that cuts off 95% in the middle and a total of 5% outside of that range.
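A three-column table like the one described can be rebuilt with a few lines of Python. This is only a sketch using the standard library's `NormalDist`; the column labels and the function name `table_row` are made up here and are not taken from the slide.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def table_row(z: float) -> tuple[float, float, float]:
    """Return (within, above, outside) proportions for z standard deviations."""
    within = Z.cdf(z) - Z.cdf(-z)  # between -z and +z
    above = 1 - Z.cdf(z)           # more than z SDs above the mean
    outside = 1 - within           # beyond z SDs in either direction
    return within, above, outside


print(f"{'z':>5} {'within':>9} {'above':>9} {'outside':>9}")
for z in (1.0, 1.96, 2.0, 3.0):
    within, above, outside = table_row(z)
    print(f"{z:>5.2f} {100 * within:>8.2f}% {100 * above:>8.2f}% {100 * outside:>8.2f}%")
```

Each row illustrates the point made about the columns: given any one of the three entries, symmetry and the fact that the total is 100% determine the other two.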

So, you know, you might say, well, where do I find one of these standard normal tables in case I need to do this? Well, reminder, we're in an online course, which means you have access to what? [LAUGH] The internet. And if you type in standard normal table on the internet, you can get multiple hits. You can even find calculators where you can plug in a number of standard deviations of interest and it will tell you something about the curve. And so I'm just going to show you two examples of tables just to work the logic of these. You can also find these in the back of any statistical textbook. But there are many ways to tell the story of this same curve, and so you have to pay heed to what a particular table is telling you. So this is one I found when I searched on standard normal tables. This is one of the hits I got. Here is the URL. Hopefully it's still working by the time you look at this lecture, but if not, there are multiple other ones. Clearly you can't see this on the slide, so I'm going to zoom in a little bit, move over to the side, and just show you what it's telling you with reference to the values in the table. You have to pay attention to the fine print: for any given number of standard deviations, what this table is going to tell us is the percentage of the observations that fall between the mean and that number of standard deviations above the mean.

So it's not telling us about the full range within that, only part of what we looked at before. Well, we'll see if we can use that to map to numbers we're comfortable with. So for example, if I go to this table, let's just see what it tells us about some of the numbers we know. Let's go with 1.96, just to be exact when we're looking at this table. So the way to follow this is, you see it's got this column here called Z, and this goes in tenths, intervals of one tenth of a number. And then there's this other column that goes in hundredths. And the way to piece this together is that, for the number we're looking for, 1.96, we're going to look for the value 1.9 in this column here, and then where it intersects the value of 0.06 in the column over here. So if we look at this, if we go to 1.96, and I'm just going to circle this and highlight it, the value we're given here for 1.96 is 0.4750. So let's see if that makes sense. Remember, 1.96 is the number we say cuts off 95% in the middle. Does this information jibe with what I've told you? Well, let's see what we're looking at. What are we looking at with this number?

We are looking at, here's a normal curve, sorry about the slant there, and it's telling us that within a normal curve, if we go 1.96 standard deviations above the mean, or you can think of it as two standard deviations, that cuts off 0.475, or 47.5%, of that curve. Does that jibe with what we've said? Well, let's think about this. By the symmetry of the normal curve, if we actually go 1.96 standard deviations below the mean, what should that area be? That should also be 0.475, or 47.5%, and the sum of these two is 0.95, or 95%. Okay, so if we have this one piece of information about the upper half, encapsulated by going that far above the mean, we have the story for the rest of the curve. And now we can also figure out, you know, that the remaining 5% is equally distributed between the two tails, so this would be 2.5%. And you can do this for any value. And we'll look at some other values in our next lecture set.

And then here's another exhibit, here's another table we got, okay, just by searching the interwebs. And you'll see, you have to pay attention to what the table's telling you. For a given standard deviation value, it's telling you something slightly different than the previous table. Instead of telling you how much falls between that value and the mean, it tells you what percentage of the curve, or values, are below that number of standard deviations away from the mean. Okay? So let's see if we can use this. So here's a first snippet, but it's still kind of hard to read. So, let's cut to this, just to give you an example. From this table here, we have that Z column.

So, this is similar to the previous table, and then there's the hundredths unit here, so we can get down to the second decimal place in terms of standard deviations. So, I'm just blowing this up here to look, if we wanted to look at the story of one standard deviation. I'm just showing you a piece of the table where the Z column is at negative 1, because this only has negative values in it, and the hundredths column is 0. So, what is this telling us? It tells us that if we are under a normal curve, and here's the mean, and we are at 1 standard deviation below the mean, then the percentage of observations that are further away in the negative direction, that is, more than one standard deviation below the mean, is 15.87%, which we'll round to 16%. And so once we have that, we have the entire story of one standard deviation, right? We know by symmetry that if we went one standard deviation above the mean, we'd also get 16%, so the total area in these two portions is 32%, which must mean that what's in the middle is 100% minus 32%, or 68%. Okay, so let's just think about this for a minute.
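The two table conventions walked through above can be mimicked in Python to see how one converts to the other. This is a sketch under assumptions: the function names `mean_to_z_area` and `below_z_area` are invented for this illustration, and the values come from the standard library's `NormalDist` rather than from the tables shown on the slides.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def mean_to_z_area(z: float) -> float:
    """First table's convention: area between the mean and z SDs above it."""
    return Z.cdf(z) - 0.5


def below_z_area(z: float) -> float:
    """Second table's convention: area below z SDs from the mean."""
    return Z.cdf(z)


# First table: look up 1.96, then double it by symmetry to get the middle.
half = mean_to_z_area(1.96)
middle_95 = 2 * half
upper_tail = (1 - middle_95) / 2

# Second table: look up -1.00, then subtract both tails from the total.
lower_tail = below_z_area(-1.0)
middle_68 = 1 - 2 * lower_tail

print(f"mean to 1.96 SD: {half:.4f}, middle: {middle_95:.4f}, each tail: {upper_tail:.4f}")
print(f"below -1.00 SD:  {lower_tail:.4f}, middle: {middle_68:.4f}")
```

Whichever convention a table uses, one lookup plus symmetry is enough to recover the middle area and both tails.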

What have we covered here? We've defined the normal curve, showing that it's symmetric and bell shaped. We've shown that it can be completely defined by knowing its mean and standard deviation, and that most of the observations that follow a normal distribution fall within two standard deviations of the center. Although the tails go on infinitely, the majority of the data, 95%, is encapsulated within that range. We've also looked at how to use a table to find these respective ranges and cutoffs, and we'll do some more examples of that in the subsequent portions.
