0:03

So, that's the binomial distribution. Let's talk about the most famous and probably the handiest of all distributions, the so-called normal, or Gaussian, distribution. The term Gaussian comes from the great mathematician Gauss. And it's kind of interesting to note that Gauss didn't invent the normal distribution. The invention of the normal distribution is a debated topic. For example, Bernoulli had used something not unlike the Gaussian distribution as a probabilistic inequality, without formalizing it as a density. If you're interested in this, the book by Stephen Stigler on the history of statistics has a nice summary of exactly where and when and with whom the Gaussian distribution originated. But it's clear that Gauss was instrumental in its early development and use.

So, a random variable is said to follow a normal, or Gaussian, distribution with parameters mu and sigma squared if its density looks like this: two pi sigma squared to the minus one-half, times e to the negative x minus mu squared over two sigma squared. This density looks like a bell, and it's centered at mu; sigma squared controls how flat or peaked it is. And it turns out that mu is exactly the mean of this distribution and sigma squared is exactly its variance. So, you only need two parameters, a shift parameter and a scale parameter, to characterize a normal distribution.
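Just to tie the formula to something concrete, here's a minimal Python sketch of that density (my illustration, not part of the lecture):

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    # Density of N(mu, sigma^2): (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))
    return (2 * math.pi * sigma2) ** -0.5 * math.exp(-((x - mu) ** 2) / (2 * sigma2))

# The density is symmetric about mu and peaks there:
print(normal_pdf(0.0))                      # standard normal at its peak, about 0.3989
print(normal_pdf(6.0, mu=5.0, sigma2=4.0))  # same value as normal_pdf(4.0, mu=5.0, sigma2=4.0)
```

Note how mu only moves the curve and sigma squared only stretches it; that's the shift and scale roles described above.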

So, we might write X, then this little squiggle (a tilde), then N(mu, sigma squared), as shorthand for saying that a random variable follows a normal distribution with mean mu and variance sigma squared. And, in fact, one instance of the normal distribution is sort of the root instance from which all the others are derived, and that's the one with mu equal to zero and sigma equal to one. We call that the standard normal distribution. It's centered at zero, its variance is one, and all other normal distributions are simple shifts and rescalings of it. Then again, you could pick a different root, maybe mu equal to five and sigma equal to two; you could still get every other normal distribution from that one by shifting and scaling appropriately, but it wouldn't be as convenient. Mu equal to zero and sigma equal to one is the most convenient way to define a sort of root of the normal distribution. The standard normal density is so common that we often reserve Greek letters for it: the lower case phi we usually use for the standard normal density, and the upper case Phi we use for the standard normal distribution function. Standard normal random variables are often labeled with a z, and you sometimes even hear introductory statistics textbooks refer to them as z-variables or something like that, because this notation has become so common.

Here's the normal distribution. It looks like a bell; that's how it gets its name, the bell-shaped curve. Here I've drawn reference lines at one, two, and three standard deviations, with the negatives below the mean and the positives above. Because this is a standard normal distribution, the mean is zero, so one represents one standard deviation away from the mean, two is two standard deviations away, and three is three standard deviations away. Instead of thinking of these numbers as just z values, think about them in the units of the original data, as one, two, and three standard deviations from the mean; then it doesn't matter whether we're talking about a standard normal or a nonstandard normal, they all follow the same rules. So, about 68 percent of the distribution is going to lie within one standard deviation, and about 95 percent is going to lie within two standard deviations, i.e., between minus two and plus two.

And almost all of the distribution, about 99.7 percent of it, is going to lie within three standard deviations. We can get from a nonstandard normal to a

standard normal very easily. So, if X is normal with mean mu and variance sigma squared, then Z equal to X minus mu over sigma is, in fact, standard normal. Now, given the information from this class, you can check immediately that Z has the right mean and variance. If you take the expected value of Z, you get the expected value of X minus mu, divided by sigma. You can pull the sigma out, and then you have the expected value of X minus mu, which is the expected value of X minus the expected value of mu. Mu is not random, so that second term is just mu, and mu is defined as the expected value of X, so the whole thing is zero. The same thing works with the variance. If you take the variance of Z, you get the variance of X minus mu, divided by sigma. If we pull the sigma out of the variance, it becomes a sigma squared, and we have the variance of X minus mu. And we learned a rule for variances: shifting a random variable by a constant, in this case subtracting out mu, doesn't change the variance at all. So, we get the variance of X divided by sigma squared. The variance of X is sigma squared, so we get sigma squared divided by sigma squared, which is one. So, at the bare minimum, we can check that Z has mean zero and variance one.
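If you like, you can also see this empirically by simulation (an illustration of mine, not from the lecture):

```python
import random
import statistics

random.seed(42)
mu, sigma = 5.0, 2.0

# Draw from a nonstandard normal, then standardize each draw
xs = [random.gauss(mu, sigma) for _ in range(100_000)]
zs = [(x - mu) / sigma for x in xs]

print(statistics.mean(zs))      # close to 0
print(statistics.variance(zs))  # close to 1
```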

By the way, there was nothing intrinsic to the normal distribution in that calculation. So, we've also just learned an interesting fact: take any random variable, subtract off its population mean, and divide by its standard deviation, and the result is a random variable that has mean zero and variance one. In this case, in addition, if X happens to be normal, then Z also happens to be normal. Similarly, we can just take this equation, where Z equals X minus mu over sigma, multiply by sigma, and then add mu. If we take a standard normal, say Z, scale it by sigma and then shift it by mu, we wind up with a nonstandard normal.

You know, the top calculation takes a nonstandard normal and converts it into a standard normal, and the bottom equation starts with a standard normal and converts it into a nonstandard normal. Another interesting fact is that the nonstandard normal density can be obtained by just plugging into the standard normal density. If you take the standard normal density phi and, instead of plugging in z, you plug in x minus mu over sigma, and then divide the whole thing by sigma, that is exactly the nonstandard normal density. Here, mu is a shift parameter: all mu does is shift the distribution to the left or the right, just like whenever you subtract a constant from the argument of a mathematical function, it moves the function left or right. And then, sigma is a scale factor.
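To make that shift-and-scale identity concrete, here's a small numerical check (my own sketch in Python):

```python
import math

def phi(z):
    # Standard normal density
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def normal_pdf(x, mu, sigma):
    # N(mu, sigma^2) density written out directly
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

mu, sigma = 5.0, 2.0
for x in (-1.0, 3.7, 5.0, 9.2):
    # phi((x - mu) / sigma) / sigma reproduces the nonstandard density exactly
    assert abs(phi((x - mu) / sigma) / sigma - normal_pdf(x, mu, sigma)) < 1e-12
print("location-scale identity checks out")
```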

And so, basically, whenever you take some density, and it makes the most sense to do this with a density with mean zero and variance one, you can create a new family by plugging in x minus mu over sigma and then dividing the density by sigma. You wind up with a new family of densities that have mean mu and variance sigma squared. So, this is an interesting way of taking a root density with mean zero and variance one and creating a whole family of densities with mean mu and variance sigma squared; these are usually called location-scale families. At any rate, we're only interested in this case in the normal distribution, and this formula right here is exactly how you can go from the standard normal density to a nonstandard normal density by plugging into its formula. Let's just talk about some basic facts about the normal distribution that you should memorize.

So, about 68 percent, 95 percent, and 99.7 percent of the normal density lies within one, two, and three standard deviations of the mean, respectively, and it's symmetric about mu.
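Those percentages are easy to verify numerically; here's a quick check (my own Python sketch, using the error function from the standard library):

```python
import math

def Phi(z):
    # Standard normal distribution function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (1, 2, 3):
    # Probability of lying within k standard deviations of the mean
    print(k, round(Phi(k) - Phi(-k), 4))
# roughly 0.6827, 0.9545, and 0.9973
```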

So, for example, take one standard deviation. About 34 percent, one-half of the 68 percent, lies within one standard deviation above the mean, and about 34 percent lies within one standard deviation below the mean. Each of these numbers splits equally above and below the mean. And then, there are certain quantiles of the normal distribution that are common to have memorized.

So, -1.28, -1.645, -1.96, and -2.33 are the 10th, 5th, 2.5th, and 1st percentiles of the standard normal distribution. And then, by symmetry, we can just flip them around: if -1.28 is the 10th percentile, then 1.28 has to be the 90th percentile. So, by symmetry, 1.28, 1.645, 1.96, and 2.33 are the 90th, 95th, 97.5th, and 99th percentiles of the standard normal distribution.

One in specific I want to point out, that you really need to memorize, is 1.96. The reason it's useful is that if you take -1.96 and +1.96, the probability of lying outside of that range, below -1.96 or above +1.96, is five percent: 2.5 percent below and 2.5 percent above. So, the probability of lying between -1.96 and +1.96 is 95 percent. At any rate, it's used to do things like create confidence intervals and other entities that are very useful in statistics. People have stuck with 95 percent as a reasonable benchmark for confidence intervals, and five percent as a reasonable cutoff for a statistical test; if you're doing a two-sided test, you need to account for both sides, and so you use 1.96. The other fact is that 1.96 is close enough to two that we often just round up. So, a lot of times with things like confidence intervals, you might hear people say, we'll just add and subtract two standard errors; they're getting that two from this 1.96 right here. So anyway, that one in specific you should memorize, but you should probably just memorize all of them.
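For reference, all of these quantiles can be recovered numerically; here's a rough sketch (mine, in Python, inverting Phi by bisection rather than using any built-in quantile function):

```python
import math

def Phi(z):
    # Standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_quantile(p):
    # Invert Phi by bisection; crude but plenty accurate here
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

for p in (0.90, 0.95, 0.975, 0.99):
    print(p, round(z_quantile(p), 3))
# roughly 1.282, 1.645, 1.960, and 2.326
```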

Let's go through some simple examples. We'll go through two, and after that, you should be able to do lots of these on your own.

So, let's take an example. What's the 95th percentile of a normal distribution with mean mu and variance sigma squared? Recall what we want to solve for when we want a percentile: we want the point x-naught such that the probability that a random variable X from that distribution is less than or equal to x-naught turns out to be 95 percent, or 0.95. Now, it's kind of hard to work with nonstandard normals, so, starting from the probability that X is less than or equal to x-naught being 0.95, why don't we subtract mu from both sides of this inequality and divide both sides by sigma? On the left-hand side of the inequality, X minus mu over sigma is just a standard normal random variable. So, the probability that X is less than or equal to x-naught is the same as the probability that a standard normal is less than or equal to x-naught minus mu over sigma, and we want that to be 0.95. Well, if you go back to my previous slide, the 95th percentile of the standard normal is 1.645. So, we just need this number, x-naught minus mu over sigma, to be equal to 1.645 to make this equation work. Set it equal to 1.645 and solve for x-naught, and we get x-naught equals mu plus sigma times 1.645.
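That last formula, x-naught equals mu plus sigma times the standard normal quantile, can be sketched in code (my illustration; the mu of 100 and sigma of 15 below are made-up numbers):

```python
import math

def Phi(z):
    # Standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def quantile(p, mu=0.0, sigma=1.0):
    # Find the standard normal quantile z0 by bisection, then shift and scale:
    # x0 = mu + sigma * z0
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return mu + sigma * (lo + hi) / 2.0

# Hypothetical N(100, 15^2) population (assumed numbers)
print(round(quantile(0.95, mu=100.0, sigma=15.0), 1))  # about 124.7
```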

So now, you could ask lots of questions with specific values of mu and sigma, but you'll wind up with the same exact calculation. Here, we used 1.645 because we wanted the 95th percentile, but in general, x-naught is going to be equal to mu plus sigma times z-naught, where z-naught is the appropriate standard normal quantile. And then you can get them very easily. The other thing I would mention is that you should be able to do these calculations, more than anything, so that you've internalized what quantiles of distributions are, how to go back and forth between standard and nonstandard normals, and the idea of location-scale densities, and that sort of thing. In reality and practice, it's pretty easy to get these quantiles because, for example, in R you would just type qnorm(0.95) and give it a mean and a standard deviation. Or if you did qnorm(0.95) without a mean and a standard deviation, it'll return 1.645 and you can do the remainder of the calculation yourself, but even that's a little bit obnoxious, so you can just plug in a mu and a sigma. So, these calculations aren't so necessary from a practical point of view; even very rudimentary calculators will give you normal quantiles, nonstandard normal quantiles.

The hope is that you'll understand the probability manipulations, what a quantile means, what the goals of these problems are, and how to go back and forth between the standard and nonstandard normal. That's what we're going for here. I think everyone agrees that you can very easily just look these things up without having to bother with any of these calculations. Let's go with another easy calculation.

What's the probability that a normal mu, sigma squared random variable is more than two standard deviations above the mean? In other words, we want to know the probability that X is greater than mu plus two sigma. Well, again, do the same trick, where we subtract off mu and divide by sigma on both sides, and we get the answer: it's the probability that a standard normal is bigger than two. And that's about 2.5 percent.
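Here's that calculation as a quick numerical check (my sketch, not lecture code):

```python
import math

def Phi(z):
    # Standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(X > mu + 2*sigma) = P(Z > 2), for any normal, by standardizing
print(round(1.0 - Phi(2.0), 4))  # about 0.0228
```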

And so you can see the rule here. If you want to know the probability that a random variable is bigger than any specific number, smaller than any specific number, or between any two numbers, take those numbers and convert them into standard deviations from the mean. That can, of course, be fractional; it could be 1.12 standard deviations from the mean, or whatever. The way you do that is by subtracting off mu and dividing by sigma, and then you've reduced it to a standard normal calculation. So, suppose you wanted to know the probability that a random variable is bigger than, say, 3.1, just to pick a random complicated-sounding number. Let's suppose you're talking about the height of a kid and you want to know the probability of being taller than 3.1 feet. What you would need is the population mean mu and the standard deviation sigma: take 3.1, subtract off mu, and divide by sigma. Now you've converted that quantity 3.1, which is in feet, to standard deviation units.
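Putting made-up numbers on that, since the lecture doesn't fix a particular mu and sigma (both values below are assumptions of mine):

```python
import math

def Phi(z):
    # Standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical population of kids' heights: mu = 2.8 ft, sigma = 0.25 ft (assumed numbers)
mu, sigma = 2.8, 0.25
z = (3.1 - mu) / sigma          # 3.1 feet converted to standard deviation units
print(round(z, 2))              # 1.2 standard deviations above the mean
print(round(1.0 - Phi(z), 3))   # P(height > 3.1 ft), about 0.115
```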

And then you can just do the remainder of the calculation using the standard normal. So, I would hope that you familiarize yourself with these calculations. I recognize that, in a sense, they're kind of ridiculous to do because you can get them from the computer so quickly, and we'll give you the R code that you need to do these calculations very quickly on the computer. But I think it's actually worth doing them by hand, just to get used to working with densities and to what these calculations refer to.

So, let me just catalog some properties of the normal distribution. A lot is known about the normal distribution, and I'll outline some of the simpler stuff; some of the later points we probably won't get to in this class, but I thought I'd at least state them. So, at any rate, the normal distribution is symmetric and peaked about its mean, which means that the population mean, the median, and the mode of the distribution are all equal, right at that peak.

A constant times a normally distributed random variable is also normally distributed. And you can tell me what happens to the mean and the variance: if X is a normal random variable, and I tell you that a times X is normal, what's the resulting mean and variance? It turns out that sums of normally distributed random variables are again normally distributed, and this is true regardless of the dependence structure of the data, so long as the random variables are jointly normally distributed. It's important that they are jointly normally distributed. They could be independent, or they could be dependent, but they need to be jointly normally distributed. Sums, or any linear function, of jointly normal random variables turn out to be normally distributed, and again, you can calculate the mean and the variance. Sample means of normally distributed random variables are again normally distributed. Again, this is true regardless of whether they're jointly normal and possibly dependent, or simply a bunch of independent normal random variables; this is true of sample means.

However, let me just jump to point seven. It also turns out that if you have independent, identically distributed observations, then properly normalized sample means will have a distribution that looks Gaussian, not exactly but pretty much, regardless of the underlying distribution that the data come from. So, take as an example rolling a die. If you look at what the distribution of a single die roll looks like, it doesn't look Gaussian at all; it looks like a uniform distribution on the numbers one to six. Now, take a die, roll it ten times, take the average, repeat that process over and over again, and think about the distribution of this average of die rolls. Well, it turns out it'll look quite Gaussian.
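That die-rolling experiment is easy to simulate (my own sketch, not from the lecture):

```python
import random
import statistics

random.seed(7)

# Average of ten die rolls, repeated many times
means = [statistics.mean(random.randint(1, 6) for _ in range(10))
         for _ in range(50_000)]

print(round(statistics.mean(means), 2))   # close to 3.5, the mean of one roll
print(round(statistics.stdev(means), 2))  # close to sqrt((35/12)/10), about 0.54
```

A histogram of `means` would show the familiar bell shape, even though each individual roll is uniform.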

It'll look very normal. At any rate, that's the rule: sample means of random variables, properly normalized, with some conditions that we're probably going to gloss over, will limit to a normal distribution. And that's how the normal distribution became the sort of Swiss army knife of distributions: pretty much anything you can relate back to a mean of independent things tends to look normal-ish in distribution. And mathematically, formally, if the observations are independent and identically distributed, and you normalize the mean in the correct way, then you get exactly the standard normal distribution in the limit. That is an incredibly useful result, and a very historically important one, called the central limit theorem. So, let's see, back to point five.

If you take a standard normal and square it, you wind up with something that's called a chi-squared distribution; you might have heard of that before. And if you take a standard or a nonstandard normally distributed random variable and exponentiate it, take e to the X where X is normal, then you wind up with something that's log-normal. The log-normal is a bit of a pain in terms of its name: log-normal means that if you take the log of a log-normal, it becomes normal. It does not mean the log of a normal random variable. It's a little annoying, right? And you can't take the log of a normal random variable, by the way, because there's a nonzero probability that it's negative, and you can't take the log of a negative number. The name makes it sound like a log-normal is the log of a normal. It's not. Log-normal means: take my log and I'm normal.

Okay. Let's talk about ML properties associated with normal random variables. Suppose you have a bunch of IID normal mu, sigma squared random variables, and let's assume you know the variance, so we can ignore it for the moment. Then the likelihood associated with mu is written right here: you just take the product of the likelihoods for each of the individual observations. And so, you wind up with the product of two pi sigma squared to the minus one-half, times e to the negative xi minus mu squared over two sigma squared.

If you move that product into the exponent, you get e to the minus summation, i equals one to n, of xi minus mu squared over two sigma squared. Remember, we're assuming that the variance is known, so the factor of two pi sigma squared to the minus n over two that you would have gotten, we can just throw out, because the likelihood doesn't care about factors of proportionality that don't depend on mu, mu being the parameter we're interested in. By the way, this little symbol right here, the proportional-to symbol, is what I mean: it means I've dropped out things that are not related to mu. And I'll try to use that symbol carefully, where it's contextually obvious which variable I'm considering important.

Okay, so let's just expand out this square, and in the exponent you get minus summation xi squared over two sigma squared, plus mu summation xi over sigma squared, minus n mu squared over two sigma squared. Now, the first term, negative summation xi squared over two sigma squared, doesn't depend on mu either, so we can throw it out too: it's e to that power times e to the latter two powers, so the first part is a multiplicative factor that we can just chuck. The other thing is that it's a little annoying to write summation xi, so why don't we write that as n x bar? Because if you take x bar, the sample average, and multiply it by n, you get the sum. Okay, so the likelihood works out to be e to the mu n x bar over sigma squared minus n mu squared over two sigma squared.

So, that's the likelihood. Let's ask ourselves, what's the ML estimate of mu when sigma squared is known? Well, as we almost always do, since the likelihood is kind of annoying to work with, why don't we work with the log likelihood? Taking the log from the previous page, we get mu n x bar over sigma squared minus n mu squared over two sigma squared. If you differentiate this with respect to mu and set the derivative to zero, you wind up with an equation that is clearly solved by mu equal to x bar, and so it tells us that x bar is the ML estimate of mu. So, if your data are normally distributed, your estimate of the population mean is the sample mean. That makes a lot of sense; we would hope the result would work out that way. But also notice that because this calculation didn't depend on sigma, this is also the ML estimate when sigma is unknown. It's not just the ML estimate when sigma is known.
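You can also see this numerically: the expression below is the log likelihood from the slide with the constants dropped, and the sample mean beats any other candidate value of mu (a sketch of mine with simulated data):

```python
import random
import statistics

random.seed(11)
sigma2 = 4.0  # assume the variance is known
xs = [random.gauss(3.0, sigma2 ** 0.5) for _ in range(500)]
n, xbar = len(xs), statistics.mean(xs)

def log_lik(mu):
    # Log likelihood up to a constant: mu * n * xbar / sigma^2 - n * mu^2 / (2 * sigma^2)
    return mu * n * xbar / sigma2 - n * mu ** 2 / (2.0 * sigma2)

# The sample mean maximizes the log likelihood over nearby candidates
candidates = [xbar + d for d in (-1.0, -0.1, 0.0, 0.1, 1.0)]
best = max(candidates, key=log_lik)
print(best == xbar)  # True
```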

So, we know what our ML estimate of mu is. Let me just tell you what the ML estimate of sigma squared is. The ML estimate of sigma squared works out to be summation xi minus x bar squared, over n. You might recognize this as the sample variance, but instead of our standard trick of dividing by n minus one, we're now dividing by n. It's a little frustrating that there's this kind of mixed message: the maximum likelihood estimate of sigma squared is the so-called biased estimate of the variance, rather than the unbiased one where you divide by n minus one.
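The relationship between the two estimates is just a constant factor; here's a check on a small made-up sample (my illustration):

```python
# The two variance estimates differ only by the factor (n - 1) / n
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
n = len(xs)
xbar = sum(xs) / n

mle = sum((x - xbar) ** 2 for x in xs) / n             # divide by n: the ML estimate
unbiased = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # divide by n - 1: the usual estimate

print(round(mle / unbiased, 3))  # (n - 1) / n = 0.875 here
```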

Now, notice that as n increases, this is irrelevant: the factor that distinguishes the two estimates is n minus one over n, and that factor goes to one as n gets larger and larger. So, I've had several colleagues tell me that they would actually prefer this maximum likelihood estimate. Their argument is something along the lines of: the n minus one estimate is unbiased, but this one has a lower variance. And what they mean is that this is the biased version of the sample variance; it's a function of random variables, so it is itself a random variable, and as a random variable it has a mean and a variance. The fact that its mean is not exactly sigma squared means that it's biased. But its variance is slightly smaller than the variance of the unbiased version of the sample variance.

And so, this is an example of something that pops up all the time in statistics: you can trade bias versus variance. In this case, one estimate of the variance is slightly biased but has a lower variance; the other is unbiased, but the estimate itself has a larger variance. It's very frequent in statistics that you have this kind of trade-off: as you increase the bias, you tend to decrease the variance, and vice versa. So, the other thing I wanted to mention

was that here, we've kind of separated out inference for mu and inference for sigma. If you wanted to do full likelihood inference, then you have a bivariate likelihood, a likelihood that depends on both mu and sigma. It's a little bit difficult to visualize, but it's just a surface: with mu on one axis, sigma on another axis, and the likelihood on the vertical axis, you get a likelihood surface instead of a likelihood function. And it's a little hard to visualize these 3D-looking things, so there are methods for getting rid of sigma and looking at just the likelihood associated with mu, and for getting rid of mu and looking at just the likelihood for sigma, and later on we'll discuss those methods. But for the time being, it's not terribly important.

What I would hope you'll remember is this: if you assume that your data are normally distributed, we gave you the likelihood for mu when sigma is known; we calculated that the ML estimate of mu is, in fact, x bar; and the ML estimate of sigma squared is pretty much the sample variance, off by a little bit from the standard sample variance, but pretty much the sample variance. And then the ML estimate of sigma, not sigma squared but sigma itself, is just the square root of the ML estimate of sigma squared.

Well, that's the end of our whirlwind tour of probably the two most important distributions. There are some other ones that we'll cover later. Next lecture, we're going to travel to a place called Asymptopia. Everything's much nicer in Asymptopia, and I think you'll quite like it there.

Â