Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Hypothesis Testing

In this module, you'll get an introduction to hypothesis testing, a core concept in statistics. We'll cover hypothesis testing for basic one and two group settings as well as power. After you've watched the videos and tried the homework, take a stab at the quiz.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so as a bit of an aside, since we're talking about paired data, I thought I'd talk for a minute about regression to the mean. It's a historically famous topic, and it involves one of the eminent characters in the discipline of statistics: Francis Galton.

Francis Galton was a cousin of Charles Darwin who invented quite a few topics in statistics, and he was the first to recognize this phenomenon: when you have matched data, high initial observations tend to be associated with lower second observations, and low initial observations tend to be associated with higher second observations.

The example he gave was that the sons of very tall fathers tended to be a little bit shorter. Still tall, but a little shorter. And, not paradoxically but seemingly paradoxically, the fathers of very tall sons also tended to be a little bit shorter.

As an example from what we were talking about today: second exam scores for those who scored very high on the first exam tended to be a little lower, whereas first exam scores for those who scored very high on the second exam also tended to be a little lower. So let's talk a little about why this phenomenon occurs.

The reason this occurs is as follows. Imagine the tests were completely random, with the students' scores being i.i.d. draws from some distribution. Then the highest observations on test one would just be random observations, and the probability of a second observation being that high is quite low; the second score is more likely to be near the center of the distribution. Conversely, for a very low first test, something that had a very low probability of occurring, the probability of a second test being that low is also small. So if the pairs of observations are exactly noise, you'll get a lot of regression to the mean. Let's consider the other extreme.

Imagine the test was a perfect adjudicator of students' abilities, a perfectly calibrated instrument with no noise. Then each student should get exactly the same score on both exams, at which point there would be no variation around the identity line on this plot of test 1 by test 2.

Those are the two extremes. One is complete variation with no trend at all; the other is 100 percent correlation, all trend, where the test is a perfect instrument. Of course, every practical case lies somewhere in between.
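These two extremes can be illustrated with a small simulation. This is a sketch, not part of the lecture: the sample size, seed, and score distribution are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Extreme 1: the tests are pure noise. Test 1 and test 2 scores are
# independent draws, so a high test-1 score tells us nothing about
# test 2 and regression to the mean is maximal.
test1_noise = rng.normal(size=n)
test2_noise = rng.normal(size=n)
r_noise = np.corrcoef(test1_noise, test2_noise)[0, 1]  # near 0

# Extreme 2: a perfectly calibrated, noiseless instrument. Every
# student gets the same score both times, the points fall on the
# identity line, and there is no regression to the mean.
ability = rng.normal(size=n)
r_perfect = np.corrcoef(ability, ability)[0, 1]  # exactly 1
```

Real paired data sits between these extremes, with a correlation strictly between 0 and 1.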

As an example, of the eight people who got below 80 on test one, all but one did better on test two; and of the five people who got above a 95 on test one, three did worse on the second test. So it's not a tremendous amount of regression to the mean, but some is certainly there. Here I draw the identity line.

Okay, so let's discuss this phenomenon a little more. We're going to assume the data have been normalized. What does that mean? It means we want the data to have mean zero and variance one.

In this case the mean of the first test was 87 and the mean of the second test was about 90, and the standard deviations of the first and second tests were both about six. So for every first exam score we would have subtracted 87 and divided by 6, and for every second exam score we would have subtracted 90 and divided by 6. Of course we would have done this with the exact numbers, not just the rounded numbers I'm quoting.

In that case the empirical mean for test one would be zero and the empirical standard deviation for test one would be one, and likewise the empirical mean for test two would be zero and its empirical standard deviation would be one. That gets rid of any shift effects or scale effects.
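As a sketch of this normalization step in Python (the scores below are hypothetical, chosen only to roughly resemble the means of 87 and 90 and standard deviations of about 6 mentioned above):

```python
import numpy as np

# Hypothetical paired exam scores (not the lecture's actual data).
test1 = np.array([78.0, 83.0, 85.0, 87.0, 89.0, 91.0, 94.0, 97.0])
test2 = np.array([82.0, 86.0, 89.0, 90.0, 91.0, 92.0, 94.0, 96.0])

def normalize(x):
    # Subtract the empirical mean and divide by the empirical SD,
    # leaving data with mean 0 and standard deviation 1.
    return (x - x.mean()) / x.std(ddof=1)

z1 = normalize(test1)
z2 = normalize(test2)
```

After this, `z1` and `z2` each have empirical mean 0 and standard deviation 1, which removes the shift and scale effects just described.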

Now, when we were talking about testing with paired data, we were looking exactly at shifts in the mean; here we've gotten rid of the mean information by recentering the data at zero. If there were exactly no regression to the mean, the normalized data would fall perfectly on the identity line. The more scatter there is about that line, the more regression to the mean there is.

The best fitting line goes through the average, and since we normalized our data, it goes through the point (0, 0). Its slope, as I wrote out here, is the correlation between test one and test two times the ratio of the standard deviation of test two to the standard deviation of test one. Since we normalized both tests, the standard deviations are exactly one, so the best fitting line goes through the origin with slope equal to the correlation of test one and test two.
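A quick numerical check of that slope identity, using hypothetical scores (assumed data for illustration, not the lecture's):

```python
import numpy as np

# Hypothetical paired scores for illustration.
test1 = np.array([78.0, 83.0, 85.0, 87.0, 89.0, 91.0, 94.0, 97.0])
test2 = np.array([82.0, 86.0, 89.0, 90.0, 91.0, 92.0, 94.0, 96.0])

r = np.corrcoef(test1, test2)[0, 1]
sd1 = test1.std(ddof=1)
sd2 = test2.std(ddof=1)

# On the raw scale the least-squares slope is r * sd2 / sd1 ...
slope_raw = np.polyfit(test1, test2, 1)[0]

# ... and on the normalized scale (both SDs equal 1) it is just r.
z1 = (test1 - test1.mean()) / sd1
z2 = (test2 - test2.mean()) / sd2
slope_norm = np.polyfit(z1, z2, 1)[0]
```

The fitted slope on normalized data matches the correlation, as the lecture states.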

Okay, so just rehashing something from the previous slide: the best fitting regression line has slope equal to the correlation of Test 1 and Test 2, which in general is going to be less than one. (In the case where it's exactly one, there isn't much statistics left to do.) So the line is shrunk toward a horizontal line, telling us that our expected normalized test two score will be this correlation times the normalized test one score.

So if the correlation is 0.95, then your predicted normalized test 2 score will be 0.95 times your normalized test 1 score.

This line appropriately adjusts for regression to the mean for test 2 conditioning on test 1, or equivalently for test 1 conditioning on test 2: if we knew your normalized test 2 score and wanted to guess your normalized test 1 score, we would multiply by the same correlation.
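As a sketch, with a hypothetical correlation of 0.95 as in the example above, the prediction in either direction is just multiplication by the correlation:

```python
# Hypothetical correlation between the two normalized tests.
r = 0.95

def predict_test2_from_test1(z1):
    # Shrink the normalized test-1 score toward the mean (0)
    # by the correlation to predict the normalized test-2 score.
    return r * z1

def predict_test1_from_test2(z2):
    # The prediction in the other direction multiplies by the
    # same correlation; the 1/r slope only appears when both
    # lines are drawn on the same test-1-by-test-2 axes.
    return r * z2

# A student 2 SDs above the mean on test 1 is predicted to be
# about 1.9 SDs above the mean on test 2.
pred = predict_test2_from_test1(2.0)
```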

On our plot, where test two is the vertical axis and test one is the horizontal axis, the line for predicting test two has slope equal to the correlation of test one and test two. If we wanted the line going in the other direction, for predicting test one from test two, its slope on this plot would be the inverse of that correlation. That may not be obvious, but the slope would be the inverse.

So, just to rehash: in either case, if you want to adjust for regression to the mean, you take the normalized test score you have and multiply it by that correlation to predict the one you want.

But let me show you a plot; this will probably make it a little easier. Here I have the normalized test 1 score on the horizontal axis and the normalized test 2 score on the vertical axis, and the slope of this line here is the correlation between the two, which is about 0.21. That's the regression line for predicting test two from test one. I show an identity line here in the middle.

If you wanted to do the same thing in the other direction, predicting test one, you would use this almost vertical-looking line here, which has slope equal to the inverse of that correlation. Notice both lines pass through the point (0, 0).

Notice how flat this line is, the one for predicting our test 2 score from our test 1 score. It's very flat, suggesting there wasn't a lot of correlation between the two tests. That amount of noise means there's a fair amount of regression to the mean, so when we predict our test two score from our test one score, we're shrinking it quite a bit. You can see the same thing in the other direction: the adjustment for regression to the mean becomes very vertical, and that's because the correlation was quite low, because there was quite a bit of noise.

Notice that if the points were to fall more and more on the identity line, they would collapse around it, these two crosshair lines would converge to the identity line themselves, and that would mean there was very little regression to the mean. Anyway, it's a neat historical story that Francis Galton figured this out, and his discovery of this phenomenon led to the far more advanced topic of regression.

But let's make some final comments about this. For an ideal examiner, and I guess I'm not one judging from this test I gave, there would be little difference between the identity line and the fitted regression line; the more unrelated the two exam scores are, the more pronounced the regression to the mean.

This is something I heard in a talk once: the question of how much of our discussion of sports is really discussion of regression to the mean. The idea is that there's a lot of noise in teams' or players' performances, and the ones that do the best in any given year do so through a combination of being better plus random variation. If that random variation is very high, then there's a good chance that the following year, or the following season, they'll have a far lower performance in whatever statistic or measure you're talking about.

So for example in baseball, a popular sport in the US, if someone has a particularly high batting average early in the season, there's a good chance they'll regress to the mean in the latter half of the season. The reason is that there's a component of performance that is inherent, the player's skill, and then there's a component that's random variation; and the larger the amount of random variation, the more regression to the mean you'll see.

In that case, a lot of discussion about sports, I think, actually amounts to discussing regression to the mean. Sometimes the players that have huge rebounds were simply unlucky at first and got luckier later on; and the players that did extremely well and then did a lot worse may not have changed much in intrinsic ability. They were simply lucky early on and less lucky later on.
