Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Techniques

This module is a bit of a hodge podge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

So here's another example, one that was on Wikipedia, and it's a wonderful example.

It's kind of a famous example in this area, and it shows how this can happen numerically.

So this is about batting averages in American baseball.

If you've never seen or heard of baseball, you know, look it up.

The player goes up with a bat and the pitcher throws the ball really hard.

It's really quite difficult to hit a baseball, especially in the professional leagues, where they throw the ball a hundred miles an hour.

So the player swings the bat and tries to hit the ball.

Basically, though not exactly, the percentage of the time that the player hits the ball is their so-called batting average, right?

And really good players can do this, let's say, 30% of the time.

Those are excellent players; most players are worse than that.

Okay.

So here are two players and their batting averages.

Player 1 had 10 at bats in the first half of the season and got 4 hits.

Player 2 had 100 at bats and got 35 hits.

So player 1's batting average was 40%, and player 2's batting average was 35%.

In the second half of the season, player 1 got 25 hits out of 100 at bats, a 25% batting average.

In the second half of the season, player 2 got 2 hits out of 10 at bats, a 20% batting average.

So in both the first half and the second half of the season, player 1 had a better batting average than player 2, right?

If you just add up these numbers, you get 29 hits out of 110 at bats for the whole season for player 1 and 37 hits out of 110 at bats for the whole season for player 2.

That's 26% for player 1 and 34% for player 2, so player 2 has a better batting average.

So it seems paradoxical that a person can have a better batting average for both the first half and the second half of the season, but have a worse batting average overall.

But of course, the numbers work out, right? You can see it.

I put in "consider the number of at bats" here, because that is what's coming into play: player 1 had a very good batting average when they had relatively few at bats and a modest batting average when they had lots of at bats, and vice versa for the other player.

So that's, I think, really the culprit in this case.
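
The arithmetic in the baseball example can be checked directly. The following is a minimal sketch in R using only the hits and at bats quoted in the lecture; the object names (hits, at_bats, by_half, overall) are illustrative choices, not something from the slides.

```r
# Hits and at bats from the lecture's example, by player and half of season.
hits <- matrix(c(4, 25,
                 35, 2),
               nrow = 2, byrow = TRUE,
               dimnames = list(player = c("player1", "player2"),
                               half = c("first", "second")))
at_bats <- matrix(c(10, 100,
                    100, 10),
                  nrow = 2, byrow = TRUE,
                  dimnames = dimnames(hits))

# Batting average within each half: player 1 is ahead in both columns.
by_half <- hits / at_bats
print(round(by_half, 3))

# Batting average over the whole season: the ordering reverses.
overall <- rowSums(hits) / rowSums(at_bats)
print(round(overall, 3))
```

Player 1's season average is dominated by the 100 at bats in the weaker half, while player 2's is dominated by the 100 at bats in the stronger half, which is why the pooled comparison flips.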

Okay. Another very famous example of Simpson's Paradox is the so-called Berkeley admissions data.

And it's nice that it's in R, so I'll cover it a little bit, and you can explore it yourself.

To get it, you can just do help(UCBAdmissions), and that will describe the data set.

data(UCBAdmissions) will load it up, and then here I give a little command, apply(UCBAdmissions, c(1, 2), sum), which gets the appropriate margin.

So here we looked at admitted versus rejected by gender, and we get that the acceptance rate was higher for males than it was for females, disregarding everything else.
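
The commands described here can be sketched as follows. UCBAdmissions ships with base R as a three-way table (Admit by Gender by Dept); the conversion to proportions with prop.table is an added step for illustration and may not match the slide exactly.

```r
data(UCBAdmissions)  # three-way table: Admit x Gender x Dept

# Sum over departments to get the Admit x Gender margin.
marginal <- apply(UCBAdmissions, c(1, 2), sum)
print(marginal)

# Proportion admitted and rejected within each gender (columns sum to 1).
marginal_rates <- prop.table(marginal, margin = 2)
print(round(marginal_rates, 3))
```

Ignoring department, the admitted proportion in the Male column comes out higher than in the Female column, which is the marginal association the lecture refers to.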

Okay.

But I give another command here, and it now shows the admission rates.

I'm not showing the counts because at the end it's getting a little bit on the [UNKNOWN].

So I'm showing admission rates by department: Departments A, B, C, D, E, and F.

And then you can see that for Department A, males got admitted a lower percentage of the time; for Department B, males got admitted a lower percentage of the time; for Department C, males got admitted a slightly larger percentage of the time; lower for D, slightly larger for E, and lower for F.

So clearly the evidence of gender imbalance in admissions is dependent on whether or not you're conditioning on department.
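
The transcript does not record the exact command used for the by-department rates, so the following is only one way to get them, assuming prop.table over the Gender and Dept margins; the names rates_by_dept and male_lower are illustrative.

```r
data(UCBAdmissions)

# Proportion admitted within each Gender x Dept cell
# (margins 2 and 3 of the table are Gender and Dept).
rates_by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
print(round(rates_by_dept, 2))

# Departments where the male admission rate is lower than the female rate.
male_lower <- rates_by_dept["Male", ] < rates_by_dept["Female", ]
print(male_lower)
```

The male rate is lower in departments A, B, D, and F and slightly higher in C and E, matching the pattern read off the slide.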

And there are different application rates by gender for each of the departments.

In fact, you don't even have to explore it yourself; I apparently have it here on the next slide: gender, male and female, by department.

At any rate, you can explore this a little bit more, because the data is just in R.

I don't think you need any other packages or data installed; it's just data(UCBAdmissions), and you can play around with it.
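
As a starting point for that exploration, here is a hedged sketch of how the application counts by gender and department might be pulled out of the same table. The object names are illustrative, and the department-level admission rate at the end is an added summary, not something shown in the transcript.

```r
data(UCBAdmissions)

# Sum over admission status to count applicants by Gender x Dept.
applicants <- apply(UCBAdmissions, c(2, 3), sum)
print(applicants)

# Share of each gender's applications going to each department (rows sum to 1).
application_share <- prop.table(applicants, margin = 1)
print(round(application_share, 2))

# Overall admission rate of each department, ignoring gender.
dept_totals <- apply(UCBAdmissions, c(1, 3), sum)
dept_rate <- prop.table(dept_totals, margin = 2)["Admitted", ]
print(round(dept_rate, 2))
```

Comparing application_share with dept_rate shows whether one gender applied disproportionately to departments with low admission rates.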

So let me talk a little bit about what in the world is going on here, because it always seems confusing.

To me, there are a couple of things that help me understand Simpson's Paradox.

The first thing is the math: there's no problem with the math, right?

Say that a over b is less than c over d, and e over f is less than g over h, but a plus e over b plus f is greater than c plus g over d plus h, where b has to be greater than a, f has to be greater than e, and so on.

If you can find integers that satisfy these inequalities, then you've found an example of Simpson's Paradox; you just have to put the context around it.

But it doesn't seem at all paradoxical when you state it as a couple of relationships between integers, right? Then it just doesn't seem very paradoxical anymore.
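
Written out, the relationships between integers look like the following. The slide's equations are not in the transcript, so the third (reversed) inequality is reconstructed from the baseball example, where the pooled comparison flips; the assignment of player 2 to the left-hand side is a choice made here so that the inequalities point the same way as in the spoken version.

```latex
\[
\frac{a}{b} < \frac{c}{d}, \qquad
\frac{e}{f} < \frac{g}{h}, \qquad
\text{but} \qquad
\frac{a + e}{b + f} > \frac{c + g}{d + h}.
\]
% Baseball example, player 2 on the left and player 1 on the right:
\[
\frac{35}{100} < \frac{4}{10}, \qquad
\frac{2}{10} < \frac{25}{100}, \qquad
\text{but} \qquad
\frac{35 + 2}{100 + 10} = \frac{37}{110} > \frac{29}{110} = \frac{4 + 25}{10 + 100}.
\]
```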

It's the context that adds the paradox.

From a statistical standpoint, it says the apparent relationship between two variables can change in the light or absence of a third, which again doesn't sound that paradoxical.

It only seems paradoxical when we conflate the probabilistic statements, and the evidence for them in the data, with the causal statement, right?

So the problem is that we are trying to get at the causal truth by virtue of the probabilistic associations obtained from the data, and that's quite a hard thing.

The question in all of these cases is, what's the right answer? What should you condition on or not condition on?

That's a hard problem, and we're not really going to cover it in this class.

To me, the real answer is that it's quite hard to figure out exactly when you've conditioned enough.

In some cases no conditioning is exactly the right answer, and in some cases conditioning is exactly the right answer.

To really handle this formally, you have to do something called causal inference, basically, and that's a sub-discipline of statistics that is entirely designed towards addressing this question in a formal manner.

In the meantime, what can you do? The idea is to not decouple the statistics from the scientific discussion.

So in the case of the death penalty, you would want to have a discussion about hypotheses for the causal mechanisms behind the various associations.

You know, it doesn't make sense to be conditioning on the race of the victim.

In the Berkeley admissions data, you would want to talk about, well, are there very different acceptance rates by department?

Are there different application rates by gender to each department?

And, say, are women applying to departments that are harder to get into? That would explain the marginal association quite well.

And is that really the driver?

And that's a discussion that's informed by the statistics but is, in a sense, extra-statistical.

At this point, I think this kind of interplay between the data and a scientific discussion is the best solution I can offer you for dealing with confounding.

When you take further statistics courses, you can learn some of the formal mechanisms for trying to account for confounding.

But suffice it to say, it's one of the harder issues in statistics, knowing how to balance over-adjustment with under-adjustment in terms of confounding.

It is one of the central problems in observational data analysis.

And it's what makes observational data analysis so hard compared to, say, a study where you randomize treatment or something like that.
