0:12

In 1972, as part of a study on gender discrimination, 48 male bank

supervisors were each given the same personnel file, and asked to judge whether

the person should be promoted to a branch manager job that was described as routine.

The files were identical except that half of the supervisors had files showing

the person was male while the other half had files showing the person was female.

It was randomly determined which supervisors got male applications and

which got female applications.

Of the 48 files reviewed, 35 were promoted.

The study is testing whether females are unfairly discriminated against.

Let's take a look at the data.

The percentage of males promoted is 21 out of 24, roughly 88%.

And the percentage of females promoted is 14 out of 24, roughly 58%.

So there's a considerable difference between the proportions of males and

females promoted in this study.
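
To make the arithmetic concrete, here is a small sketch that recomputes the two promotion rates and their difference from the counts quoted above. The variable names are only illustrative.

```python
# Counts as reported in the study: 21 of 24 male files and
# 14 of 24 female files were recommended for promotion.
promoted = {"male": 21, "female": 14}
group_size = 24

rates = {group: count / group_size for group, count in promoted.items()}
observed_diff = rates["male"] - rates["female"]

print(f"male: {rates['male']:.1%}, female: {rates['female']:.1%}")
print(f"difference: {observed_diff:.1%}")
# prints:
# male: 87.5%, female: 58.3%
# difference: 29.2%
```

So the roughly 30% difference we keep referring to is, more precisely, about 29.2 percentage points.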

1:10

There are two possible explanations as to what might be going on in this study.

And these are our two competing claims.

One, there is nothing going on.

Promotion and gender are independent.

There's no gender discrimination, and

the observed difference in proportions is simply due to chance.

This is our null hypothesis.

And two, there is something going on.

Promotion and gender are dependent on each other.

There is gender discrimination, and

the observed difference in proportions is not due to chance.

This is the alternative hypothesis.
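
One way to write these two claims in symbols is sketched below. The notation is an assumption for illustration and is not from the lecture: p_M and p_F stand for the promotion probabilities for male and female files. The one-sided alternative is also an assumption, based on the study asking whether females are discriminated against.

```latex
\begin{align*}
H_0 &: p_M - p_F = 0 \quad \text{(promotion and gender are independent)} \\
H_A &: p_M - p_F > 0 \quad \text{(promotion depends on gender, to the disadvantage of females)}
\end{align*}
```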

2:08

If the data were likely to have occurred under the assumption that the null

hypothesis were true, then we would fail to reject the null hypothesis, and

state that the evidence is not sufficient to suggest that the defendant is guilty.

Note that when this happens, the jury returns with a verdict of not guilty.

The jury does not say the defendant is innocent,

just that there is not enough evidence to convict.

The defendant may in fact be innocent but the jury has no way of being sure.

Said statistically, we fail to reject the null hypothesis.

We never declare the null hypothesis to be true,

because we do not know and cannot prove whether it's true or not.

Therefore, we also never say that we would accept the null hypothesis.

If the data were very unlikely to have occurred, then the evidence raises

more than a reasonable doubt in our minds about the null hypothesis, and hence

we reject the null hypothesis in favor of the alternative hypothesis of guilty.

In a trial, the burden of proof is on the prosecution.

In a hypothesis test, the burden of proof is on the unusual claim.

The null hypothesis is the ordinary state of affairs, the status quo.

So it's the alternative hypothesis that we must consider unusual, and for

which we must gather evidence.

3:30

So to recap,

we start with a null hypothesis that represents the status quo.

We also have an alternative hypothesis that represents our research question,

in other words, what we're testing for.

We conduct a hypothesis test under the assumption that the null hypothesis is

true, either via simulation or using theoretical methods.

If the test results suggest that the data do not provide convincing evidence for

the alternative hypothesis, we stick with the null hypothesis.

If they do, then we reject the null hypothesis in favor of the alternative.

4:05

So if you have a deck of playing cards handy,

you can actually conduct the simulation yourself with me.

Remember, the objective is to conduct a simulation under the assumption that

the null hypothesis is true.

In other words, assuming there is no gender discrimination,

and that any differences in promotion rates that are observed

are simply due to chance.

First, we're going to let a face card represent a not-promoted file, and

a non-face card represent a promoted file.

We're going to start by setting aside the jokers.

4:40

There are 52 cards in a deck, but only 48 files in our experiment.

To simulate the experiment,

we need to remove some cards to hit a total sample size of 48.

We take cards out in such a way that, if we let

a face card represent a not-promoted file and a non-face card represent a promoted file,

the distribution of face and

non-face cards matches the distribution of the not-promoted and promoted files.

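So we're also going to take out three aces, which count as face cards here and leave us with 13 face cards, and one number card, which leaves 35 number cards.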

5:46

We now have the same number of cards as there are observations in our study.

Number cards represent files that were promoted, and there are 35 of them.

And face cards represent files that were not promoted, and there are 13 of those.

Then, we shuffle the cards and

deal them into two groups of size 24, representing males and females.

Note that random shuffling is what simulates this idea of

leaving things up to chance.
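
If you don't have a deck handy, a single shuffle-and-deal can be sketched in a few lines of Python. This is only an illustration of the card procedure described above: the labels, the variable names, and the seed are assumptions, and your numbers will differ with a different seed.

```python
import random

# 35 "number" cards stand for promoted files, 13 "face" cards for not promoted.
deck = ["promoted"] * 35 + ["not promoted"] * 13

rng = random.Random(0)  # arbitrary seed, used only so the run is repeatable
rng.shuffle(deck)       # the shuffle is what "leaves things up to chance"

# Deal two piles of 24; which pile we call "male" is arbitrary.
male_pile, female_pile = deck[:24], deck[24:]

male_rate = male_pile.count("promoted") / len(male_pile)
female_rate = female_pile.count("promoted") / len(female_pile)
simulated_diff = male_rate - female_rate

print(f"male: {male_rate:.1%}, female: {female_rate:.1%}, "
      f"difference: {simulated_diff:+.1%}")
```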

6:38

Let's go through the results of my simulation together.

If you have been following along with your own deck of cards,

you might have different results than mine since the shuffling and

splitting into two piles was done completely randomly.

Since we're randomly splitting the promoted files into two groups, we would

expect to see no difference between the proportions of male and female promotions,

in other words, between the proportions of number cards in the male and female piles.

That being said, the observed value may not exactly be zero.

In this case, we had 18 number cards in the male pile,

which yields a 75% promotion rate among the males.

And there are 17 number cards in the female pile,

yielding a 70.8% promotion rate.

The difference between the simulated promotion rates is what we want to

keep track of.

We expect this number to be zero, but we also expect it to vary, and

we want to know how much it varies so that we can compare our original difference of

roughly 30% to the distribution of differences simulated under the assumption of

independence between promotion decisions and gender.

In this case, we calculated a difference of 4.2%.

So we note that before we proceed to the next simulation.

8:28

It doesn't really matter which one you're calling male versus female.

So let's just say this is our male pile, and this is our female pile.

The next step is going to be to determine how many files were promoted in each pile.

Which means we need to count the number of number cards in each pile.

Among the males, I'm counting one, two, three,

four, five, six, seven, eight,

nine, ten, 11, 12, 13, 14, 15, 16, 17.

So we have 17 out of 24 males promoted.

Which should leave 18 out of 24 females promoted.

In the next step we need to calculate the proportions and take the difference and

note that on our dot plot.

And we would repeat this many, many times to build a simulation distribution.

So how do we ultimately make a decision?

If the results from the simulations look like the data,

then we decide that the difference between the proportions of promoted files,

between males and females, could plausibly be due to chance,

and that the data are consistent with promotion and gender being independent.

If, on the other hand, the results from the simulations do not look like the data,

then we decide that the observed difference in the promotion rates

was unlikely to have happened just by chance, and

that it can be attributed to an actual effect of gender.

In other words, we conclude that these data provide evidence of

a dependency between promotion decisions and gender.

If we repeat the simulation many times, and record the simulated differences in

proportions of males and females promoted, we can build a distribution like this one.

For example, here we have a dot plot of the distribution of

the simulated differences in promotion rates based on a hundred simulations.

While we showed earlier how to simulate this experiment using playing cards,

we should note that the task of the simulation is best left up to computation.

It's faster and less prone to errors.
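
As a rough sketch of what leaving it to computation might look like, the code below repeats the shuffle a hundred times and prints a crude text version of a dot plot. The helper name `simulate_diff`, the seed, and the text layout are assumptions made for illustration, so the output will not match the plot shown in the lecture.

```python
import random
from collections import Counter

def simulate_diff(rng):
    """Shuffle 35 promoted (1) and 13 not-promoted (0) files, deal 24 to the
    'male' pile, and return the male minus female promotion rate."""
    deck = [1] * 35 + [0] * 13
    rng.shuffle(deck)
    male_promoted = sum(deck[:24])
    female_promoted = 35 - male_promoted
    return (male_promoted - female_promoted) / 24

rng = random.Random(1)  # arbitrary seed for repeatability
diffs = [simulate_diff(rng) for _ in range(100)]

# One row per distinct simulated difference, one dot per simulation.
counts = Counter(round(d, 3) for d in diffs)
for value in sorted(counts):
    print(f"{value:+.3f} {'.' * counts[value]}")
```

Because 35 promoted files cannot be split evenly between two piles, no single simulated difference is exactly zero, but the simulated differences should still cluster around zero.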

The distribution is centered at zero, which we can also think about as the null value,

since according to the null hypothesis,

there should be no difference between the promotion rates of males and females,

yielding a difference of zero.

We can see from the distribution of the simulated differences in promotion rates

that it is very rare to get a difference as high as 30%,

the observed difference from the original data,

if in fact gender does not play a part in promotion decisions.

The low likelihood of this event, or a difference even more extreme,

suggests that promotion decisions may not be independent of gender, and so

we would reject the null hypothesis.

Our conclusion is then that these data show convincing evidence of an association

between gender and promotion decisions made by male bank supervisors.

11:33

Then we simulated the experiment.

Assuming that the null hypothesis were true, we evaluated the probability of

observing an outcome at least as extreme as the one observed in the original data.

And since this probability was low,

we decided to reject the null hypothesis in favor of the alternative.

The probability of observing data

at least as extreme as the data observed in the original study,

under the assumption that the null hypothesis is true, is called the p-value.

It is one of the commonly used criteria for

making decisions between competing hypotheses.
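
To tie this definition back to the card simulation, here is a sketch that estimates the p-value as the share of shuffles in which the male pile has at least 21 promoted files, the count seen in the original data. Treating only large differences in favor of males as extreme is an assumption based on the study question. The exact calculation that follows counts the possible deals directly. It is not part of the lecture and is included only as a cross-check. The names and the seed are illustrative.

```python
import random
from math import comb

def simulate_male_promoted(rng):
    """Shuffle 35 promoted (1) and 13 not-promoted (0) files and
    return how many promoted files land in the 24-card 'male' pile."""
    deck = [1] * 35 + [0] * 13
    rng.shuffle(deck)
    return sum(deck[:24])

observed_male_promoted = 21  # from the original study
n_sims = 10_000
rng = random.Random(2)       # arbitrary seed for repeatability

# Share of simulations at least as extreme as the observed data.
extreme = sum(
    simulate_male_promoted(rng) >= observed_male_promoted
    for _ in range(n_sims)
)
p_sim = extreme / n_sims

# Cross-check: count the deals with k promoted files in the male pile,
# for k = 21..24, out of all ways to choose 24 of the 48 files.
p_exact = sum(
    comb(35, k) * comb(13, 24 - k) for k in range(21, 25)
) / comb(48, 24)

print(f"simulated p-value: {p_sim:.4f}")
print(f"exact p-value:     {p_exact:.4f}")
```

Both numbers should come out small, in line with the conclusion that a difference this large is rare when promotion and gender are independent.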

We will continue our discussion on p-values and

hypothesis tests in future units as well and learn various methods for

conducting hypothesis tests for various types of data.