Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2


From the lesson

Discrete Data Settings

In this module, we'll discuss testing in discrete data settings. This includes the famous Fisher's exact test, as well as the many forms of tests for contingency table data. You'll learn the famous "observed minus expected, squared, over expected" formula, which is broadly applicable.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

So consider this next example, which I took, and modified a little bit, from Agresti's wonderful book on categorical data analysis.

So here we're looking at babies' birth weights cross-classified by maternal age. And let's assume that the way in which it was sampled is that there were, say, 400 people sampled, so that by design neither of the two margins was fixed. And we're interested, then, in treating the cell counts as if they were a four-dimensional multinomial count with N equal to 400 total.

And then what we would like to know is whether the variable birth weight is independent of maternal age, versus the alternative that birth weight is not independent of maternal age. So let's think about the problem this way, and see if we can logic our way to the expected cell counts. [SOUND] Okay, so let's first note that, not just under the null hypothesis but regardless of the hypothesis, our estimate of the probability of young maternal age is always going to be 100 over 400, and of older maternal age 300 over 400. Those are the margins of the table where we disregard birth weight. Doing the same thing for birth weight, disregarding maternal age, the estimate for low birth weight would be 50 over 400, and for normal birth weight 350 over 400. Okay?

And the cell probabilities would then give you the specific combined probabilities. If we want to talk about younger maternal age and lower birth weight, our estimate of this probability without assuming independence, that is, under the alternative hypothesis, would be 20 over 400. But if we're under the null hypothesis, where we're assuming maternal age and birth weight are independent, then we logically construct this probability as the product of the two marginal probabilities, because we're probably estimating the margins a little better.

So that would be 0.25 times 0.125, because we're multiplying the marginal probability of young maternal age times the marginal probability of low birth weight.
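As a minimal sketch of the two competing estimates for the (young, low birth weight) cell, the arithmetic can be written out in Python. The counts are the ones quoted in the lecture; the variable names are illustrative only.

```python
# Counts quoted in the lecture: 400 births in total, 100 to younger
# mothers, 50 with low birth weight, 20 in the (young, low) cell.
n = 400
young_total = 100
low_total = 50
young_low_observed = 20

# Marginal estimates; these hold regardless of the hypothesis.
p_young = young_total / n  # 100 / 400
p_low = low_total / n      # 50 / 400

# Cell estimate without assuming independence (the alternative).
p_cell_alternative = young_low_observed / n

# Cell estimate under the null hypothesis of independence:
# the product of the two marginal probabilities.
p_cell_null = p_young * p_low

print("alternative:", p_cell_alternative)
print("null:", p_cell_null)
```

The gap between the two estimates is what the chi-squared statistic ends up measuring, once it is scaled to counts.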

Okay? We can only do that under the null hypothesis. So let's work on our expected counts. Our expected count for the 1,1 cell, low birth weight and young maternal age, is this probability times the 400 sample size, and we get 12.5 as our expected count. Then you can follow through for the other three cells in the same way to get the expected counts, and then compare them using the same formula, observed minus expected squared over expected. And we get our chi-squared statistic, which in this case is 6.86. We then compare it to a chi-squared critical value, which is around four. Of course, we talked about it being the square of the Z statistic, so it makes sense that it would be around four.
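The expected counts and the statistic can be sketched in a few lines of Python. The lecture states only the (young, low) count of 20 and the margins (100 and 300 for maternal age, 50 and 350 for birth weight); the other three observed cells below are filled in by subtraction from those margins, which is an assumption about the slide rather than something read off it.

```python
# Rows: young, older maternal age. Columns: low, normal birth weight.
# Only the 20 and the margins are quoted in the lecture; the other
# cells are derived by subtraction (80 = 100 - 20, 30 = 50 - 20,
# 270 = 300 - 30).
observed = [[20, 80],
            [30, 270]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Under independence: expected count = n * (row share) * (column share),
# which simplifies to row_total * col_total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over all four cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(round(chi_squared, 2))
```

With these counts the 1,1 expected count comes out to 12.5 and the statistic rounds to 6.86, matching the values quoted in the lecture.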

Or we could just calculate our chi-squared P value, which would be the probability of getting a test statistic as large as 6.86 or larger. So I hope everyone can follow these calculations. The idea is that we're basically calculating how distant our observed counts are from our best estimate of what we would expect the counts to be under the null hypothesis of independence between the row and column variables.
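For one degree of freedom, the P value can be sketched with nothing but Python's standard library, because a chi-squared variable with one degree of freedom is the square of a standard normal. The helper name `chi2_sf_1df` is made up for this illustration; in practice a statistics package would normally be used instead.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability P(X >= x) for a chi-squared with 1 df."""
    # If Z is standard normal, then Z^2 is chi-squared with 1 df, so
    # P(Z^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# P value for the statistic quoted in the lecture.
p_value = chi2_sf_1df(6.86)

# Sanity check: 3.84 is roughly 1.96 squared, so the upper tail
# beyond it should be close to 0.05.
tail_at_critical = chi2_sf_1df(3.84)

print(p_value, tail_at_critical)
```

The P value lands a little under 0.01, consistent with the statistic exceeding the critical value of about four.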

I should also add, before I complete this slide, that the answer we get from this test of independence is identical to the answer we get from the test of proportions. You get the same chi-squared value and the same P value.
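The equivalence can be checked numerically. The sketch below assumes that "the test of proportions" means the two-sample Z test with a pooled variance estimate, and it compares low birth weight rates of 20 out of 100 (younger mothers) and 30 out of 300 (older mothers). The 30 is derived from the margins quoted in the lecture rather than stated directly.

```python
import math

# Low birth weight counts and group sizes by maternal age.
x_young, n_young = 20, 100
x_older, n_older = 30, 300  # 30 = 50 low birth weight in total - 20

p_young = x_young / n_young
p_older = x_older / n_older

# Pooled proportion under the null hypothesis of equal proportions.
pooled = (x_young + x_older) / (n_young + n_older)

# Standard error of the difference, using the pooled estimate.
se = math.sqrt(pooled * (1 - pooled) * (1 / n_young + 1 / n_older))

z = (p_young - p_older) / se
z_squared = z ** 2

print(round(z, 3), round(z_squared, 2))
```

Squaring the Z statistic gives about 6.86, the same value as the chi-squared statistic from the test of independence.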

Now, the interpretation might be very different, as we talked about before, if in one case there was randomization as to which of the rows you were placed in. In this case, if it was a multinomial sample, the interpretation of the results is dramatically different. We don't really cover a lot of things like epidemiological-style sampling designs in this class, but suffice it to say, the interpretations can be very different. Nonetheless, the actual number, the P value we see, is identical regardless of which set of assumptions you make. So that's interesting, and what's even more interesting is that you can formulate Poisson models that again yield the same conclusions.

So here's another example from Agresti's Categorical Data Analysis book, which is a book I highly recommend. I think it's a real classic in the area.

But I should disclose a conflict of interest: Agresti is a great friend and close colleague of mine. [COUGH] Anyway, in this example he is looking at a collection of different occupations, cross-classified by alcohol use. And suppose the investigators in this study went out and found 300 clergy, 250 educators, 300 executives, and 350 retailers, and then asked them a question about their alcohol use. And we'd like to know whether alcohol use differs by occupation.

Interest then lies in testing whether or not the proportion of high alcohol use is the same in the four occupations.

So if we label P1 the proportion of high alcohol use among the clergy, and so on, we want to test whether they are all equal. But we don't want to specify what that proportion is, so let's say P is the common proportion across all occupations. And then the alternative would be the opposite of that, that at least two are unequal. So our estimate of P, this common unknown proportion, the obvious estimate of it, would be 233 over 1200, and then of course the estimate of 1 minus P would be 967 over 1200. So our observed first count is 32. What would be our expected first count? Well, under the null hypothesis, where occupation is irrelevant, we expect a proportion of about 233 over 1200 to be high alcohol users. So we multiply that times the row count, 300, and we would get the expected cell count.
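The expected counts depend only on the margins, so they can be sketched from the numbers quoted in the lecture: the four occupation totals and the 233 high alcohol users overall. The per-occupation observed counts, other than the 32 for clergy, are not given in the transcript, so this sketch stops at the expected counts and does not attempt the statistic. The dictionary layout is an illustrative choice.

```python
# Row totals fixed by the sampling design, as quoted in the lecture.
row_totals = {
    "Clergy": 300,
    "Educators": 250,
    "Executives": 300,
    "Retailers": 350,
}
high_total = 233  # high alcohol users across all occupations

n = sum(row_totals.values())

# Estimate of the common proportion P under the null hypothesis.
p_high = high_total / n

# Expected (high, low) counts per occupation: row total times the
# common proportion, and row total times one minus that proportion.
expected = {
    occupation: (total * p_high, total * (1 - p_high))
    for occupation, total in row_totals.items()
}

for occupation, (exp_high, exp_low) in expected.items():
    print(f"{occupation}: high {exp_high:.2f}, low {exp_low:.2f}")
```

For the clergy this gives an expected high count of 58.25, against the observed 32, and the two expected counts in each row add back up to the row total.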

Then the low alcohol usage cell, where we observed 268: both the observed and the expected counts have to add up to the margins, by the way. So you could take 300 minus the 1,1 expected cell count to get it. But otherwise, you could just say 300 times the probability of low alcohol usage, which is then 1 minus 233 over 1200, or 967 over 1200. And you would repeat that down for each occupation.

Calculate our chi-squared statistic, observed minus expected squared over expected, summed over all of the cells, and you get 20.6. And we need to compare it to a chi-squared critical value. But the degrees of freedom change. It turns out that the general rule for the degrees of freedom of the chi-squared in these settings is rows minus 1 times columns minus 1. So in this case there are four rows and two columns, so the degrees of freedom is three. So then here's our P value: pchisq(20.59, 3, lower.tail = FALSE). It's about zero. It's pretty clear that some of them are different. [SOUND] Okay, here's another example.
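The lecture evaluates this P value with R's pchisq. As a rough standard-library equivalent in Python, the upper tail of a chi-squared with three degrees of freedom has a closed form in terms of the complementary error function. The helper name `chi2_sf_3df` is made up for this sketch, and the closed form applies only to three degrees of freedom.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability P(X >= x) for a chi-squared with 3 df."""
    # Closed form for 3 degrees of freedom:
    #   erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


# Degrees of freedom: (rows - 1) * (columns - 1).
rows, cols = 4, 2
df = (rows - 1) * (cols - 1)

# P value for the statistic quoted in the lecture.
p_value = chi2_sf_3df(20.59)

# Sanity check: 7.815 is the usual 5% critical value for 3 df.
tail_at_critical = chi2_sf_3df(7.815)

print(df, p_value, tail_at_critical)
```

The P value comes out on the order of one in ten thousand, which is what "about zero" means here.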

So this is from Rice's book, Mathematical Statistics and Data Analysis. First of all, I have no affiliation with Rice and have never met him, so it's easier for me to say this: I love this book. I think it's wonderful. So if you're looking for a book recommendation, I like that one. [SOUND] In addition, I really like Agresti's book, but I'm willing to stipulate my conflict of interest in recommending it. But I do really like it. I read it all the time.

So anyway, in this book he has this interesting example, where a bunch of words were taken from some novels [UNKNOWN]. Two of them were known to be Jane Austen novels, and one of them was in question as to whether or not it was written by that author. Let's say they found it later.

And there are maybe other ways you would want to analyze this data for this reason, but we want to use it as an example for the chi-squared. So don't think too hard about specifically how you would analyze this data, because I doubt this would be what you would arrive at immediately. But it's not unreasonable, by the way. It's a sensible approach.
