Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.


From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2



From the lesson

Techniques

This module is a bit of a hodgepodge of important techniques. It includes methods for discrete matched pairs data as well as some classical non-parametric methods.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Hi, my name is Brian Caffo, and this is Mathematical Biostatistics Boot Camp, lecture 12. We're going to be talking about nonparametric statistics. In this video we'll cover nonparametric tests, including the sign test, which is useful for paired data and is related to McNemar's test; the signed rank test, which is also for paired data, along with Monte Carlo versions of that test; and then independent group tests, particularly the Mann-Whitney test, Monte Carlo versions of it, and their relationship to permutation tests, which are very closely related.

Okay, so very briefly, we're going to be talking here about so-called non-parametric tests, of a very classical kind, and these are often called distribution free. That of course doesn't mean that they're assumption free. They do involve assumptions, for example sampling assumptions such as independent and identically distributed observations, but they require fewer assumptions than parametric methods.

They have a tendency to focus a little more on testing rather than estimation, which may be a problem, but there are estimation techniques that follow from them. They also tend not to be very sensitive to outlying observations, and they're especially useful for data like ranks, if the data actually come in the form of ranks, because they often involve transforming data to ranks. They're not uniformly wonderful, because they do throw out some information, which is their problem; because of that, they may wind up being less powerful than their parametric counterparts when the parametric assumptions are true, of course. For larger sample sizes, though, they become about as efficient as their parametric counterparts, so they are pretty good tests.

So here's some data from the wonderful book by Rice called Mathematical Statistics and Data Analysis. I highly recommend this book; I use the second edition, I don't know if that's what they're still on. Anyway, this data concerns 25 fish. Mercury levels were taken from each fish in parts per million at two locations, so fish one had measurements of 0.32 and 0.39, and so on. Just for reference, I've added the difference between those two measurements, P minus SR, for each fish. And then I'll show the ranks and the signed ranks, which I'll explain in a minute, but this is the data we're going to use as motivating data.
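To make the difference, rank, and signed-rank columns concrete, here is a minimal sketch with hypothetical paired measurements. Only fish one's values, 0.32 and 0.39, come from the lecture; the rest of the numbers (and the assignment of which measurement is P versus SR) are made up, and there are no ties, which keeps the ranking simple.

```python
# Hypothetical paired mercury measurements (ppm); only fish one's
# 0.32 and 0.39 are from the lecture, the rest are invented.
sr = [0.32, 0.45, 0.28, 0.51]
p = [0.39, 0.41, 0.30, 0.60]

# Differences as in the lecture: P minus SR, one per fish
diffs = [pi - si for pi, si in zip(p, sr)]

# Rank the absolute differences (1 = smallest), then attach the
# sign of each difference to its rank to get the signed ranks
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for r, i in enumerate(order, start=1):
    ranks[i] = r
signed_ranks = [r if d > 0 else -r for r, d in zip(ranks, diffs)]
print(signed_ranks)
```
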

Here we want to test whether the mercury levels taken at location P differ from those at location SR, where we're taking two measurements per fish, trying to control for the fish-to-fish variability by taking both measurements on each fish, with each fish serving as its own control. But we're going to be talking about nonparametric tests, so what we're concerned about is the validity of the assumptions that go into typical tests, such as normality.

Okay, so let's let Di be the difference for each fish; in this case, I'm subtracting P minus SR. Then let theta be the population median of the differences Di. We want to test whether that median is zero versus the median being non-zero. Now, by the definition of the median, theta equals 0 if and only if the probability of a difference being greater than 0 is exactly 0.5; that's the definition of a median, and the same holds for being less than zero.

So, as a test statistic, why don't we just count the number of times Di is bigger than zero? If that count is excessively large or excessively small, then that disputes the idea that the median is exactly zero. Just as an example, if all the differences were positive, then zero couldn't be the median, because you wouldn't expect a large sample where every single measurement was larger than the population median.

Anyway, if we're assuming the fish pairs are IID, then each difference is like a coin flip, with a 50% chance of being above the median and a 50% chance of being below it. So X, the number of times the difference is larger than 0, is binomial in n and p, and in this case p is 0.5 under the null hypothesis. Our sign test just tests whether p is 0.5 using this count X, and you can do an exact binomial test, like we've talked about before.
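The exact sign test is just a binomial tail computation, so it can be sketched in a few lines of pure Python, assuming the 15-out-of-25 count the lecture reports for the fish data (an exact binomial test routine such as R's binom.test would give the same answer):

```python
from math import comb

n, x = 25, 15  # 25 fish pairs, 15 positive differences (P - SR > 0)

# Under H0 (median difference = 0), the count of positive differences
# is Binomial(n, 0.5). The distribution is symmetric, so the two-sided
# p-value is twice the upper tail, capped at 1.
p_upper = sum(comb(n, k) for k in range(x, n + 1)) / 2**n
p_value = min(1.0, 2 * p_upper)
print(round(p_value, 4))  # about 0.42, matching the lecture
```
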

Okay, so let's go back to our example. Theta is the median of the differences P minus SR; our null hypothesis is that theta equals zero, versus the alternative that it's different from zero. The number of instances where the difference is bigger than zero (go back to the table) is 15 out of 25 fish. The binomial test, then, is the question of whether 15 positive instances out of 25 is large. Our expected number out of 25 is 12.5, so we don't know offhand whether 15 is excessively large. Well, in this case it turns out that, no, 15 is not excessively large: a result this extreme has about a 42% chance of happening under the null hypothesis, a two-sided p-value of 0.42. And, again, we could have used a large sample test (I don't know why, because we can do an exact test in this case), but we could have, and in R that's prop.test. You then get a chi-squared statistic of 0.64 and a p-value that's quite similar, 0.42.
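The large-sample version can be sketched the same way. Assuming prop.test's default Yates continuity correction (the statistic below reproduces the 0.64 the lecture quotes under that assumption), a pure-Python version is:

```python
from math import erf, sqrt

n, x, p0 = 25, 15, 0.5

# One-sample proportion test with Yates continuity correction,
# which is what R's prop.test applies by default
chi_sq = (abs(x - n * p0) - 0.5) ** 2 / (n * p0 * (1 - p0))

# p-value from the chi-squared(1) survival function, which equals
# 2 * (1 - Phi(sqrt(chi_sq))) in terms of the standard normal CDF
phi = 0.5 * (1 + erf(sqrt(chi_sq) / sqrt(2)))
p_value = 2 * (1 - phi)
print(round(chi_sq, 2), round(p_value, 2))  # 0.64 0.42
```
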

At any rate, the idea is simply this: if you want to test whether levels at one location are higher than at the other, you just count the number of matched pairs where it's higher, and ask whether that count is excessively large relative to a coin flip for each pair. So that's the sign test.

And you might be wondering what's wrong with this, so let's discuss some potential problems with these tests, because we aren't using very many assumptions. We're using that the fish pairs are IID, and that's about it; that's the only assumption we're using. But let's talk about what may be some of the problems.

Okay, so the biggest problem, of course, is that the magnitude of the differences is discarded, so the test is potentially not as powerful as you would hope. It would be different if, say, only half of the differences were positive, but all the positive ones were much larger differences and all the negative ones were really small differences; that would be different than if they were spread equally above and below 0. The other thing I would mention is that there's nothing specific about 0. You could have tested any median value, theta equal to theta 0, by counting the number of times the difference is bigger than that specific value, testing whether the median equals that value.

What's interesting about that, and we won't go into detail here, is that you can do this for any value of theta 0. So that means you can find the values of theta 0 for which you fail to reject and the values for which you reject. And if you can do that, by a grid search, say, then you can invert the test and get a confidence interval for the median. So this is kind of an interesting, very highly non-parametric way to get a confidence interval for the median of a set of data.
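Here's a minimal sketch of that inversion, with hypothetical paired differences standing in for the fish data: the 95% interval is just the set of candidate medians theta 0 that the exact sign test fails to reject at the 5% level, found by a grid search.

```python
from math import comb

def sign_test_pvalue(diffs, theta0):
    """Exact two-sided sign-test p-value for H0: median == theta0.
    Differences tied with theta0 are dropped, a common convention."""
    signs = [d - theta0 for d in diffs if d != theta0]
    n = len(signs)
    x = sum(1 for s in signs if s > 0)
    tail = min(x, n - x)
    p_low = sum(comb(n, k) for k in range(0, tail + 1)) / 2**n
    return min(1.0, 2 * p_low)

# Hypothetical paired differences, standing in for the fish data
diffs = [0.07, -0.02, 0.11, 0.05, -0.04, 0.09, 0.01, -0.03, 0.06, 0.12]

# Grid search: the 95% CI for the median is the set of theta0
# values we fail to reject at the 5% level
grid = [round(-0.2 + 0.001 * i, 3) for i in range(401)]
ci = [t for t in grid if sign_test_pvalue(diffs, t) > 0.05]
print(min(ci), max(ci))
```

Note that the resulting interval is wide for such a small sample, which reflects the discreteness of the binomial: with only 10 pairs, the achievable significance levels jump in coarse steps.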
