A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

From the lesson

Introduction and Module 1

This module, consisting of one lecture set, is intended to whet your appetite for the course, and examine the role of biostatistics in public health and medical research. Topics covered include study design types, data types, and data summarization.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

In this relatively short lecture section we're

going to discuss some of the types of

data we will encounter in this course, and learn how to deal with them analytically.

So in this short lecture section a brief summary is given of the types of

data that frequently occur in research studies and

will be dealt with analytically in this class.

And at the end of this lecture session, you should

be able to distinguish between continuous, binary, categorical, and

time-to-event data and give examples of each of these aforementioned data types.

So the first type of data we're going to work with is called

continuous data, data that takes on

a range of values with incremental measurements.

Some examples of continuous data would

include blood pressures, measured in millimeters of

mercury, weight measured in pounds or kilograms or ounces or some other unit.

Height measured in feet or centimeters or inches, et cetera.

Age measured in years or months or decades, et cetera.

Income level measured in dollars per year,

euros per year, rupees per month, et cetera.

We can go on giving you examples but a defining

characteristic of continuous data is that a one unit change

in the value means the same thing across the entire range of data values.

A one millimeter of mercury difference in blood pressure is

comparable whether we're comparing groups who have 120 versus 119

millimeters of mercury, or 105 versus 104, for example.

Another type of data we'll be dealing with and frequently used in

public health and medicine to quantify things is binary or dichotomous data.

It takes on only two values, quote unquote, yes or no.

Does the person have a condition or an outcome or a disease?

So some examples of binary measures: someone's polio status.

Do they have polio?

Yes they have polio, no they do not.

Amongst cancer patients is the person currently in remission?

Yes or no?

The sex of a person, male or female, or as a yes or no formulation, the

question could be, is the subject male? Yes if they're male, no if they're female.

Did this person quit smoking? Yes or no.

An extension of binary data would be categorical

data, which extends binary data to include more than two possible values.

And there are two types of categorical data. One is called nominal categorical data.

These are data whose levels have no inherent ordering to the categories,

like race or ethnicity or country of birth or religious affiliation.

So there may be more than two categories, but there

is no unique inherent ordering to the categories.

And then ordinal categorical data would be where there's order to the categories.

For example, if instead of measuring someone's

income level on a continuum we ask them which

range their income level falls in, then we may

have four categories from the lowest to the greatest.

Or many of you have answered questions on surveys which ask you to give your degree

of agreement from, say, strongly disagree

up to strongly agree across five categories.

So there is inherent ordering from strongly disagree to strongly agree, in

that increasing values on the scale mean increasing degrees of agreement.

And finally, we will be working with something called time to event data.

These are data that are a hybrid of continuous data and binary data.

They encapsulate two pieces of information about

an event: whether the event occurs, and the

time to the occurrence of the event or the time to the last follow-up without the occurrence.

So these data arise in cohort studies

following subjects over time to see whether they develop

a condition or not within the time period, and

if they do develop the condition, or have the

outcome, we measure the time at which it occurs.
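
One way to picture a time-to-event record is as a pair: a follow-up time plus a yes/no flag for whether the event occurred. The sketch below is illustrative only; the class name, field names, and values are hypothetical and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    """One subject's time-to-event record (hypothetical structure)."""
    time_months: float  # time to the event, or to last follow-up if no event
    event: bool         # True if the event occurred, False if it did not


# Made-up records: two subjects had the event, and one was still
# event-free at the last follow-up (often called a censored observation).
records = [
    FollowUp(time_months=4.0, event=True),
    FollowUp(time_months=12.0, event=False),
    FollowUp(time_months=7.5, event=True),
]

# The binary piece (did it happen?) and the continuous piece (how long?)
n_events = sum(r.event for r in records)
total_follow_up = sum(r.time_months for r in records)
print(n_events, total_follow_up)
```

Keeping both pieces together matters because a subject without the event still contributes follow-up time.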

So, the reason I bring all of these up (and we'll

give detailed examples throughout the next sets of lectures,

in fact, throughout this entire term)

is that there are

different statistics to quantify different data types.

So for continuous data, for example

blood pressures, suppose we were to compare blood

pressures in a clinical trial evaluating two blood

pressure-lowering medications, or medications intended to lower blood pressure.

So we have two groups and we want to see

how the change in blood pressure differs between the two groups.

We could measure the association between the change in

blood pressure and the medication by estimating the mean

difference in blood pressure change, from the before-study measurement

to the after-study measurement, between the two treatment groups.

This would quantify the degree of association

between the treatments and blood pressure change.

We could then estimate what's called a 95% confidence

interval, and/or use what's called a two-sample t-test

to test for population-level differences in the mean blood pressure change.
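
As a rough sketch of the mean-difference idea, the code below computes a difference in mean blood pressure change between two groups and an approximate 95% confidence interval. The data are made up, the function name is hypothetical, and the interval uses a large-sample normal approximation as a simplifying assumption; a two-sample t-test would use the t distribution instead, which matters for samples this small.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(group_a, group_b, level=0.95):
    """Difference in sample means with an approximate confidence interval.

    Assumption: normal approximation (not the t distribution).
    """
    diff = mean(group_a) - mean(group_b)
    # Standard error of the difference, allowing unequal variances
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return diff, (diff - z * se, diff + z * se)


# Made-up changes in blood pressure (after minus before, in mmHg)
drug_a = [-12.0, -8.0, -15.0, -10.0, -9.0, -11.0]
drug_b = [-5.0, -7.0, -3.0, -6.0, -4.0, -8.0]

diff, (ci_low, ci_high) = mean_difference_ci(drug_a, drug_b)
print(f"mean difference: {diff:.2f} mmHg, "
      f"approx. 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is the sample estimate plus or minus a margin that reflects the uncertainty from working with samples rather than whole populations.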

And what I mean by that is, we're only

going to be able to evaluate information from a sample.

And there's some uncertainty in that

information, because we're dealing with

samples, which are imperfect subsets of the populations we wish to study.

And we're ultimately going to want to

incorporate that uncertainty into a statement about

the behind-the-scenes populations we've sampled from.

Suppose we were trying to compare the proportion of polio

cases in the two treatment arms of the Salk polio vaccine trial.

To quantify the association with the vaccine versus not,

we could estimate the difference in proportions,

that is, the difference in the proportions of children contracting

polio between those who got the vaccine and those who got the placebo.

This would be called the risk difference.

We could also quantify this by taking the

ratio of proportions, what would be called a relative

risk or risk ratio, and then we could

actually estimate 95% confidence intervals for those quantities.
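
To make the two binary-outcome measures concrete, here is a minimal sketch that computes a risk difference and a risk ratio from two groups, with an approximate 95% confidence interval for the risk difference. The counts are invented for illustration (they are not the trial's results), the function name is hypothetical, and the interval relies on a large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def risk_summary(cases_1, n_1, cases_2, n_2, level=0.95):
    """Risk difference, risk ratio, and an approximate CI for the difference."""
    p1 = cases_1 / n_1  # proportion with the outcome in group 1
    p2 = cases_2 / n_2  # proportion with the outcome in group 2
    risk_difference = p1 - p2
    risk_ratio = p1 / p2
    # Large-sample standard error of a difference in two proportions
    se = sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (risk_difference - z * se, risk_difference + z * se)
    return risk_difference, risk_ratio, ci


# Made-up counts: 10 cases among 1000 vaccinated, 40 among 1000 on placebo
rd, rr, rd_ci = risk_summary(10, 1000, 40, 1000)
print(f"risk difference: {rd:.3f}, risk ratio: {rr:.2f}")
print(f"approx. 95% CI for risk difference: ({rd_ci[0]:.3f}, {rd_ci[1]:.3f})")
```

The two measures describe the same comparison on different scales: the difference is absolute, the ratio is relative.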

And/or we could use a chi-squared test to

test for population-level differences based on our study

sample results. Suppose we had time-to-event data

to compare differences in time to contracting HIV

between HIV-negative IV drug users in a

needle exchange program and HIV-negative IV drug

users not enrolled in a needle exchange program.

A researcher could estimate the incidence rate ratio

for contracting HIV that compares these two groups, and

construct what's called a Kaplan-Meier curve for each

group to provide a graphical description of the

time-to-HIV profile for the needle exchange

group and the group without the needle exchange,

and then estimate a 95% confidence interval for this incidence rate ratio and/or

use what is called a log-rank test to test for a population-level difference.
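
The two time-to-event summaries mentioned here can be sketched in a few lines. The code below computes an incidence rate ratio and the points of a Kaplan-Meier curve using the standard product-limit calculation. The follow-up data are made up, the function names are hypothetical, and the confidence interval and log-rank test are left out to keep the sketch short.

```python
def incidence_rate(records):
    """Events per unit of follow-up time; records are (time, event) pairs."""
    events = sum(1 for _, event in records if event)
    person_time = sum(time for time, _ in records)
    return events / person_time


def kaplan_meier(records):
    """Return [(time, estimated event-free proportion)] at each event time."""
    curve = []
    survival = 1.0
    event_times = sorted({time for time, event in records if event})
    for t in event_times:
        # Subjects still under follow-up just before time t
        at_risk = sum(1 for time, _ in records if time >= t)
        events = sum(1 for time, event in records if time == t and event)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up (months of follow-up, contracted HIV?) pairs for each group
exchange = [(6, False), (12, True), (18, False), (24, False)]
no_exchange = [(3, True), (9, True), (12, False), (15, True)]

irr = incidence_rate(exchange) / incidence_rate(no_exchange)
km_exchange = kaplan_meier(exchange)
km_no_exchange = kaplan_meier(no_exchange)
print(f"incidence rate ratio: {irr:.3f}")
print(km_exchange)
print(km_no_exchange)
```

Subjects who never have the event still count in the at-risk group up to their last follow-up, which is how both pieces of information in each record get used.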

So what we're going to be doing over

the next term is we're going to actually be dealing with

studies where the measurements of interest are done on these different scales.

And we're going to go into extensive detail about some of the

quantifications we alluded to in these past few slides, and also get to

this inferential part where we'll be taking our sample results, adding in the

uncertainty coming from the fact that

these results are based on imperfect samples

from some larger population

or populations of interest, and

quantifying the degree of that uncertainty.

So we'll be doing things like talking about

why a mean difference is a reasonable

way to compare continuous outcomes between different groups,

how to put confidence limits on that difference,

how to test for a real difference after accounting

for uncertainty in the estimate, and so on and

so forth.

So we have a huge exciting term lined up and we'll be

dealing with all of these things in intimate detail from the first

principles of how to summarize the results from a single sample to

how to quantify association when comparing

different samples to doing the statistical inference.
