So we've described some very basic examples of random variables. What we need now is a mathematics of random variables in order to use them. We have a mathematics of probability, and we've acknowledged that we're at least willing to think of certain kinds of variables as if they're random. We'd like to put those two ideas together.
So we need functions that map the rules of probability onto random variables. For discrete random variables, the kind of functions we're talking about are so-called probability mass functions. A probability mass function is simply a function that takes the values the random variable can take and maps them to their associated probabilities. So for a die, p of one would be one-sixth, for example. And it turns out quite a few functions satisfy the definition of being a probability mass function.
In fact, you only have to satisfy two rules if you'd like to be a probability mass function. The first rule is that you have to be greater than or equal to zero for all of the arguments, where here x ranges over the collection of possible values that the random variable can take. And the second rule is that if you sum over all possible values, then you get one. This is exactly analogous to our probability statement that the probability of the whole sample space has to be one, but here we've put it in terms of a probability mass function.
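As a quick aside, here is a minimal R sketch of my own, not from the slides, that checks those two rules for the fair die:

    ## PMF of a fair die: p(x) = 1/6 for each of x = 1, ..., 6
    p <- function(x) rep(1 / 6, length(x))
    all(p(1:6) >= 0)   # rule one: never negative
    sum(p(1:6))        # rule two: the probabilities sum to one

Run it and you'll see TRUE for the first check and one for the sum.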
I want to talk a little bit about this notation. Notice here I have this small x, whereas when we defined random variables two pages previously, we used the capital X. This is very common and maybe slightly unfortunate notation, but it is used everywhere, so you might as well get used to it instead of fighting it: we use an uppercase letter, typically, to represent the random variable as a conceptual entity.
So if we say capital X, we're talking about a die roll that we could have. When we use a lowercase x, or a lowercase y, or a lowercase letter of any sort, we tend to be talking about realized values of the random variable. So the lowercase x should be something that you can plug a number into, whereas capital X is a conceptual random variable: it's a conceptual flip of a coin, a conceptual roll of a die. Lowercase x is one or two or three or zero. Okay, it's slightly unfortunate notation and it takes a little bit of getting used to, but everyone who works in statistics or probability has gotten used to it, and everyone does it, so you might as well do it too.
Let's go over an example of constructing a probability mass function, taking the simplest possible example, a coin flip. Let X be the result of a coin flip, where zero represents a tail and one represents a head, and let's assume that the coin is fair. So we want a function that maps zero to one-half and one to one-half, and there are infinitely many ways you could write down that function. Well, we're gonna pick one.
Here we write it as one-half raised to the power x, times one-half raised to the power one minus x. And notice that if you plug in x equals zero, you get one-half, and if you plug in x equals one, you also get one-half.
Now let's go to a slightly more complicated example, where we assume that the coin is potentially biased, i.e., that it's not fair. Let theta be the probability of a head, in this case expressed as a proportion between zero and one. Just as an example, imagine theta were 0.3 instead of one-half; then we would think that the probability of a head is 0.3 and the probability of a tail is 0.7. But let's leave it as theta for right now.
So we want a function that says the probability of a zero is one minus theta and the probability of a one is theta. And then we see that our function here, theta to the x times one minus theta to the one minus x, exactly satisfies these properties.
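Here's a small R sketch of my own showing that this function does what we want; the theta values are just illustrative:

    ## Bernoulli PMF: p(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}
    pmf <- function(x, theta) theta^x * (1 - theta)^(1 - x)
    pmf(0, theta = 0.5); pmf(1, theta = 0.5)    # fair coin: both one-half
    pmf(0, theta = 0.3); pmf(1, theta = 0.3)    # biased coin: 0.7 and 0.3
    pmf(0, theta = 0.3) + pmf(1, theta = 0.3)   # the probabilities sum to one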
This is another common notation in the field of statistics: Greek letters like theta represent the things we don't know but would like to know. So imagine you had a coin and you didn't know whether or not it was fair; we would represent that unknown probability of a head as theta.
So I want to give you a sense of where we're going. In this case the probability mass function is the entity that governs the population of coin flips. And so, if we want to know theta, we're gonna collect data to estimate it and then evaluate the uncertainty in that estimate. And the way we're going to evaluate the uncertainty in that estimate is by using this probability distribution. So all the probability distributions we're going to talk about are conceptual models of populations, and they are the entities that are going to tie our data to the population. At any rate, right now this may sound a little heavy, and we'll discuss it in much more detail throughout the entire class, but the one rule I want you to remember right now is that unknown things that we want to know, like in this case the probability of a head, are generally denoted by Greek letters. These are usually called parameters.
I also want to note one other thing. Why, among all the possible ways that we could have written out this probability mass function, did we choose theta to the x times one minus theta to the one minus x? There are lots of different ways we could have done this; you can try and figure some of them out yourself. Well, it turns out, and we'll discuss this at length, that in probability, multiplying is very useful, and so we want probability mass functions that make multiplication very easy. If we take things and raise them to powers, then multiplying becomes easy. That's a general rule, and you'll see later on why this is the case. But at any rate, this is why we choose this particular form of the probability mass function when you could write it in so many different ways.
But I wanna say, people have thought about this a lot, and this is definitely the most useful way to write out this particular probability mass function.
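To give a flavor of why, here is a little illustration of my own, not from the slides: with several independent flips, the product of the individual PMF values collapses into one tidy power form.

    ## Hypothetical values: theta = 0.3 and four observed flips (1 = head, 0 = tail)
    theta <- 0.3
    flips <- c(1, 0, 1, 1)
    prod(theta^flips * (1 - theta)^(1 - flips))      # multiplying the PMF values
    theta^sum(flips) * (1 - theta)^sum(1 - flips)    # same number, one power form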
So consider again the unfair coin. Our probability mass function satisfies p of zero equals one minus theta and p of one equals theta. Let's just go through the exercise of proving to ourselves that this is in fact a probability mass function. It's greater than zero, because it's one minus theta at zero and theta at one, and in this case theta is between zero and one, so the function is greater than zero for both x equal to zero and x equal to one. And the sum of the probabilities, the probability of zero plus the probability of one, is theta plus one minus theta, which is one. So it satisfies the two rules that probability mass functions have to satisfy.
So that covers the principal entity that we're going to use to model discrete random variables, probability mass functions. Now we need to cover the principal entity that we're going to use to model continuous random variables, which is called the probability density function. Probability density functions are abbreviated PDF, by the way; here PDF stands for probability density function, not portable document format, which is what lots of people think of, but in statistics no one thinks of PDFs that way.
I want you to remember one very important rule, and I put it in italics to make sure everyone remembers it. By the end of the course this will be second nature to you, but if you haven't seen it before, it might seem a little odd. The way probability density functions work is that areas under probability density functions correspond to probabilities for the random variable.
And there's definitely one undisputed king of all PDFs, and that is the so-called bell curve, the so-called normal density function. So if you've ever heard a bell curve talked about a lot and wondered what in the world a bell curve is accomplishing: well, areas under bell curves correspond to probabilities. If you're modeling something as if the population it belongs to follows a bell curve, then you are saying that the probabilities associated with that random variable are governed by areas under that bell curve.
That's just one example of a PDF; there are lots of different kinds of PDFs. Just like probability mass functions have to follow two rules, probability density functions have to follow two rules to be valid. They have to be nonnegative for all the possible values that the random variable can take (that set is usually called the support), and their integral has to be one.
I would also make a small point here. We define probability density functions as if they operate on the whole real line. So even if your random variable can only take values, say, between zero and two, like we talked about earlier with the pencil experiment, we define the probability density function as zero below zero and zero above two, so that there's no associated probability there, but the density is defined on the whole real line and we can take its integral from minus infinity to plus infinity. In this class we tend to be a little bit fuzzy about this: sometimes we'll operate on minus infinity to plus infinity, and other times we'll just write out zero to two, discarding all the area where the function is zero, and I hope from the context it will be clear what we're doing. This final property, property two here, that the integral of the probability density function over the whole real line has to be one, is again simply saying that the random variable has to take some value, that it has to fall somewhere on the real line.
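As a sketch of my own illustrating this convention, take a density that is one-half between zero and two, like the pencil example, and zero everywhere else; the numerical check integrates over the whole real line:

    ## Density that is 1/2 on [0, 2] and zero elsewhere
    f <- function(x) ifelse(x >= 0 & x <= 2, 1 / 2, 0)
    all(f(c(-1, 0.5, 3)) >= 0)      # rule one: never negative
    integrate(f, -Inf, Inf)$value   # rule two: integrates to one (up to numerical error)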
Let's go through a specific example of a PDF, and let's put it in a context. Let's assume that the time in years from diagnosis until death with a specific kind of cancer follows a density that looks like this: f(x) equals e to the negative x over five, divided by five, for x greater than zero. The "greater than zero" is contextually clear, because you can't have negative time from diagnosis, and the person is presumably alive at the time of diagnosis. This is a very restricted example of a density that's commonly used in these sorts of analyses of things like survival times. It's called the exponential density function.
And again, here you see that we have f(x) written as e to the negative x over five, divided by five, for x bigger than zero, and zero otherwise. Like I talked about on the previous slide, we often just ditch that zero and talk about f(x) as the kernel of the function, and then we either explicitly write, or sometimes fudge a little bit, that x has to be greater than zero, if it's clear from the context that that has to be the case. In this case it would be clear from the context. Is this a valid density?
Could we model survival time after diagnosis with this density?
Well, first of all, we know that the function is positive, because e raised to any power is always positive. Then let's just check whether or not it integrates to one. We want the integral from minus infinity to plus infinity, but like we said, all of the meat of the distribution starts at zero and goes to infinity, so let's just take the integral from zero to infinity of f(x) dx. In this case the antiderivative is negative e to the negative x over five, which, when evaluated from zero to infinity, yields one.
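If you'd rather not do the calculus, here is a quick numerical check in R, a sketch of my own mirroring that integral:

    ## The exponential density from the slide, integrated numerically
    f <- function(x) exp(-x / 5) / 5
    integrate(f, 0, Inf)$value   # returns 1, matching the antiderivative argument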
Let's go through an example of using this probability density function to assign probabilities. Imagine we were to model this population as if it followed this specific exponential probability distribution, and imagine someone asked us the question, "What's the probability that a randomly selected person from this population survives more than six years?" So if X is the conceptual value that a random person's survival time takes, we want to know: what's the probability that X is greater than or equal to six, as represented by this probability statement?
Remember again the golden rule for probability density functions: areas under the curve correspond to probabilities. So if we want the probability that X is greater than six, we want the integral from six to infinity of the probability density function, and you can go through the calculus here to find that it works out to be about 30%. In the statistical programming language R, you can do this automatically; it does the integral for you, and you just write pexp(6, 1/5, lower.tail = FALSE) for the fact that we want the probability of six or larger. The 1/5 is the rate parameter corresponding to the five that you see in the exponential density. lower.tail = FALSE means that we want the probability of being larger than six rather than the probability of being smaller than six. So lower.tail = TRUE will give you six or smaller, and lower.tail = FALSE will give you six or larger.
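Here is that calculation both ways in R, as a sketch of my own:

    ## P(X > 6) as the integral of the density, and via pexp directly
    f <- function(x) exp(-x / 5) / 5
    integrate(f, 6, Inf)$value          # about 0.301
    pexp(6, 1/5, lower.tail = FALSE)    # same answer, about 0.301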
I want to elaborate on that point, by the way. For a continuous random variable, the probability that it takes any specific value is in fact zero.
Now, that seems strange, but it's true. Remember, areas under probability density functions correspond to probabilities. So what's the area of a line? It's zero. Now, you might say that doesn't make any sense at all: specific values have to get probabilities, because we see specific values when we actually observe variables. The point is that our probability density function is a model, and it is defined for continuous random variables. Continuous means measured to infinite precision. When we observe things, we never measure them to infinite precision; we only ever measure them to finite precision. And probability density functions are perfectly happy to say what the probability is that X is between 5.99 and 6.01, assigning a perfectly valid probability to that. But the probability that X is exactly six is zero, because remember, exactly six means 6.0 followed by an infinite trail of 0s, or 5.99 followed by an infinite trail of 9s.
Either way, that's the idea behind what probability density functions are getting at: they're modeling truly continuous random variables.
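A quick sketch of my own to make this concrete with our survival example: an interval around six gets a positive probability, while the single point six gets none.

    ## P(5.99 < X < 6.01) is small but positive under the exponential model
    pexp(6.01, 1/5) - pexp(5.99, 1/5)   # about 0.0012
    ## P(X = 6) is a zero-width interval, so its probability is exactly zero
    pexp(6, 1/5) - pexp(6, 1/5)         # 0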
So just remember, when we observe data, we of course measure it with finite precision, but our continuous model is exactly that: a model. We find it far more useful in many circumstances to model random variables as if they were truly continuous than to account for all the potential specific values they could take.
So in this specific example, a person will only measure how long someone survives to the year, maybe to the month, maybe to the day, maybe to the hour, the minute, or the second, but probably not much further than that. So we're only going to measure to finite precision. Nonetheless, it's still much more useful to model survival time as if it were continuous, because we don't want to have to assign probabilities to every single value; we want to assign a general function. And that's why continuous random variables are so intrinsically useful.
So the belabored point I'm trying to make here, by the way, is that whether you write the probability of X being greater than or equal to six, or the probability of X being strictly greater than six, in this case it doesn't change the calculation whatsoever. You get 0.301 either way. And so it also doesn't make a difference in pexp for our example: whether you're thinking of the lower tail or the upper tail as including six or not, the calculation doesn't care about that.
However, for discrete random variables, it makes a big difference, right? Because specific values have actual probabilities assigned to them; a die can take the value one, two, three, four, five, or six. So in R, if you are using these probability functions, pexp gives probabilities from the exponential distribution, pbinom gives probabilities from the binomial distribution, ppois gives probabilities from the Poisson distribution, and pgamma gives probabilities from the gamma distribution; R follows that naming rule pretty neatly.
If it's a discrete random variable, you have to be careful about whether or not you're including the six. For a continuous random variable, you can be very sloppy about it.
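To make that concrete, here's a sketch of my own contrasting the two cases; the Poisson mean of five is just an arbitrary illustrative choice:

    ## Discrete case: the cutoff matters
    ppois(6, lambda = 5)   # P(X <= 6), includes six
    ppois(5, lambda = 5)   # P(X <= 5), i.e. P(X < 6): a different number
    ## Continuous case: P(X <= 6) and P(X < 6) are the same
    pexp(6, 1/5)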
So here I'm just depicting the area that we're calculating. This grey area is the probability that the survival time falls between six and infinity. It is simply the integral that we're actually calculating, and I'll put the R code to generate exactly this figure in the files for the course.