So, now we're getting into Bayesian networks. And we're finally going to start talking
about the actual representations that are going to be the bread and butter of what
we're going to describe in this class. And, so we're going to start by defining
the basic semantics of a Bayesian network and how it's constructed from a set of
factors. So let's start by looking at a running
example that will accompany us throughout a large part of at least the first part of
this course and this is what we call the student example.
So in the student example, we have a student who's taking a class for a
grade, and we're going to use the first letter of the word to denote the name
of the random variable just like we did in previous examples.
So, here, the random variable is going to be G.
Now the grade of the student obviously depends on how difficult the course that
he or she is taking and the intelligence of the student.
So that gives us in addition to G, we also have D and I. And we're going to
add a couple of extra random variables just to make things a little bit more
interesting. So we're going to assume that the student has taken the SAT,
so he may or may not have scored well on the SAT,
so that's another random variable, S. And then finally, we also have, in this
case of the disappearing line, the recommendation letter, L, that the
student gets from the instructor of the class.
Okay? And we're going to grossly oversimplify this problem by basically
binarizing everything except for grades. So everything has only two values
except for grade, which has three, and this is only so I can write things compactly.
This is not a limitation of the framework; it's just so that the
probability distributions don't become unmanageable.
Okay. So now, let's think about how we can construct the dependencies in
this probability distribution. Okay.
So, let's start with the random variable grade.
I'm going to put G in the middle and ask what the grade of the student
depends on. And it seems, just, you know, from a
completely intuitive perspective, it seems clear that the grade of the student
depends on the difficulty of the course and on the intelligence of the student.
And so we already have a little baby Bayesian network with three random
variables. Let's now take the other random variables
and introduce them into the mix. So, for example, the SAT score of the
student doesn't seem to depend on the difficulty of the course or on the grade
that the student gets in the course. The only thing it's likely to depend on
in the context of this model is the intelligence of the student.
And finally, caricaturing the way in which instructors write recommendation
letters. We're going to assume that the quality of
the letter depends only on the student's grade,
but the professor is teaching, you know, 600 students, or maybe 100,000 online
students, and so the only thing one can say about the student comes from looking
at their actual grade record. And so, regardless of anything else, the
quality of the letter depends only on the grade.
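The dependency structure described so far is just a directed acyclic graph. As a minimal sketch, it can be written down as a map from each variable to its parents (the dictionary encoding here is just one convenient choice, not part of the lecture):

```python
# Parent sets for the student network: an edge X -> Y means Y's CPD
# will be conditioned on X. This encoding is illustrative.
parents = {
    "D": [],          # Difficulty has no parents
    "I": [],          # Intelligence has no parents
    "G": ["I", "D"],  # Grade depends on intelligence and difficulty
    "S": ["I"],       # SAT score depends only on intelligence
    "L": ["G"],       # Letter quality depends only on the grade
}

# A quick sanity check that the graph is acyclic, as a Bayesian
# network's graph must be.
def is_dag(parents):
    visited, in_stack = set(), set()
    def visit(v):
        if v in in_stack:      # found a back edge -> cycle
            return False
        if v in visited:
            return True
        in_stack.add(v)
        ok = all(visit(p) for p in parents[v])
        in_stack.discard(v)
        visited.add(v)
        return ok
    return all(visit(v) for v in parents)

print(is_dag(parents))  # True
```

Adding the hypothetical I-to-D edge mentioned above would simply mean listing "I" among D's parents; the graph would still be acyclic.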
Now, this is a model of the dependencies, but it's only one model that one can
construct of these dependencies. So, for example, I could easily imagine
other models, for instance, ones that have students who
are brighter taking harder courses in which case, there might be potentially an
edge between I and D. But we're not going to use that model,
so let's erase that because we're going to stick with a simpler model for the
time being. But, this is only to highlight the fact
that a model is not set in stone, it's a representation of how we believe the
world works. So, here is the model drawn out a little
bit more nicely than the picture before.
And now let's think about what we need to do in order to turn this into a
representation of a probability distribution, because right now, all it is is a
bunch of nodes stuck together with edges. So how do we actually get this
to represent a probability distribution?
And the way which we're going to do that is we're going to annotate each of the
nodes in the network with what's called a CPD.
So, just as a reminder, a CPD, using the abbreviation we previously defined,
is a conditional probability distribution.
Each of these is a CPD; since we have five nodes, we have five CPDs.
Now, if you look at some of these CPDs,
they're kind of degenerate, so for example, the difficulty CPD isn't
actually conditioned on anything. It's just an unconditional probability
distribution that tells us, for example, that courses are only 40%
likely to be difficult and 60% likely to be easy.
Here is a similar unconditional probability distribution for intelligence.
Now this gets more interesting when you look at the actual conditional
probability distributions. So here, for example, is a CPD that we've
seen before: this is the CPD over the grades A, B, and C.
So, here is the conditional probability distribution that we've already seen
before for the probability of grade given intelligence and difficulty, and we've
already discussed how each of these rows necessarily sums to one, because each
row is a probability distribution over the variable grade. And we have two other
CPDs here: in this case, the probability of SAT
given intelligence and the probability of letter given grade.
So, just to write this out completely, we have P of D,
P of I, P of G given I, D, P of L given G, and P of S given I.
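Written as code, these five CPDs are just lookup tables keyed by the parent assignment. In the sketch below, only the difficulty numbers (0.6 easy, 0.4 hard) come from the lecture; all other entries are made-up placeholders, chosen only so that every row sums to one:

```python
# CPDs for the student network. P(d0) = 0.6 / P(d1) = 0.4 is from the
# lecture; every other number here is an illustrative placeholder.
cpd_D = {(): {"d0": 0.6, "d1": 0.4}}
cpd_I = {(): {"i0": 0.7, "i1": 0.3}}            # placeholder values
cpd_G = {                                        # P(G | I, D), placeholder rows
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}
cpd_S = {                                        # P(S | I), placeholder rows
    ("i0",): {"s0": 0.95, "s1": 0.05},
    ("i1",): {"s0": 0.20, "s1": 0.80},
}
cpd_L = {                                        # P(L | G), placeholder rows
    ("g1",): {"l0": 0.10, "l1": 0.90},
    ("g2",): {"l0": 0.40, "l1": 0.60},
    ("g3",): {"l0": 0.99, "l1": 0.01},
}

# Each row of a CPD is a distribution over the node's variable,
# so, as discussed, each row must sum to 1.
for cpd in (cpd_D, cpd_I, cpd_G, cpd_S, cpd_L):
    for row in cpd.values():
        assert abs(sum(row.values()) - 1.0) < 1e-9
```

Note how the degenerate CPDs for D and I are conditioned on the empty parent tuple: an unconditional distribution is just a CPD with no parents.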
And that now is a fully parameterized Bayesian network and what we'll show next
is how this Bayesian network produces a joint probability distribution over these
five variables. So, here are my CPDs and what we're going
to define now is the chain rule for Bayesian networks and that chain rule
basically takes all of these little CPDs and
multiplies them together, like that.
Now, before we think of what that means, let us first note that this is actually a
factor product in exactly the same way that we just defined.
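To make the connection concrete, here is a minimal sketch of that factor product for two toy factors with overlapping scopes; the variable names, binary domains, and numbers are all illustrative, not from the lecture:

```python
from itertools import product

def factor_product(scope1, f1, scope2, f2):
    """Multiply two factors; the result's scope is the union of the scopes.
    Each factor maps an assignment tuple (one value per scope variable)
    to a nonnegative number. Domains here are binary for simplicity."""
    scope = list(scope1) + [v for v in scope2 if v not in scope1]
    domains = {v: (0, 1) for v in scope}
    result = {}
    for assignment in product(*(domains[v] for v in scope)):
        full = dict(zip(scope, assignment))
        # Each input factor is evaluated on the slice of the joint
        # assignment that falls inside its own scope.
        result[assignment] = (
            f1[tuple(full[v] for v in scope1)]
            * f2[tuple(full[v] for v in scope2)]
        )
    return scope, result

# Two toy factors sharing the variable B: phi1(A, B) and phi2(B, C).
phi1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0}
phi2 = {(0, 0): 0.5, (0, 1): 0.7, (1, 0): 0.1, (1, 1): 0.2}

scope, phi = factor_product(["A", "B"], phi1, ["B", "C"], phi2)
print(scope)           # ['A', 'B', 'C']
print(phi[(0, 1, 0)])  # phi1(0, 1) * phi2(1, 0) = 0.8 * 0.1
```

Chaining this operation across the five CPDs is exactly what produces the big factor whose scope is all five variables.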
So here, we have five factors, they have overlapping scopes and what we
end up with is a factor product that gives us a big, big factor whose scope is
five variables. So what does that translate into when we
apply the chain rule for Bayesian networks in the context of the particular
example? So let's look at this particular
assignment and remember there's going to be a bunch of these different assignments
and I'm just going to compute the probability of this one.
So the probability of d0, i1, g3, s1, and l1, well, so the first thing we need is
the probability of d0 and the probability of d0 is 0.6.
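Continuing that computation in code: the chain rule for Bayesian networks multiplies one CPD entry per variable, each conditioned on the values its parents take in the assignment. Only P(d0) = 0.6 is from the lecture as transcribed here; the remaining CPD entries are hypothetical placeholders:

```python
# Chain rule for the student network:
#   P(D, I, G, S, L) = P(D) * P(I) * P(G | I, D) * P(S | I) * P(L | G)
# Only P(d0) = 0.6 comes from the lecture; the rest are placeholders.
p_D = {"d0": 0.6, "d1": 0.4}
p_I = {"i0": 0.7, "i1": 0.3}
p_G = {("i1", "d0"): {"g1": 0.9, "g2": 0.08, "g3": 0.02}}  # only the row we need
p_S = {("i1",): {"s0": 0.2, "s1": 0.8}}
p_L = {("g3",): {"l0": 0.99, "l1": 0.01}}

# The particular assignment from the lecture: d0, i1, g3, s1, l1.
d, i, g, s, l = "d0", "i1", "g3", "s1", "l1"

joint = p_D[d] * p_I[i] * p_G[(i, d)][g] * p_S[(i,)][s] * p_L[(g,)][l]
print(joint)  # 0.6 * 0.3 * 0.02 * 0.8 * 0.01
```

One such product exists for every one of the 2 * 2 * 3 * 2 * 2 = 48 joint assignments; summed over all of them, they form the full joint distribution.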