0:01

So over the course of the last few lectures, we've talked about a number of different inference strategies and how they might be applied to a range of different graphical models. One question that comes up is whether the same strategies can be used for the kinds of template-based models that we defined earlier in the course, models that are defined via, say, dynamic Bayesian networks or via plate models, or one of the other representations that uses repeated structure. So, as we'll see, the answer to that is both yes and no. That is, the same strategies can be applied, but there are also some interesting new challenges that arise from this.

0:39

So, first, let's remind ourselves of how, for example, a dynamic Bayes net is specified. As a reminder, a dynamic Bayes net is specified using an initial time 0 distribution, which is usually a Bayesian network, as well as the 2-TBN that tells us the transition model. And it's certainly the case that one can take those two pieces and construct an unrolled or ground Bayesian network. And once we've done that, that ground Bayesian network is a Bayesian network just like any other. And so whatever inference techniques we have developed for answering queries in any Bayesian network can be used in this context as well. So, specifically, once we unroll the DBN for a given trajectory length, we can run your favorite inference algorithm over the ground network to answer questions such as: what is the location of the car at time 2, given whatever evidence we might like to give it, so observations up to time 2 or observations for the entire duration of the sequence.
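To make the unrolling concrete, here is a minimal sketch, with hypothetical numbers not taken from the lecture, of a 2-TBN with one binary state variable and one binary observation. A query on the ground network is answered by brute-force enumeration, just as for any other Bayesian network:

```python
import itertools

import numpy as np

# Hypothetical 2-TBN: binary state S, binary observation O.
P0 = [0.5, 0.5]                       # time-0 distribution P(S_0)
Ptr = [[0.9, 0.1], [0.2, 0.8]]        # transition model P(S_t+1 | S_t)
Pob = [[0.8, 0.2], [0.3, 0.7]]        # observation model P(O_t | S_t)

def query_state(t, obs):
    """P(S_t | o_1..o_T) computed on the unrolled network, T = len(obs)."""
    T = len(obs)
    post = np.zeros(2)
    for states in itertools.product([0, 1], repeat=T + 1):  # S_0 .. S_T
        p = P0[states[0]]
        for k in range(T):
            p *= Ptr[states[k]][states[k + 1]]      # S_k -> S_k+1
            p *= Pob[states[k + 1]][obs[k]]         # S_k+1 -> O_k+1
        post[states[t]] += p
    return post / post.sum()

print(query_state(2, [0, 0]))   # the state at time 2 given o_1 and o_2
```

Enumeration touches every assignment of the unrolled network, so its cost grows exponentially with the trajectory length; that is part of what motivates the recursive tracking discussed below.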

2:18

What are some of the challenges, though? Part of the difficulty, or one of the new dimensions offered by some of these models, is the fact that they pose new inference problems. This is particularly true in the context of temporal models, where we might often be interested in what you might call belief state tracking, that is, keeping track of the state of the system as it evolves.

Now in some ways, this is just a traditional probabilistic inference task, because it corresponds to asking what is our probability distribution over the state at time t, given the observations that the agent has had access to up to time t. So what we see here as o(1:t) is just a shorthand for o1, o2, up to ot. And so this is an inference task that can be answered using standard probabilistic inference techniques over the unrolled network. But there is a challenge that needs to be addressed if we want to keep track of this probability distribution over the course of a long trajectory without having a potentially unboundedly large network that we have to keep running inference over. So it turns out, fortunately, that one can do this in a dynamic way, without keeping track of the huge unrolled network at all points in time. And this comes directly out as a consequence of the Markov properties of the graphical model.

So, specifically, we're going to do this as a two-stage process, where the first thing we're going to compute is this sigma(dot t + 1). The position of the dot indicates that although it's a distribution over the state at time t + 1, it's a distribution that doesn't take into account the time t + 1 observation. And so this is defined to be, as written here, the probability of the state at time t + 1, given the observations up to time t. And so now we can do fairly straightforward manipulations of this probabilistic expression. So the first thing is we inject S(t) into the right-hand side of the conditioning bar and sum out over it, so this is now the probability of S(t + 1), given S(t) and the observations up to time t, times the probability of S(t) given the observations up to time t. And now we can apply independencies that arise from the structure of the graphical model, and specifically the fact that the state at time t + 1 is independent of everything that occurred before, given the state at time t, which is going to allow us to simplify the first term to the transition model, the probability of S(t + 1) given S(t).

5:08

And the second term, the probability of S(t) given o(1:t), is just the previously computed belief state, as it's called, the sigma(t) that we are trying to keep track of. And so this allows us to take sigma(t) and produce sigma(dot t + 1).
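Written out in the notation above, the prediction step we just derived is (the last equality uses the Markov property and the definition of sigma(t)):

```latex
\sigma^{(\cdot\, t+1)}\bigl(S^{(t+1)}\bigr)
  = P\bigl(S^{(t+1)} \mid o^{(1:t)}\bigr)
  = \sum_{S^{(t)}} P\bigl(S^{(t+1)} \mid S^{(t)}, o^{(1:t)}\bigr)\,
      P\bigl(S^{(t)} \mid o^{(1:t)}\bigr)
  = \sum_{S^{(t)}} P\bigl(S^{(t+1)} \mid S^{(t)}\bigr)\,
      \sigma^{(t)}\bigl(S^{(t)}\bigr)
```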

The second step is now to account for the observation model at time t + 1. So, let's look at what that would involve. We have sigma(dot t + 1), which we defined on the previous slide, and we want to condition it on the new piece of evidence, the observation at time t + 1, and from that derive the belief state, sigma(t + 1).

And so here we're just going to take that definition, the probability of S(t + 1) given all of the observations, the ones up through t and the new one. And we're going to apply Bayes' rule to reformulate this, and that gives us the probability of o(t + 1), given S(t + 1) and the previous observations, times the probability of S(t + 1) given o(1:t), divided by the probability of o(t + 1) given o(1:t). This is a straightforward application of Bayes' rule, and once again, we can now examine each of the terms in this expression and see what it corresponds to. Looking first at the first term in the numerator, we can see that this is the probability of the observation at time t + 1, given the state at time t + 1 and a bunch of things that happened in the past. And once again, conditional independence tells us that the stuff that happened in the past is irrelevant and can be removed.
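The two steps together can be sketched in a few lines of code; the transition and observation matrices below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical transition and observation models over a binary state.
trans = np.array([[0.9, 0.1],   # P(S_t+1 | S_t), rows indexed by S_t
                  [0.2, 0.8]])
emit = np.array([[0.8, 0.2],    # P(O | S), rows indexed by S
                 [0.3, 0.7]])

def belief_update(sigma_t, o_next):
    # Step 1 (prediction): inject S_t and sum it out -> sigma(dot t+1)
    sigma_dot = trans.T @ sigma_t
    # Step 2 (Bayes' rule): weight by P(o_t+1 | S_t+1); the normalization
    # constant is exactly P(o_t+1 | o_1:t)
    unnorm = emit[:, o_next] * sigma_dot
    return unnorm / unnorm.sum()

sigma = np.array([0.5, 0.5])    # time-0 belief state
for o in [0, 0, 1]:             # a made-up observation sequence
    sigma = belief_update(sigma, o)
print(sigma)
```

Note that only the current belief is ever stored, so the memory cost stays fixed no matter how long the trajectory gets; the unrolled network never materializes.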

8:08

Let's now look at how this might manifest in the context of an actual belief state tracking problem. So, here is a situation where this model might be applied, and let's first explain what we're seeing before we see a demo. What you see in the green circle is the robot's true position, which is not known to the robot, because the robot is trying to localize itself. These blue lines are the observed sonar readings that the robot gets at each point in the process. And you can see that the returned length of the sonar readings corresponds roughly to the distance of the robot from the wall. But there is some noise here. So, for example, sometimes, for whatever reason, the sonar appears to go through the wall, even though it should have been giving us a reading that is considerably closer, because it should have been reflected from the surface. So sonars are quite noisy, and the visualization that we see reflects that. The red dots that you see are a visualization of the robot's beliefs over where it is. In the initial stage this is a uniform probability distribution, because the robot doesn't know where it is. And as we'll see, this gets clumpier and clumpier over time as the robot's belief state over where it's likely to be becomes more and more definitive. So let's look at what this looks like over time. What we're going to see is the robot's beliefs as it gets more and more observations.
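A toy one-dimensional analogue of this demo (the corridor layout and all the numbers are made up for illustration) shows the same belief-sharpening behavior as sonar-style readings arrive:

```python
import numpy as np

# Toy 1-D corridor: 10 cells, the robot moves one cell right per step
# and gets a distance-to-wall reading with a Gaussian noise model.
n = 10
belief = np.full(n, 1.0 / n)            # initially uniform: robot is lost

def move_right(b):
    return np.roll(b, 1)                # deterministic motion, wraps around

def sense(b, reading, noise=1.0):
    # Gaussian sensor model around each cell's true distance to the wall
    positions = np.arange(n)
    lik = np.exp(-0.5 * ((reading - positions) / noise) ** 2)
    b = b * lik
    return b / b.sum()

true_pos = 3
for _ in range(5):
    true_pos = (true_pos + 1) % n
    belief = move_right(belief)
    belief = sense(belief, true_pos)    # noiseless readings in this sketch
print(belief)                           # the belief clumps around true_pos
```

Each sense step multiplies the predicted belief by the observation likelihood and renormalizes, which is exactly the two-step update derived above; repeated consistent readings are what make the distribution progressively clumpier.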

10:25

The second challenge associated with these template-based models is computational. Sure, we can produce an unrolled Bayesian network and compute a posterior over any subset of the variables using standard inference techniques. But one of the consequences of specifying models via a fairly small template is that the resulting ground probabilistic graphical models can be very large. And large probabilistic graphical models can pose new inference challenges in terms of our ability to scale inference to models of that size.

So, specifically, if we look at the model that arises from unrolling a DBN, and think back to some of the analysis that we did regarding the complexity of probabilistic inference for a particular probabilistic graphical model, we can ask what happens if we want to run exact inference, say, a clique tree algorithm, over this unrolled network. Say we want a clique tree where the time 0 variables are in one clique and the time t variables, for some future t, are in some other clique, so that they're not all together in one big clique. Then the minimal sepset needs to separate these variables over here, say the blue variables, the time 0 variables, from the green variables. And if you think about what separation imposes on us in this setting, we can see that the only way to separate, for example, these blue variables over here from the green variables over here is to put in the sepset at least the persistent variables, that is, the ones where there is an edge from the variable at time t to its incarnation at time t + 1. And so the minimal sepset in this context, the smallest one that we can construct that would allow us to have different cliques for, say, time 0 and time 2, is something that involves all of these variables in the middle. And so that can potentially involve a significant computational cost for exact inference, especially when we have a large number of persistent variables.

A different way of understanding this is via the concept of entanglement, if we're thinking of belief state tracking. So if our goal is to maintain a belief state over, say, the variables at time 3, and we're trying to think about how we can maintain this probability distribution, the sigma at time 3, in a way that doesn't involve maintaining a full explicit joint distribution over the time 3 variables, we quickly realize that, really, we have no choice, because there are no conditional independencies within time 3. Now you might say, how is that? This is a nicely structured model; why wouldn't there be conditional independencies? Well, look at what's going on here. Can we say that the weather at time 3 is conditionally independent of the failure at time 3? Locally, they're not connected to each other within the time slice, but there's certainly an active trail between them that goes, for example, like this: from weather at time 3, to weather at time 2, to weather at time 1, to failure at time 2, to failure at time 3. And so this is an active trail between these two variables, which means that, when you consider only the variables at time 3, they are not conditionally independent of each other, given any of the other variables at time 3. And so this entanglement process occurs very rapidly over the course of tracking a belief state in a dynamic Bayesian network, which eventually means that the belief state, if you want to maintain the exact belief state, is just fully correlated, in most cases.
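This entanglement can be checked numerically. The sketch below uses hypothetical CPDs for a weather variable W and a failure variable F with no edge inside a time slice, and shows that the joint belief at time 3 no longer factorizes into its marginals:

```python
import itertools

import numpy as np

# Hypothetical CPDs: edges W_t -> W_t+1, F_t -> F_t+1, and W_t -> F_t+1,
# but no edge between W and F inside a time slice.
pW0 = [0.5, 0.5]                       # P(W_0)
pF0 = [0.95, 0.05]                     # P(F_0)
pW = [[0.8, 0.2], [0.3, 0.7]]          # P(W_t+1 | W_t)
pF1 = {(0, 0): 0.05, (0, 1): 0.8,      # P(F_t+1 = 1 | W_t, F_t)
       (1, 0): 0.3, (1, 1): 0.9}

joint = np.zeros((2, 2))               # belief over (W_3, F_3)
for vals in itertools.product([0, 1], repeat=8):
    w, f = vals[0::2], vals[1::2]      # W_0..W_3 and F_0..F_3
    p = pW0[w[0]] * pF0[f[0]]
    for t in range(3):
        p *= pW[w[t]][w[t + 1]]
        q = pF1[(w[t], f[t])]
        p *= q if f[t + 1] == 1 else 1 - q
    joint[w[3], f[3]] += p

marg_w, marg_f = joint.sum(axis=1), joint.sum(axis=0)
print(np.abs(joint - np.outer(marg_w, marg_f)).max())  # clearly nonzero
```

If W_3 and F_3 were independent given the evidence, the printed number would be zero; instead, the time 3 belief is entangled even though the two variables share no intra-slice edge.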

14:57

And so this is a computational consequence of the fact that we have a very large Bayesian network. And if you want to think about just how large it can get, we can go back to the real vehicle tracking example that we talked about in a previous lecture, and note that all of these variables over here are persistent variables. And so belief state tracking, if it were to be done exactly, would have to maintain a full joint distribution over all of these variables, which is likely to be intractable, certainly if you want real-time performance, and probably intractable even without that constraint. And so there has been a lot of work on approximate methods in the context of, say, temporal models, because of the computational issues that they raise.

16:18

Consider, for example, the bipartite MRF, which is what we really have here, where the variables on the left are connected to the variables on the right. And in general, assuming that the graph is fairly dense, we mentioned before that for exact inference, the lowest cost that we can expect, if you have m variables on the left and n variables on the right, is exponential in the minimum of m and n, so the minimum of the number of courses and the number of students. And again, you can come up with cases where this is going to be lower than that, but in general, for a bipartite MRF like this, that's what you're going to get. And assuming that the university has a reasonably high number of courses and presumably even more students, this can become intractable very quickly, giving rise to the need for approximate inference methods.

18:03

A second example: we can improve the classification of web pages by considering the connections between the web pages. So specifically, by considering edges, in this case undirected edges, between the categories of pages that are linked to each other. So, for example, we see, and this is a model that was learned from data, that students are quite likely to point to faculty, and faculty are somewhat less likely to point to students. And faculty are even less likely to point to other faculty. And so that gives us some information about the correlations between the labels of web pages that are linked to each other, and it gives rise to improvements in performance that are actually quite considerable: for example, a reduction in error from 18% to a little bit over 12%.

To summarize, inference in template and temporal models can be done in principle by unrolling the model into a ground network and using standard inference methods. However, temporal models specifically raise new inference tasks, such as real-time belief state tracking, which require a certain adaptation in our methods.