0:00

We previously defined the MAP inference problem, which is that of finding the highest probability assignment to a graphical model. And we talked about particular classes of models for which the MAP problem was tractable. One was the class of models with sufficiently low tree width, so that one could use techniques such as variable elimination or clique tree methods. And then we also talked about a variety of other special-purpose classes where one could do tractable inference. What we're going to talk about now is a general-purpose algorithm that can be used to solve any MAP problem, keeping in mind, of course, that the MAP problem is an NP-hard problem, and so you're not going to get a polynomial time algorithm that one can utilize in the context of any problem.

So, the class of methods that we're going to talk about is called dual decomposition. It's derived from the increasingly prevalent view of the MAP inference problem as an optimization problem, building on techniques from the field of optimization theory.

So, let's go ahead and first reformulate the problem in a way that makes it amenable to this kind of analysis; this is just a convenience. Here we're going to assume that the objective theta that we are trying to optimize (remember, we reformulated this as a max-sum as opposed to a max-product, which is a convenient reformulation) is comprised of the sum of two different kinds of factors. The first is singleton factors, which are over the scope of a single variable Xi, and the second is larger factors that are over the scope of multiple variables. Clearly, one can fold the singleton factors into some larger factor that contains the variable in its scope, but it turns out to be convenient to keep them in the model for reasons that will become clear in a minute.

So now, our goal is to find the value of the MAP assignment; for the moment, we'll set aside the question of how to find an actual assignment, and we'll return to that in a little bit. Imagine that our goal is just to figure out the value of the highest probability assignment X, and so we're trying to find the X that maximizes this total summation.
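To make this concrete, here is a minimal sketch of the max-sum objective over a tiny model with two binary variables, two singleton factors, and one larger pairwise factor. All names and numbers below are invented for illustration; the real objective would of course have many variables and factors.

```python
import itertools

# Singleton factors theta_i(Xi) over single binary variables (invented values).
theta_1 = {0: 0.0, 1: 1.0}
theta_2 = {0: 0.5, 1: 0.0}
# A larger factor theta_F over the scope (X1, X2).
theta_F = {(0, 0): 2.0, (0, 1): 0.0,
           (1, 0): 0.0, (1, 1): 1.5}

def score(x1, x2):
    """Total theta(x): the sum of the singleton and larger factors."""
    return theta_1[x1] + theta_2[x2] + theta_F[(x1, x2)]

# The MAP value: the maximum of the total summation over joint assignments.
map_value = max(score(x1, x2)
                for x1, x2 in itertools.product((0, 1), repeat=2))
print(map_value)  # -> 2.5
```

Brute-force enumeration like this is only feasible because the example is tiny; the rest of the lecture is about avoiding exactly this exponential enumeration.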

Â 2:29

But one thing that we could do, which would be wrong, is to say: let's forget about the fact that we're trying to find a single joint assignment, and instead think about each problem with blinders on. So we're going to look at each theta i(Xi) by itself and find the Xi that optimizes that factor, and we're going to do the same for the larger factors. Now, that of course is going to be a very poor solution, because the Xi that optimizes one theta i is going to be completely inconsistent with the Xi that is selected by one of the larger factors that contains Xi. So you're going to get a completely incoherent joint assignment. In fact, it's not going to be a joint assignment at all; it's going to be a bunch of different assignments that don't agree with each other.

So, what dual decomposition does is try the same kind of divide and conquer, but in a way that forces these local decision-making problems to agree with each other.
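The "blinders on" failure mode is easy to demonstrate. In this small sketch (with the same invented numbers as before), the singleton factor on X1 and the larger factor containing X1 each pick their own winner, and those winners disagree:

```python
import itertools

# Invented factor values for illustration.
theta_1 = {0: 0.0, 1: 1.0}
theta_F = {(0, 0): 2.0, (0, 1): 0.0,
           (1, 0): 0.0, (1, 1): 1.5}

# Each local problem optimizes its own factor, ignoring all the others.
x1_from_singleton = max((0, 1), key=lambda v: theta_1[v])
x1_from_F, x2_from_F = max(itertools.product((0, 1), repeat=2),
                           key=lambda xy: theta_F[xy])

# The singleton slave wants X1 = 1, the pairwise factor wants X1 = 0:
# the local choices do not assemble into a coherent joint assignment.
print(x1_from_singleton, x1_from_F)  # -> 1 0
```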

Â 7:00

And it is only a function of lambda, because it's no longer a function of X; we've maximized over the X. So you're going to get a different value of this function depending on the choice of the penalty parameters. This function is an upper bound on the value of theta for any value of lambda. Now let's understand why that is. We've seen in this line over here that this is actually equal to that, and we've shown that everything cancels here. What I've done here is, over here I'm forcing the single X to maximize this entire expression together, and here, I'm letting each term be optimized separately. And that gives me more degrees of freedom, because here I had to pick the same X across the board, and here, this one gets to pick one value of X and this one gets to pick another. And so, since the assignment where they all agree is one of the assignments that can be considered in this optimization problem over here, the decoupled maximization can only do at least as well, which is why it gives an upper bound.
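The upper-bound property can be checked numerically. In the sketch below (the model, its numbers, and the particular sign convention for the penalties are all invented for illustration), the singleton slaves get +lambda and the larger slave gets -lambda, so the penalties cancel on any consistent assignment, and the sum of the independently maximized slave objectives is at least the true MAP value for every sampled lambda:

```python
import itertools
import random

# Invented factor values for a two-variable model.
theta_1 = {0: 0.0, 1: 1.0}
theta_2 = {0: 0.5, 1: 0.0}
theta_F = {(0, 0): 2.0, (0, 1): 0.0,
           (1, 0): 0.0, (1, 1): 1.5}

def true_map():
    """Exact MAP value by brute force over joint assignments."""
    return max(theta_1[x1] + theta_2[x2] + theta_F[(x1, x2)]
               for x1, x2 in itertools.product((0, 1), repeat=2))

def dual_bound(lam1, lam2):
    """Sum of independently maximized slave objectives for penalties lam1, lam2."""
    slave_1 = max(theta_1[v] + lam1[v] for v in (0, 1))
    slave_2 = max(theta_2[v] + lam2[v] for v in (0, 1))
    slave_F = max(theta_F[(a, b)] - lam1[a] - lam2[b]
                  for a, b in itertools.product((0, 1), repeat=2))
    return slave_1 + slave_2 + slave_F

random.seed(0)
m = true_map()
for _ in range(100):
    lam1 = {v: random.uniform(-2, 2) for v in (0, 1)}
    lam2 = {v: random.uniform(-2, 2) for v in (0, 1)}
    # Decoupling can only relax the problem, so the bound always holds.
    assert dual_bound(lam1, lam2) >= m - 1e-9
print("upper bound holds for all sampled lambdas")
```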

Â 8:39

So, introducing some notation which we'll use in the rest of this presentation: we're going to call the function that the i slave is optimizing theta i lambda, and the function that is being optimized by the F slave theta F lambda.

So let's take an example to make this a little bit more concrete, because this was a little bit abstract. Let's go back to our example of a four-way loop, where we have X1, X2, X3, and X4, and we're going to have these four pairwise potentials, which we're going to denote with letters this time to disambiguate indices a little bit. So we have theta F(X1, X2), theta G(X2, X3), and theta H and theta K.

So now, how is this decomposition going to work here? Let's assume for the moment that we're going to decompose this over edges. So our factors are going to be, for example, the edge X1, X2. Now, what is the optimization problem that this factor chooses? Well, it has its own potential, theta F(X1, X2), and then it's going to have these little costs that are going to try and get it to agree with the one slave on X1 and with the two slave on X2. We similarly have, for X1, X4, the K slave (so this is the F slave and this is the K slave), which has its own potential, theta K, and two penalty terms that are going to try to make it agree with the one slave on X1 and with the four slave on X4. And we have exactly the same form for the other two slaves, the G slave and the H slave.

In addition, we have slaves for the individual variables: the one, two, three, four slaves. So here is the one slave, and notice that the one slave has its own potential, theta 1(X1), and at the same time it has these two terms that are trying to make it agree on the one side with F, which is over here, so this is because of this, and on the other side with K, because of that.
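The slave objectives for the four-way loop can be written down directly. In this sketch (all numeric values are randomly invented), each edge slave subtracts a penalty for each variable in its scope, and each singleton slave adds the matching penalties back; the assertion checks the cancellation property the construction relies on, namely that at any single consistent assignment the lambda terms sum to zero and the slaves add back up to the original theta(x):

```python
import itertools
import random

random.seed(1)
V = (0, 1)
# Edge factors F=(X1,X2), G=(X2,X3), H=(X3,X4), K=(X4,X1) of the four-way loop.
edges = {"F": (1, 2), "G": (2, 3), "H": (3, 4), "K": (4, 1)}
theta_i = {i: {v: random.random() for v in V} for i in (1, 2, 3, 4)}
theta_e = {e: {xy: random.random() for xy in itertools.product(V, V)}
           for e in edges}
# One penalty lam[e][i][v] per (edge slave, variable in its scope, value).
lam = {e: {i: {v: random.uniform(-1, 1) for v in V} for i in edges[e]}
       for e in edges}

def edge_slave(e, x):
    """theta_e minus the penalties toward the singleton slaves in its scope."""
    i, j = edges[e]
    return theta_e[e][(x[i], x[j])] - lam[e][i][x[i]] - lam[e][j][x[j]]

def singleton_slave(i, x):
    """theta_i plus one penalty term per edge slave whose scope contains Xi."""
    return theta_i[i][x[i]] + sum(lam[e][i][x[i]]
                                  for e in edges if i in edges[e])

# At a single consistent assignment, the lambdas cancel exactly.
x = {1: 0, 2: 1, 3: 1, 4: 0}
total_slaves = (sum(edge_slave(e, x) for e in edges)
                + sum(singleton_slave(i, x) for i in (1, 2, 3, 4)))
total_theta = (sum(theta_i[i][x[i]] for i in (1, 2, 3, 4))
               + sum(theta_e[e][(x[edges[e][0]], x[edges[e][1]])]
                     for e in edges))
assert abs(total_slaves - total_theta) < 1e-9
```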

Â 11:48

Now, I've done this in the context of a very simple scenario, where I've broken up my model into the factors that existed in the original specification. So, for example, I had factors that were pairwise potentials, so I had a slave for each such factor. But that doesn't have to be the case. In fact, the only constraint that I need to satisfy is that each slave must be defined as a subset of factors that admits a tractable solution, so that each one can be optimized efficiently when considered in isolation.

So for example, if we return to this network, one thing we could do is introduce two larger slaves, instead of the four smaller F, G, K, H slaves: one corresponding to F and G together, so this is the F, G slave, and the other one representing the K, H slave. In this case, when we think about the agreement, we still have the one, two, three, four slaves that represent the individual variables. And so now we're going to require that the F, G slave agrees with the one slave on X1, with the two slave on X2, and with the three slave on X3. And so we have this expression, which has a lambda term for each of these three variables. We have a similar expression for the K, H slave, where in this case the K, H slave has to agree with the one slave on X1, the four slave on X4, and the three slave on X3. If we now think about the singleton slaves, for example X1, we see that X1 has to agree with the F, G slave, so there is a term to encourage that, and with the K, H slave, because it's present in both. On the other hand, the X2 variable only occurs in the first slave, and so it only has a single agreement term, with the F, G slave, and similarly for X3 and X4.

So, how do we then construct the slaves? In pairwise networks, what we typically do is divide the factors into a set of disjoint trees, and this is in fact what I showed in the previous slide: we divided the loop into one tree that went through the first two edges and a second one that went through the other two edges. They are disjoint, which means every edge factor is assigned to exactly one tree. And because they're trees, we know how to optimize them efficiently using techniques like variable elimination or clique trees.

But we also discussed that there are other classes of factors that offer tractable inference. So, for example, we talked about matchings, or models that are associative, or regular, or supermodular, or whatever we call them. All of these are tractable classes, and we can take a whole bunch of factors that fit into one of these tractable categories and optimize them together in one fell swoop. We don't need to divide them into separate edges.
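The tree-slave idea can be sketched for the four-way loop: split its edges into two disjoint chains, say {F, G} over X1-X2-X3 and {K, H} over X3-X4-X1, each of which is a tree and so can be maximized exactly by a variable-elimination style pass rather than brute force. In this illustration (invented random values), eliminating the endpoint variables first and then the middle variable matches the brute-force chain maximum:

```python
import itertools
import random

random.seed(2)
V = (0, 1)
# Invented pairwise potentials for the four edges F, G, H, K of the loop.
theta = {e: {xy: random.random() for xy in itertools.product(V, V)}
         for e in "FGHK"}

def chain_max(t_ab, t_bc):
    """max over (a, b, c) of t_ab[a,b] + t_bc[b,c].

    Eliminates the endpoints a and c first (each maximization only
    looks at one factor), then the shared middle variable b.
    """
    return max(max(t_ab[(a, b)] for a in V) + max(t_bc[(b, c)] for c in V)
               for b in V)

# The {F, G} tree slave, optimized by elimination...
fg = chain_max(theta["F"], theta["G"])
# ...agrees with brute-force enumeration over the chain.
brute = max(theta["F"][(a, b)] + theta["G"][(b, c)]
            for a, b, c in itertools.product(V, repeat=3))
assert abs(fg - brute) < 1e-12
```

On a chain this saves little, but the same elimination idea is what makes large tree slaves tractable while the full loopy model is not.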

So let's look at an example of this. Remember, a while ago, we talked about the problem of 3D cell reconstruction, where we have a cell over here that we're imaging, and we're taking various two-dimensional slices through that cell, and we'd like to reconstruct the three-dimensional structure. So here, as we discussed, we have the slices, and we would like to correspond the little gold beads that we see in these different images to each other, trying to figure out that this bead corresponds to that bead over there. And from that, we can then subsequently obtain the 3D reconstruction.

Now, when we last talked about this, we described it as a matching problem, where the matching weights between a point here and a point here are derived from the similarity of both the location in the 2D slice and the appearance of the local neighborhood. But it turns out that neither of these is sufficiently distinctive to really determine robustly that one point in an image matches another point in the other image. And so, in order to do this with reasonable success, we need another set of potentials, which are pairwise potentials that basically preserve the distances between these marker positions. They tell us that if we want to correspond this point over here to this point, and this point to that point, then the distances between them should also roughly be preserved.

Now, this set of potentials is relatively sparse, and therefore can usually be solved using exact inference. The other set of potentials is not complicated on its own, because it's a matching problem. But if you put them together, it turns out to be a difficult problem to solve. In fact, you can show that a joint set of potentials like this, with these two different categories, is NP-hard. But using dual decomposition, we can solve each of them separately and then communicate information between them, so each of these would form a tractable slave.
