0:00

Perhaps the most common use of the expectation maximization algorithm is to

address the problem of learning with latent variables. These are situations where

there is a certain set of variables that we just never get to observe but they

turn out to be important for capturing some important latent structure in the

distribution of the data instances that we have.

So let's look at a number of these examples and talk about how EM was

applied in those settings. So perhaps the simplest example is in the

context of Bayesian clustering, where we have some kind of latent class

variable and some set of observed features.

And we're trying to basically segment the instances that we have into

categories, where we assume that, for each category,

there is a similar distribution over the observed features.

And, specifically, in many of these applications, that distribution takes the

form of a naive Bayes model, and that is, that the features are

independent once we're given the class. So one example application of this is to

segmenting a set of users into similarity groups.

And so one cute example of that is the following segmentation of users on the

MSNBC website based on which of the stories on the MSNBC site the users chose

to click. And this is due to Jack Breese and colleagues at Microsoft Research. So

here we're going to see four clusters.

The first of these is a cluster that a human gave a name to,

because the machine doesn't give a name to the cluster;

it just calls it cluster one. It turns out to be readers of commerce and

technology stories. So these are people for whom the highest

probability clicks on stories were, for example, "E-mail delivery isn't exactly

guaranteed" or "Should you buy a DVD player?"

So remember, the stories are the features, and they're binary variables: did you

click or not click on that particular article?

And the cluster is the segmentation of the users into these different

categories, so this is category one.
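To make this kind of Bayesian clustering concrete, here is a minimal EM sketch for a mixture of naive Bayes models over binary click features. The data, the two "user groups," and their click probabilities are synthetic stand-ins made up for illustration, not the actual MSNBC data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for click data: two hypothetical user groups with
# different click probabilities for each of four stories.
true_p = np.array([[0.8, 0.7, 0.1, 0.1],   # "tech" readers
                   [0.1, 0.1, 0.8, 0.7]])  # "sports" readers
z_true = rng.integers(0, 2, size=500)
X = (rng.random((500, 4)) < true_p[z_true]).astype(float)

N, D = X.shape
K = 2
pi = np.full(K, 1.0 / K)                    # P(class)
p = rng.uniform(0.3, 0.7, size=(K, D))      # P(click = 1 | class)

for _ in range(50):
    # E-step: responsibilities P(class | clicks) under the naive Bayes model
    log_r = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(pi)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the expected counts
    Nk = r.sum(axis=0)
    pi = Nk / N
    p = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
```

The E-step computes, for each user, a soft assignment to the latent class; the M-step re-estimates the per-class click probabilities from those soft counts.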

The second cluster is people who primarily clicked on sports stories.

So, the highest probability features in this, for this user group were for

example, stories such as Cowboys Are Reborn in Win Over Eagles,

or something like that. The third cluster are the people who went

to the top page and basically clicked on the top stories.

This is 29% of the users. And

one can date this by at least this article; oh,

this one seems pretty perennial. And so that's the third category.

The last one was the surprising one. These are people who traversed all over

the site. So for the previous three categories,

there were actual pages that existed for sports, for top promoted stories, and for

commerce and technology. This last category are people who went

all over the site to look for stories that you might call softer news.

and, so, this was an interesting discovery for the MSNBC website about its

user population and suggested that the website might be better redesigned to

capture this kind of structure and give these users a page where they can find

these articles. A very different application is to speech

recognition, which is one that we've talked about before, in the context of

speech recognition, we've already discussed that the standard

representation of choice is a hidden Markov model, and in a hidden Markov

model, we have hidden variables and we have parameters.

The parameters are the parameters of the HMM: the transition probabilities, for

example, between one phone and the next, or the probability over

the acoustic signal that we observe for different pieces or states within a

phone. The latent variables are the segmentation

of the initial acoustic signal into which state it belongs to,

that is, which phone, and which part of which phone;

those are the hidden variables. And that's an awful lot of hidden

variables. Because if we are just given an

undifferentiated, unsegmented signal, it's very difficult to assign

a small slice of that speech to one phone

versus another. So, in order to train a speech recognition HMM well, the

standard strategy is a bottom up bootstrap training where one first trains

the phone HMM separately for each phone using, a phone database.
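This bootstrap strategy, training per-phone HMMs first and then chaining them to seed a word-level model, might be sketched roughly as below. The phone names, the three-state sub-phone structure, the transition numbers, and the exit probability are all hypothetical illustrations, not the actual speech system:

```python
import numpy as np

# Hypothetical per-phone HMMs, each with 3 sub-phone states, trained
# separately (e.g. by EM on a phone database). Only transitions shown.
phone_trans = {
    "p":  np.array([[0.7, 0.3, 0.0],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]]),
    "uh": np.array([[0.6, 0.4, 0.0],
                    [0.0, 0.6, 0.4],
                    [0.0, 0.0, 1.0]]),
}

def seed_word_hmm(phones, exit_prob=0.3):
    """Assemble a word-level transition matrix from phone HMMs,
    chaining the last state of each phone into the next phone."""
    n = sum(phone_trans[ph].shape[0] for ph in phones)
    T = np.zeros((n, n))
    offset = 0
    for i, ph in enumerate(phones):
        k = phone_trans[ph].shape[0]
        T[offset:offset + k, offset:offset + k] = phone_trans[ph]
        if i + 1 < len(phones):
            # last state of this phone can exit into the next phone
            T[offset + k - 1, offset + k - 1] = 1.0 - exit_prob
            T[offset + k - 1, offset + k] = exit_prob
        offset += k
    return T

T = seed_word_hmm(["p", "uh"])
```

The resulting block-structured matrix is only an initialization; the word-level EM training described next still re-estimates all of these parameters.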

5:08

Now this too is done using EM, because the breakdown within a phone into these

little constituent pieces is not labeled. And so it still requires that we use EM

to address the issue of latent variables, but at least it's now self-contained,

because you're only training a model for a single phone,

like "puh". With that model trained, one

can now take entire words, and use the model, initialized

with the model trained on individual phones, to now train the higher-level model.

And one still retrains the phone HMM parameters

in the context of this larger training regime, where one trains on entire words.

But the fact that one seeded the model with this much more correct

initialization of the parameters allows the segmentation in the E-step to be

performed moderately correctly, and gives rise to a much better local optimum in

the speech recognition problem. A yet different application is that of 3D

robot mapping. So, this is an application that's due to

Thrun et al. Here the input to the problem is

a sequence of point clouds, obtained by a laser range finder, that were collected by

a moving robot, and the target output is a 3D map of the environment as a set of

planes. And you will see the difference as to why we want the planar map when we

look at the demo. Here the parameters of the model are the

locations and the angles of the walls, or the planes, in the environment. So we have no

idea a priori where the walls are and how they're situated.

The latent variables are association

variables, which assign points to walls. And so the EM algorithm effectively, in the

E-step, figures out which points go with which wall; and in the M-step, it figures

out how to move the walls around to better fit the points that were assigned to them.
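In two dimensions, with walls as lines, this alternation can be sketched with hard assignments (a k-means-style variant of EM). The synthetic points, the initial line guesses, and the least-squares refit are illustrative assumptions, not the actual robot-mapping system:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "walls": noisy points on two lines y = a*x + b.
true_lines = [(0.0, 0.0), (0.0, 5.0)]            # (slope, intercept)
pts = []
for a, b in true_lines:
    x = rng.uniform(0, 10, 100)
    pts.append(np.column_stack([x, a * x + b + rng.normal(0, 0.1, 100)]))
X = np.vstack(pts)

lines = [(0.1, 1.0), (-0.1, 4.0)]                # rough initial guesses

for _ in range(20):
    # E-step: assign each point to the nearest line (hard assignment)
    resid = np.stack([np.abs(X[:, 1] - (a * X[:, 0] + b))
                      for a, b in lines], axis=1)
    z = resid.argmin(axis=1)
    # M-step: refit each line to its assigned points by least squares
    new_lines = []
    for k in range(len(lines)):
        Xk = X[z == k]
        a, b = np.polyfit(Xk[:, 0], Xk[:, 1], 1)
        new_lines.append((a, b))
    lines = new_lines
```

The E-step is the point-to-wall association; the M-step moves each wall to fit its assigned points, exactly the alternation described above.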

So what we see here is the raw data collected by the robot.

The red box is the robot moving around in its environment.

The red beams that emanate from the robot are the directions that the laser took in

order to collect the point cloud and what we see here is the point cloud that was

collected by the robot as it traversed the environment.

And even just looking at this image we can already see that there is a lot of

noise in the laser range data, and that is going to give rise, as we can

see, to a very noisy map of the environment.

If we now look at the model constructed by EM:

So now we are going to see the planer map constructed by the robot.

And this is done on the fly actually as the robot is moving.

We can see that walls are constructed when there is

enough data to support them; that is, when enough points are assigned to a wall,

that wall gets constructed, and its pose in the environment gets determined

by the EM algorithm. And so we can see that EM is able to

construct a much more plausible and realistic map of the environment than

just looking at the raw point cloud data.

A, different application is also in the context of 3D laser range scans.

And, you know, we pick those not because these are the most common applications

of EM, but because they give rise to some pretty

cool movies. So, here is the problem of getting, in

this case, 3D range scans of a person in different

poses. And the goal is to see whether, by

collecting a bunch of these poses, one can reconstruct a 3D skeleton of the

person, from the front and from the back.

So, in this particular case, the first problem is actually to

correspond points in the different scans to each other, and in this case that was

done by a different algorithm that I'm not going to talk about now, although it

also used graphical models; in fact, it used a belief propagation algorithm.

But now let's talk about the clustering problem, that is, the problem of assigning

points in each of those scans to body parts.

So here we have the notion of a cluster, which in this case corresponds to a part.

Each part,

in a given scan of the person, has a transformation. That's

why we have a plate around each of the parts: because

for each of those parts there are multiple instantiations, in the multiple

scans that we have of the same person in different poses.

Now, a landmark is a particular point

on that person, in that part. And if we knew the part, that is, if we

had the part assignment, which is a latent, unobserved variable, then we

could predict how the point that we see

on this scan would translate to the point on this scan, because that would be effectively a

deterministic or close to deterministic function of the unobserved,

transformation of the part; that is, the fact that the arms

moved from this pose over here to this raised-arm pose over here.

So given the transformation, and given that we know which part

the point or landmark is assigned to, we can predict how the point transformed

from its original position to its new position.

So this is a way of clustering points into parts, and one can run EM on that.

And if one does that, effectively one gets pretty much garbage, because it turns out

that there is enough muscle deformation to make the actual positions here fairly

noisy, and it's very difficult to get correct part assignments from that. But

if we now add an additional component into this model,

specifically, continuity in space, we can now consider, say, two points that are

adjacent to each other and impose the soft constraint, not the hard constraint

but a soft constraint, that the part assignments of two points that are

adjacent should be softly the same. It doesn't have to be exactly the same,

because otherwise of course everything would be assigned to the same part, but

it's a soft constraint which is actually an MRF constraint.
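A toy version of such a soft, associative constraint: score a part assignment by its per-point (unary) log-potentials plus a bonus whenever adjacent points agree. All numbers and the chain structure here are made up for illustration:

```python
import numpy as np

def score(labels, unary, edges, lam=1.0):
    """Score a part assignment: per-point unary log-potentials plus an
    associative pairwise bonus when adjacent points share a label."""
    s = sum(unary[i][l] for i, l in enumerate(labels))
    s += lam * sum(labels[i] == labels[j] for i, j in edges)
    return s

# Toy chain of 4 adjacent points and 2 candidate parts.
unary = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.1, 0.9],
                  [0.0, 1.0]])
edges = [(0, 1), (1, 2), (2, 3)]

smooth = score([0, 0, 1, 1], unary, edges)   # mostly respects adjacency
noisy  = score([0, 1, 0, 1], unary, edges)   # ignores adjacency
```

Because the pairwise term is a bonus for agreement rather than a hard requirement, points can still switch parts where the unary evidence is strong enough; that is what makes the constraint soft.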

In fact, it's an MRF constraint which is associative, or regular,

as we discussed before. And so this model, which actually

does the clustering not of each point in isolation but rather assigns points

jointly to parts, taking into consideration the geometry of the

person's body, gives rise to considerably better

outputs. And what we see here is the algorithm in

action. And we can see that the algorithm

converges very nicely to a partition of the points into different parts, and from

this one can easily reconstruct the skeleton.

A final application that uses EM in a different model is one that was used in

this helicopter demo: alignment. This is work that was

done at Stanford by Andrew Ng's group.

And here the input to the algorithm is basically different trajectories of the

same aerobatic maneuver flown by different pilots.

So they were all trying to do the same thing, but each person has their own

idiosyncratic way of doing this. And so the exact sequence wasn't, as

we'll see, exactly the same. The goal of this was to produce an output

which aligns the trajectories to each other and at the same time learns a

probabilistic model of what the target or template trajectory ought to have been.

14:25

So how does this get represented as a graphical model? Here we have a latent

trajectory, which is the intended trajectory that the users were

aiming for; the states are labeled as Z's, and we have parameters

that tell us how one would move from Z0 to Z1 and so on.

So those are the model parameters of this Markov model,

although it's a hidden Markov model, as we'll see.

The expert demonstrations, that is, what we saw in terms of what the expert did at

each point in time, is a noisy observation of the intended trajectory.

But, we don't actually know which point in the trajectory the expert observation

comes from. And so we have this additional set of

latent variables, which are the time indices.

And they basically tell us that tau-zero

tells us which of the true states in the intended

trajectory the expert was at at this point; and tau-one tells us which

point in the intended trajectory the expert was at at time one, and similarly for

time two, and so on. Here, K indexes the expert

demonstration; so K is the index on the expert.
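Inferring these latent time indices is essentially a monotone alignment problem. A minimal dynamic-programming sketch in that spirit, classic dynamic time warping on one-dimensional profiles rather than the actual helicopter model, looks like this:

```python
import numpy as np

def align(demo, template):
    """Monotone alignment of a demonstrated trajectory to a template:
    a dynamic-time-warping sketch of inferring the latent time indices."""
    n, m = len(demo), len(template)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(demo[i - 1] - template[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A pilot's slightly stretched version of the same maneuver profile
# aligns perfectly; a genuinely different maneuver does not.
template = [0.0, 1.0, 2.0, 1.0, 0.0]
demo     = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]

stretched_cost = align(demo, template)
shifted_cost   = align([x + 1.0 for x in demo], template)
```

In the full model, this alignment step would be the E-step over the tau variables, alternated with re-estimating the template trajectory itself.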

17:32

So one important issue regarding latent variables is the number of values of the

latent variables, and this is something that, has been implicit in many of the

applications that I've mentioned in this module.

So, how many user clusters do we have? Or how many, different pieces of a phone

are there? And, so how do we pick that cardinality

for the latent variable? If we use likelihood to evaluate a model

with fewer values for the latent variable versus one with more,

then, in the same way that likelihood always overfits,

we will find that more values is always better, because it is a strictly more

expressive model and can therefore achieve a higher likelihood.

Now, one can circumvent that in the same way that we circumvented the likelihood

overfitting phenomenon in the context of other structured learning algorithms.

So we can use, for example, a score that penalizes complexity,

such as the BIC score. As we mentioned back then, the BIC score

tends to under-fit and so there has been a range of extensions to the Bayesian

score for the context of incomplete data. These are always approximations because

we can't actually do the integration required for the Bayesian score for

incomplete data. And so we have approximations to the

Bayesian score that turn out to work much better in practice than BIC in many

practical applications. There's also other strategies that people

have adopted. For example, we can use metrics of

cluster coherence to look at a cluster and say:

ooh, it seems like this cluster is really incoherent,

so maybe there are really two clusters in there, and maybe we should split it up.

Or, if these two clusters are very similar

to each other, maybe we should merge them.

And there are various heuristic exploration algorithms that basically

split clusters and add clusters in a way that hopefully converges to the right

number of clusters. And finally, of great popularity in the

last few years, is a range of methods that are based on Bayesian techniques.

These are methods, and the most commonly used class is called Dirichlet

processes, which, instead of picking a single

cardinality for the latent variable, basically maintain a distribution over

the cardinality. And then, instead of picking a single

value for that cardinality, they average over the different cardinalities, often using

techniques such as Markov chain Monte Carlo.

Although there are also other approaches. But it turns out that the problem

of picking the latent variable cardinality is often one of the trickier

issues that one has to deal with when learning with latent variables.
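The BIC trade-off mentioned above, where likelihood improves with more latent values but the complexity penalty grows, might be sketched as follows. The log-likelihood values and the parameter counts here are made up for a hypothetical four-feature binary clustering model:

```python
import numpy as np

def bic(log_lik, n_params, n_samples):
    """BIC score: log-likelihood penalized by model dimension."""
    return log_lik - 0.5 * n_params * np.log(n_samples)

# Hypothetical fits of a 4-feature binary naive Bayes clustering model
# with K latent classes: (K - 1) mixing weights + 4*K click parameters.
n = 500
fits = {2: -1100.0, 3: -1070.0, 4: -1068.0}   # assumed log-likelihoods
scores = {K: bic(ll, (K - 1) + 4 * K, n) for K, ll in fits.items()}
best_K = max(scores, key=scores.get)
```

Note how the raw log-likelihoods increase monotonically with K, as the lecture warns, while the penalized score peaks at an intermediate cardinality.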

To summarize. Latent variables are a really common

scenario in the context of learning graphical models.

And it is perhaps the most common setting for incomplete data.

And it turns out that, when we want to construct models for richly structured

domains, the introduction of these latent

variables can be critical to the success of the

model. We've already mentioned that

latent variables satisfy the missing-at-random assumption.

And so the expectation maximization algorithm is applicable in this case.

and is very commonly used for optimizing latent variable models.

However, as we've already discussed previously,

when we're learning with latent variables, we have serious problems both

with identifiability of the learned model, as well as with

multiple optima, which are often symmetric reflections of each other, which

is a facet of unidentifiability, but also ones that are not reflections or

just symmetric permutations of each other; and those really necessitate a good

initialization strategy. And we mentioned that picking the

cardinality for the latent variables is a key question in these latent-variable

models, and one to which a lot of work has been devoted, especially in the last

few years.