[MUSIC]

Are ice cream sellers evil? Probably not, or at least not all of them. But I can easily imagine a situation where the price of ice cream goes up whenever the temperature outside goes up. And if that is indeed the case, we may see a plot like this: on the x-axis we have the temperature, and on the y-axis we have the price of the ice cream. Each data point corresponds to one day on which we measured the temperature, asked some ice cream seller about his price, and plotted the pair in the two-dimensional plane. We can see that these two variables are strongly correlated and related to each other. Can we exploit this closeness of the two random variables?

Well, we may say that these two variables are so related that you can use one to measure the other. For example, if you want to know the temperature outside, and you forgot both your thermometer and your smartphone, you can ask the closest ice cream seller about his price and compute the temperature from that. Which basically means that these two numbers are so related that you don't have to use both: you can just as well use one of them and compute the other from it.

Or to put it a little differently, you can draw a line that goes through your data, roughly aligned with it, and then project each data point onto this line. This way, instead of two numbers to describe each data point, you can use one: the position on this line. And you will not lose much. If you look at the lengths of the projection segments, that is, how far each blue data point travels to its corresponding orange one, you see that these lengths are small, so you keep most of the information in your data set by projecting onto this line. And now, instead of the two-dimensional data, you can use the one-dimensional projection, the position on this line, as your description of each data point.

This is just another way of saying that these two random variables are so connected that you don't have to use both: you may just as well use one to describe the pair. And this is exactly the idea of dimensionality reduction: you take two-dimensional data and project it into one dimension, trying to keep as much information as possible. One of the most popular ways to do this is called principal component analysis, or PCA for short. PCA tries to find the best possible linear transformation that projects your two-dimensional data into 1D, or, more generally, your multidimensional data into lower dimensions, while keeping as much information as possible.

So PCA is cool. It gives you an optimal solution to this kind of problem, and it has an analytical solution: you can just write down the formula for the solution of the PCA problem, and this analytical formula is very fast to compute. So if you give me 10,000-dimensional points, I can return the same points projected into, say, ten dimensions while keeping most of the information, and I can do it in milliseconds. It's really fast.
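As a quick sketch of that analytical solution (using hypothetical temperature/price data made up for illustration, not anything from the lecture), PCA can be computed with a single SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "temperature vs. price" data: price roughly tracks temperature.
temperature = rng.uniform(10, 35, size=200)
price = 0.1 * temperature + 1.0 + rng.normal(0.0, 0.2, size=200)
X = np.column_stack([temperature, price])       # shape (200, 2)

# PCA via SVD: center the data, take the top right singular vector.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
direction = Vt[0]                               # first principal axis (unit vector)

t = Xc @ direction                              # 1-D description of each point
X_proj = mean + np.outer(t, direction)          # the "orange" projected points

# Fraction of total variance kept by the 1-D projection.
kept = S[0] ** 2 / np.sum(S ** 2)
```

The squared singular values give the variance along each principal axis, which is where the "how much information do we keep" number comes from.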

But sometimes people are still not happy with this PCA and try to reformulate it in probabilistic terms. Why? Well, formulating your usual problem in probabilistic terms may give you some benefits, like being able to handle missing data, for example. So in the original paper that proposed this probabilistic version of PCA, they projected some multidimensional data into two dimensions, so that the data can be plotted on a two-dimensional plane. They then obscured some of the data, that is, introduced missing values by throwing away some of the features, and projected this data set with missing values again. And you can see that these two projections don't differ that much, which means that we don't lose that much information by throwing away part of the data. Which is really cool, right? We were able to handle these missing values, and the solution barely changes when we introduce them, so we are really robust to missing values.

By the way, the paper that proposed this probabilistic principal component analysis is really good, so check it out if you have time and if you want to know more details about this model. So let's try to derive the main ideas behind probabilistic principal component analysis in the following few slides.

So first of all, it's natural to call this low-dimensional representation of your data, in this example the one-dimensional position of each orange data point, a latent variable. Because it's something you don't know, you don't observe it directly, and it causes your data somehow. The position of an orange data point on the line, this ti, influences where the data point ends up on the two-dimensional plane, so it influences the position of the observed point, right? So it's natural to introduce a latent variable model where ti causes xi. And you have to define some prior for ti, so why not just set it to the standard normal? This will simply mean that your low-dimensional projections will be somewhere around 0 and will have variance around 1. Which, why not? It's a nice property to have.

Now we have to define the likelihood, the probability of xi given ti. So how are xi and ti connected, that is, how is the one-dimensional data related to the two-dimensional data? Well, if you look at the orange two-dimensional projection of x, it equals some vector times the position on the one-dimensional line plus some shift vector. So we can linearly transform from this one-dimensional line to the two-dimensional space and get these orange projected points. Or, more generally, we can multiply ti by some matrix W and then add some bias vector b, and we will get our orange projections, xi. And this W and b will be our parameters, which we aim to learn from the data.

Okay, but these are the orange points, right? How can we recover the blue points, the original data?

Well, it's kind of impossible to say exactly where a blue point will be if you only know the orange point, because you don't know how much information you lost, right? But you don't have to say it exactly; you can model it probabilistically. So let's say that the blue point xi, which we observe, is just the orange point plus some random noise, which is centered around 0 and has some covariance matrix sigma, which we will also treat as a parameter. This way we are kind of saying that our observed blue data points are the same as the projections of our one-dimensional data into 2D plus some Gaussian noise. Which means that we don't actually know where the blue points occur, but we expect them to be somewhere around the orange points, around the projections.
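This generative story can be sketched in a few lines of code; the particular values of W, b and the noise covariance below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up parameters for a 1-D latent, 2-D observed model.
W = np.array([[2.0], [0.5]])    # 2x1 loading matrix
b = np.array([1.0, -1.0])       # shift (bias) vector
Sigma = 0.05 * np.eye(2)        # noise covariance

n = 1000
t = rng.standard_normal((n, 1))       # prior: ti ~ N(0, 1)
orange = t @ W.T + b                  # "orange" points: W ti + b, on a line in 2-D
blue = orange + rng.multivariate_normal(np.zeros(2), Sigma, size=n)  # observed xi
```

The blue points scatter around the orange ones exactly as described: the line W t + b carries the signal, and Sigma controls how far the observations stray from it.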

Okay, so we have a latent variable model like this: ti causes xi, and we have defined the model fully. The prior on ti is standard normal, and the likelihood, xi given ti, is also some normal distribution. Now we want to train this kind of model, so we want to find the parameters by, for example, maximum likelihood estimation. Well, first of all, as usual, we will assume that the likelihood factorizes into a product of the likelihoods of the individual objects, so the full likelihood equals the product of the likelihoods of the data points.

And then we can rewrite this marginal likelihood of an individual object by marginalizing out ti: it's the joint distribution, p of xi and ti, and then we have to sum ti out. But where we previously had sums, we now have an integral, because this latent variable ti is continuous, and to sum it out means to integrate it out. Note that in general this integral is intractable, and it's really hard to optimize this function, because we can't even compute it at any given point. So it's really cool that the EM algorithm allows you to optimize these kinds of functions, even though you sometimes can't compute them at any given point. But in this particular situation, you don't need that.

So everything is normal, everything is conjugate here, which means that you can analytically integrate this latent variable ti out. So you can now do this integral, and it will also be a normal distribution with some parameters, which you can look up on Wikipedia. And then you have a product of Gaussians, and you can analytically compute the optimal parameters. So you can take the logarithm of this thing, compute the gradient, set this gradient equal to 0, and find the maximum likelihood parameters analytically.
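For this model, the "look it up on Wikipedia" result is the standard Gaussian marginalization identity: if ti ~ N(0, 1) and xi given ti ~ N(W ti + b, Sigma), then integrating ti out gives xi ~ N(b, W W^T + Sigma). A quick Monte Carlo check, with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up parameter values, same shapes as in the model (1-D latent, 2-D data).
W = np.array([[2.0], [0.5]])
b = np.array([1.0, -1.0])
Sigma = np.array([[0.1, 0.0], [0.0, 0.2]])

# Sample from the model: ti ~ N(0, 1), then xi = W ti + b + Gaussian noise.
n = 200_000
t = rng.standard_normal((n, 1))
x = t @ W.T + b + rng.multivariate_normal(np.zeros(2), Sigma, size=n)

# The analytic marginal after integrating ti out: xi ~ N(b, W W^T + Sigma).
C_analytic = W @ W.T + Sigma
C_empirical = np.cov(x.T)
```

The empirical mean and covariance of the samples match b and W W^T + Sigma, which is exactly the Gaussian you then maximize over W, b, and Sigma.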

And somewhat unexpectedly, we find that the optimal parameters of this probabilistic model are given by exactly the same formulas as PCA. So look what happened. We started with PCA, we interpreted it probabilistically, we found the maximum likelihood parameters for this probabilistic model, and they turned out to be the same as the original formulas for PCA. It's kind of unsettling, because we spent this whole, I don't know, ten minutes, and we didn't get anything new from it.
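For reference, in the isotropic-noise case Sigma = sigma^2 I (the case worked out in the Tipping and Bishop paper; the lecture's general Sigma is slightly broader), that closed-form maximum likelihood solution can be written directly from the eigendecomposition of the sample covariance. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: a dominant 1-D direction in 3-D plus isotropic noise.
n, d, q = 5000, 3, 1
W_true = np.array([[3.0], [1.0], [-2.0]])
X = rng.standard_normal((n, q)) @ W_true.T + rng.normal(0.0, 0.3, size=(n, d))

# Eigendecomposition of the sample covariance, sorted in descending order.
S = np.cov(X.T)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

# Closed-form ML estimates (Tipping & Bishop, Sigma = sigma^2 I):
#   sigma2_ml = average of the d - q discarded eigenvalues
#   W_ml      = U_q (Lambda_q - sigma2_ml I)^{1/2}   (up to a rotation)
sigma2_ml = lam[q:].mean()
W_ml = U[:, :q] * np.sqrt(lam[:q] - sigma2_ml)

# The implied marginal covariance W_ml W_ml^T + sigma2_ml I reproduces the
# top-q eigenvalues of S exactly and flattens the rest to sigma2_ml.
C = W_ml @ W_ml.T + sigma2_ml * np.eye(d)
```

The columns of W_ml span the same subspace as the top principal components, which is the sense in which the probabilistic solution "is" PCA.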

We got the same formulas as PCA, but it turns out that this probabilistic interpretation is still useful. So here we don't need the EM algorithm at all, because we have everything analytical and nice. But if we change the model a little bit, then we will not be able to compute anything analytically anymore, yet with EM we will still be able to train it. So let's say you introduce missing values: you do not observe some parts of your xi's. Then you have more latent variables than you used to have, and you can't find the maximum likelihood parameters analytically anymore. But you can still apply EM, and this will give you some valid solution. So in the next video, we will talk a little bit about how to apply the EM algorithm in this case.