[MUSIC]

Hi, I am Dmitry Vetrov, research professor at the Higher School of Economics,

head of the Bayesian Methods Research Group, and scientific advisor of Alex and Daniel.

In this lecture, I would like to tell you about one

successful example of how [INAUDIBLE] can be combined with Bayesian inference.

In this lecture,

we will briefly review how Bayesian methods can be scaled to big data.

So, suppose we are given a machine learning problem with training data (X, Y),

where X contains the observed variables and Y contains the hidden variables to be predicted.

And we have a probabilistic classifier, which gives us the probabilities of

the hidden components given the observed ones, p(y | x, W), and which is parameterized by W.

Since we are Bayesians, we also establish a reasonable prior, p(W).

And from the Bayesian point of view, at the training stage,

we need to compute the posterior distribution, p(W | X, Y).

This posterior distribution contains all available information about W

we could extract from our training data.

And this is the result of Bayesian training.

At the test stage,

we need to perform averaging with respect to this posterior distribution.

So we are not applying just a single classifier,

we are applying an ensemble, and the weight of each classifier is given by

our posterior distribution p(W | X, Y).
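A minimal sketch of this ensemble view, assuming a logistic classifier and assuming we already have a handful of samples of W drawn from the posterior. The names `predictive_prob` and `samples` and the numbers are hypothetical, chosen only for illustration. Averaging the predictions over posterior samples with equal weights is a Monte Carlo approximation of the posterior-weighted ensemble.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predictive_prob(x, weight_samples):
    """Average p(y=1 | x, W) over samples of W assumed to come from the posterior."""
    probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in weight_samples]
    return sum(probs) / len(probs)

# Hypothetical posterior samples for a 2-dimensional logistic classifier.
samples = [[1.0, -0.5], [0.8, -0.2], [1.2, -0.9]]
print(predictive_prob([2.0, 1.0], samples))
```

A single classifier would use one W only; here every sampled W votes, and the spread of the votes reflects the uncertainty in W.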

So this is how it should work in theory, but in practice of course that is not so,

and the problem is in these two integrals.
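For reference, the two integrals can be written out as follows. The test-point symbols x^* and y^* are notation assumed for this sketch and are not taken from the lecture.

```latex
% Training stage: the normalizing constant of the posterior is an integral over W.
p(W \mid X, Y) = \frac{p(Y \mid X, W)\, p(W)}{\int p(Y \mid X, W)\, p(W)\, dW}

% Test stage: averaging the classifier over the posterior is a second integral over W.
p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, W)\, p(W \mid X, Y)\, dW
```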

So they are usually intractable, and they are usually taken over very high-dimensional spaces.

For example, in the case of deep learning,

the dimensionality of W can be tens of millions of parameters.

And since the integrals are intractable,

we cannot approximate them even very roughly.

This was the reason why, until very recently,

Bayesian methods were considered to be not scalable.

The situation has changed with the development of

so-called stochastic variational inference.

Instead of trying to solve the Bayesian inference problem exactly,

that is, to find the true posterior distribution

p(W | X, Y), we approximate it with a distribution from some parametric family,

so the distribution is q(W | φ).

The approximation is found by minimizing some kind of distance measure between

the two distributions, our variational approximation and the true posterior.

There can be different distance measures, but

one of the most popular is the so-called KL divergence between q and p.
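As a small illustration of the KL divergence, here is a sketch for the special case of two one-dimensional Gaussians, where it has a well-known closed form. The function name `kl_gauss` is hypothetical, and the Gaussian case is an assumption made for simplicity; the lecture does not restrict q or p to Gaussians.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # identical distributions give 0.0
print(kl_gauss(0.0, 1.0, 1.0, 1.0))  # 0.5
# Swapping the arguments changes the value: KL is not symmetric.
print(kl_gauss(1.0, 1.0, 0.0, 2.0), kl_gauss(0.0, 2.0, 1.0, 1.0))
```

The asymmetry is why the order of the arguments matters when the KL divergence is used as the distance measure.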

As was mentioned in previous lectures, in this case, the optimization problem

is exactly equivalent to maximizing the so-called ELBO, or evidence lower bound.
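The equivalence follows from a standard identity, sketched here with φ denoting the parameters of the approximating distribution q and with the symbol \mathcal{L} for the ELBO (the symbol is an assumption of this sketch). The left-hand side does not depend on φ, so decreasing the KL term and increasing \mathcal{L} are the same thing.

```latex
\log p(Y \mid X)
  = \mathcal{L}(\phi)
  + \mathrm{KL}\big(q(W \mid \phi) \,\|\, p(W \mid X, Y)\big),
\qquad
\mathcal{L}(\phi)
  = \int q(W \mid \phi)\,
    \log \frac{p(Y \mid X, W)\, p(W)}{q(W \mid \phi)}\, dW
```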

So the ELBO itself is an integral, again, over a very high-dimensional space.

And the integral itself is still intractable.

But we do not need to compute the integral exactly.

All we need to do is to optimize the integral with respect to the variational parameters.

And surprisingly,

it turns out that this is possible using the stochastic optimization framework.

So our ELBO actually has several very nice properties.

One of them is that the training likelihood, p(Y | X, W),

is inside a logarithm.

This means that it can be split into a sum of log-likelihoods of individual training

objects.

And this means that at each iteration of our optimization,

we do not need to compute the whole training log-likelihood.

We can compute only its unbiased estimate given by a tiny mini-batch of data.

So this means that the ELBO supports mini-batching.
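A minimal sketch of the mini-batch estimate, assuming a toy Gaussian likelihood as a stand-in for log p(y_i | x_i, W). The model, the data, and the names are hypothetical. The batch sum is rescaled by the ratio of the data set size to the batch size, and averaging these estimates over one full pass through a partition of the data recovers the full sum.

```python
import math
import random

def log_lik(y, w):
    """Log-density of y under N(w, 1); a stand-in for log p(y_i | x_i, W)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - w) ** 2

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
w = 1.5
n, batch_size = len(data), 10

# Full training log-likelihood: a sum over all objects.
full = sum(log_lik(y, w) for y in data)

# Each mini-batch estimate rescales the batch sum by n / batch_size.
rng.shuffle(data)
estimates = [
    (n / batch_size) * sum(log_lik(y, w) for y in data[i:i + batch_size])
    for i in range(0, n, batch_size)
]

# Averaged over a partition of the data, the estimates recover the
# full sum (up to floating-point error).
mean_estimate = sum(estimates) / len(estimates)
print(full, estimates[0], mean_estimate)
```

A single estimate is noisy, but it costs 10 likelihood evaluations instead of 1000, which is what makes each optimization step cheap.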

Another good property is that the ELBO is still an expectation with respect to

our variational approximation.

And this means that, if we need to obtain an unbiased stochastic gradient,

we can replace this integral with its unbiased Monte Carlo estimate.

For this purpose, we also need to perform the reparameterization trick in order to

reduce the variance of the stochastic gradient.

So then we may simply sample from a distribution which is now parameter-free,

and use the Monte Carlo estimate to compute gradients.
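A minimal sketch of the reparameterization trick, assuming a one-dimensional Gaussian q with mean `mu` and standard deviation `sigma`, and a toy integrand f(w) = w^2 in place of the log-likelihood. All names are hypothetical. The sample is written as w = mu + sigma * eps with eps drawn from a parameter-free N(0, 1), so the gradient can be pushed inside the expectation and estimated by Monte Carlo.

```python
import random

def reparam_grads(mu, sigma, num_samples=200000, seed=0):
    """Monte Carlo gradients of E_{w ~ N(mu, sigma^2)}[w^2] w.r.t. mu and sigma."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise from a parameter-free N(0, 1)
        w = mu + sigma * eps        # deterministic function of (mu, sigma, eps)
        df_dw = 2.0 * w             # derivative of f(w) = w^2
        g_mu += df_dw               # chain rule: dw/dmu = 1
        g_sigma += df_dw * eps      # chain rule: dw/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

g_mu, g_sigma = reparam_grads(mu=1.0, sigma=0.5)
# Analytically E[w^2] = mu^2 + sigma^2, so the exact gradients are
# 2 * mu = 2.0 and 2 * sigma = 1.0; the estimates should be close to these.
print(g_mu, g_sigma)
```

In practice a single sample per step is often enough, because the noise averages out over many optimization steps.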

Another good property is that the richer the variational family is,

the better we approximate the true posterior distribution,

so we do not have a risk of overfitting.

The more variational parameters we have,

the closer we are to the true posterior distribution.

And finally,

we can split the ELBO into two parts by splitting the logarithm inside the integral.

So then there will be two integrals, two terms.

The first term is called the data term, and it is simply the expectation, with respect

to our variational approximation, of the training log-likelihood.

And the second term is the negative KL divergence between our variational

approximation and the prior distribution.
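Written out, with φ denoting the parameters of the approximating distribution q, the decomposition looks as follows. The symbol \mathcal{L} for the ELBO, the index i, and the data set size N are notation assumed for this sketch; the sum over objects additionally assumes that the training objects are conditionally independent given W.

```latex
\mathcal{L}(\phi)
  = \underbrace{\mathbb{E}_{q(W \mid \phi)}\big[\log p(Y \mid X, W)\big]}_{\text{data term}}
  \;-\; \underbrace{\mathrm{KL}\big(q(W \mid \phi) \,\|\, p(W)\big)}_{\text{regularizer}},
\qquad
\log p(Y \mid X, W) = \sum_{i=1}^{N} \log p(y_i \mid x_i, W)
```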

If we now forget the second term and

optimize just the first term with respect to all possible distributions,

we'll end up with a delta function at the maximum likelihood point.

So there will be a delta function at W_ML.

And the second term,

the regularizer, prevents us from collapsing to a delta function.

It penalizes large deviations from the prior distribution.

If we optimize both terms with respect to all possible distributions,

we'll end up with the true posterior distribution.

But since the true posterior is intractable,

we limit the set of possible variational approximations.

And then we'll end up with a variational distribution which is the closest in

terms of KL divergence to our true posterior distribution.

So to conclude, this is stochastic variational inference.

This is a highly scalable technique that provides us with approximate

Bayesian inference.

The use of stochastic optimization and

the reparameterization trick makes SVI applicable to large data sets.
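To make the whole recipe concrete, here is a minimal sketch on a toy problem where the exact posterior is known, so the result can be checked. Everything in it is an assumption for illustration, not the lecture's own example: the model y_i ~ N(w, 1) with prior w ~ N(0, 1), the Gaussian approximation q(w) = N(m, s^2), the function name, and the step size. The objective is the ELBO divided by the data set size, which only rescales the gradient.

```python
import math
import random

def svi_gaussian_mean(data, batch_size=20, steps=30000, lr=0.02, seed=0):
    """Fit q(w) = N(m, s^2) to the posterior of w for y_i ~ N(w, 1), w ~ N(0, 1)."""
    rng = random.Random(seed)
    n = len(data)
    m, rho = 0.0, 0.0  # variational parameters; s = exp(rho) keeps s positive
    for _ in range(steps):
        s = math.exp(rho)
        eps = rng.gauss(0.0, 1.0)             # parameter-free noise
        w = m + s * eps                       # reparameterized sample from q
        batch = rng.sample(data, batch_size)  # random mini-batch
        # Mini-batch estimate of d/dw of (1/n) * sum_i log p(y_i | w).
        dlik_dw = sum(y - w for y in batch) / batch_size
        # Gradients of ELBO / n; the KL to the N(0, 1) prior is in closed form:
        # KL = 0.5 * (s^2 + m^2 - 1) - rho.
        grad_m = dlik_dw - m / n
        grad_rho = dlik_dw * eps * s - (s * s - 1.0) / n
        m += lr * grad_m                      # stochastic gradient ascent
        rho += lr * grad_rho
    return m, math.exp(rho)

random.seed(1)
data = [random.gauss(2.0, 1.0) for _ in range(100)]
m, s = svi_gaussian_mean(data)

# Exact posterior for this conjugate model: N(sum(y) / (n + 1), 1 / (n + 1)).
exact_mean = sum(data) / (len(data) + 1)
exact_std = (1.0 / (len(data) + 1)) ** 0.5
print(f"SVI:   mean={m:.3f}, std={s:.3f}")
print(f"exact: mean={exact_mean:.3f}, std={exact_std:.3f}")
```

Because the true posterior here is itself Gaussian, the variational family contains it, and the fitted mean and standard deviation should land close to the exact ones, up to the noise of the stochastic gradient. Each step touches only 20 of the 100 objects and a single sample of w.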

And in the next section, I will tell you about dropout and

how it can be interpreted from a Bayesian point of view.