In this section we will review dropout and its connections with the Bayesian framework.
So dropout was invented in 2011 and has become a popular regularization technique.
We know that it works well in practice, and we know that it prevents overfitting.
The essence of dropout is simply the injection of noise into the weights, or into the activations, at each iteration of training.
The magnitude of this noise is defined by the user and is usually called the dropout rate.
The noise can take different forms.
For example, it can be Bernoulli noise, in which case we talk about binary dropout,
or it can be Gaussian noise, in which case we talk about Gaussian dropout.
Let us consider Gaussian dropout in detail.
At each iteration of training, we generate Gaussian noise Epsilon ij with a mean of 1 and variance Alpha.
We multiply each weight Theta ij by Epsilon ij and obtain noisified versions of the weights, wij.
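As a rough illustration of this noise injection, here is a minimal NumPy sketch of a Gaussian-dropout forward pass for a single linear layer; the layer sizes and the value of Alpha below are arbitrary assumptions, not from the lecture.

```python
# Minimal sketch of Gaussian dropout on one linear layer (NumPy).
# Shapes and alpha are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout_forward(x, theta, alpha):
    """Multiply each weight theta_ij by epsilon_ij ~ N(1, alpha) before the matmul."""
    eps = rng.normal(loc=1.0, scale=np.sqrt(alpha), size=theta.shape)
    w = theta * eps               # noisified weights: w_ij = theta_ij * eps_ij
    return x @ w

x = rng.normal(size=(32, 100))      # a batch of 32 inputs
theta = rng.normal(size=(100, 10))  # weights of a 100 -> 10 layer
out = gaussian_dropout_forward(x, theta, alpha=0.25)
```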
Finally, we compute the stochastic gradient of the log-likelihood given these noisified weights, w.
But this is exactly the same stochastic gradient we would obtain if we optimized, with respect to Theta,
the expectation of the log-likelihood over a Gaussian distribution on w with mean Theta and variance Alpha Theta squared.
So the distribution itself is fully factorized.
To show this, let us first perform the reparameterization trick.
We change the distribution over w into a distribution over Epsilon.
Epsilon now has a mean of 1 and a variance of Alpha, and it is still fully factorized.
And the log-likelihood is computed at the point Theta times Epsilon.
Now the probability density does not depend on Theta, and we may move the differentiation inside the integral.
Then we may replace the integral by its Monte Carlo estimate and obtain exactly the same expression as on the previous slide.
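To make this concrete, here is a small sketch of the reparameterization trick and the resulting one-sample Monte Carlo gradient estimate; the quadratic toy "log-likelihood" and all shapes are assumptions for illustration, not the lecture's model.

```python
# Sketch: sampling w ~ N(theta, alpha * theta^2) via w = theta * eps, eps ~ N(1, alpha),
# and the one-sample Monte Carlo gradient of E_q[log p(D | w)] with respect to theta.
import numpy as np

rng = np.random.default_rng(1)

def log_lik(w, x, y):
    return -np.sum((x @ w - y) ** 2)          # toy log-likelihood

def grad_log_lik_w(w, x, y):
    return -2.0 * x.T @ (x @ w - y)           # gradient with respect to w

theta = rng.normal(size=(5, 1))
alpha = 0.3
x = rng.normal(size=(20, 5))
y = rng.normal(size=(20, 1))

eps = rng.normal(1.0, np.sqrt(alpha), size=theta.shape)  # eps ~ N(1, alpha)
w = theta * eps                                          # reparameterized sample of w

# Chain rule through w = theta * eps: this is exactly the Gaussian-dropout gradient.
grad_theta = grad_log_lik_w(w, x, y) * eps
```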
So now we know that Gaussian dropout optimizes the following objective:
the expectation of the log-likelihood with respect to a distribution over w
that is fully factorized and Gaussian, with mean Theta ij and variance Alpha Theta ij squared.
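In symbols, the objective just described can be written as follows (my notation, matching the lecture's description):

```latex
\max_{\theta}\;
\mathbb{E}_{q(W \mid \theta, \alpha)} \big[ \log p(D \mid W) \big],
\qquad
q(W \mid \theta, \alpha) \;=\; \prod_{ij} \mathcal{N}\!\big(w_{ij} \,\big|\, \theta_{ij},\; \alpha\,\theta_{ij}^{2}\big).
```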
This looks pretty much the same as the first term of the ELBO, where as
the variational approximation we use a fully factorized Gaussian distribution.
But where is the second term?
Where is the KL divergence?
Remember that the ELBO consists of two terms: the data term and
the negative KL divergence, which is our regularizer.
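Written out, the ELBO described here is (again in my notation):

```latex
\mathcal{L}(\theta, \alpha)
  \;=\;
  \underbrace{\mathbb{E}_{q(W \mid \theta, \alpha)} \big[\log p(D \mid W)\big]}_{\text{data term}}
  \;-\;
  \underbrace{\mathrm{KL}\big(q(W \mid \theta, \alpha)\,\big\|\,p(W)\big)}_{\text{regularizer}}.
```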
In Gaussian dropout,
we have shown that we optimize just the first term with respect to Theta.
So if we manage to find a prior distribution p(W) such that the second term
depends only on Alpha and does not depend on Theta,
then we will have proven that these two procedures are exactly equivalent.
Remember that in Gaussian dropout, Alpha is assumed to be fixed.
And if Alpha is fixed, then optimization of the ELBO with respect to Theta
is equivalent to optimization of just the first term.
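The reasoning can be spelled out in one line: if the KL term is some function f(Alpha) that does not involve Theta, then for fixed Alpha it is an additive constant and drops out of the maximization over Theta.

```latex
\arg\max_{\theta}\, \mathcal{L}(\theta, \alpha)
  \;=\; \arg\max_{\theta} \Big( \mathbb{E}_{q(W \mid \theta, \alpha)}\big[\log p(D \mid W)\big] - f(\alpha) \Big)
  \;=\; \arg\max_{\theta}\, \mathbb{E}_{q(W \mid \theta, \alpha)}\big[\log p(D \mid W)\big].
```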
Surprisingly, such a prior distribution exists, and
it is known from information theory.
It is the so-called improper log-uniform prior.
It is again fully factorized,
and each of its factors is proportional to 1 over the absolute value of wij.
This is an improper distribution, so it cannot be normalized.
Nevertheless, it has several quite nice properties.
For example, if we consider the logarithm of the absolute value of wij,
it is easy to show that it is uniformly distributed from minus to plus infinity,
which is again an improper probability distribution.
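In symbols, this prior reads:

```latex
p(W) \;\propto\; \prod_{ij} \frac{1}{|w_{ij}|},
\qquad\text{equivalently}\qquad
p\big(\log |w_{ij}|\big) \;=\; \text{const on } (-\infty, +\infty).
```

Both statements describe the same improper, scale-invariant density, which is why it is called log-uniform.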
For us, it is important that this prior distribution, roughly speaking,
penalizes the precision with which we are trying to find wij.
We may easily show that the KL divergence between our Gaussian variational
approximation and this prior distribution depends only
on Alpha, and does not depend on Theta.
The KL divergence is still an intractable function, but
now it is a function of just the one-dimensional parameter Alpha,
and it can be easily approximated by a smooth, differentiable function.
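The claim that this KL term is a one-dimensional function of Alpha can also be checked numerically. The formula inside the sketch below, exact only up to an additive constant (which is all one can ask of an improper prior), is my own rearrangement obtained by substituting w = Theta times Epsilon; the sample size is an arbitrary assumption.

```python
# Sketch: up to an additive constant,
#   KL( N(theta, alpha*theta^2) || log-uniform ) = -0.5*log(alpha) + E_{eps ~ N(1, alpha)}[ log|eps| ],
# so theta cancels and the KL is a function of alpha alone.
import numpy as np

rng = np.random.default_rng(2)

def kl_up_to_const(alpha, n_samples=1_000_000):
    eps = rng.normal(1.0, np.sqrt(alpha), size=n_samples)
    return -0.5 * np.log(alpha) + np.mean(np.log(np.abs(eps)))

for alpha in (0.1, 0.5, 1.0):
    print(alpha, kl_up_to_const(alpha))   # depends on alpha only; theta never appears
```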
In the figure, the black dots are the exact values of the KL divergence
for different values of Alpha.
And the red curve is our smooth, differentiable approximation.
And the existence of this smooth, differentiable approximation means that
potentially we may optimize the KL divergence with respect to Alpha,
and hence optimize the ELBO with respect to both Theta and Alpha.
And this is what we are going to do in the next lecture.
So to conclude, dropout is a popular regularization technique.
The essence of dropout is simply the injection of noise at each iteration of training.
In this lecture, we have shown that one popular kind of dropout,
so-called Gaussian dropout,
is exactly equivalent to a special kind of variational Bayesian procedure.
And this understanding, the understanding that dropout is a particular case
of Bayesian inference, allows us to construct various generalizations of
dropout that may possess several quite interesting properties.
We'll review one of them in the next lecture.