0:00

[MUSIC]

Hi, welcome to our course.

In this first video, we will see the basic principles that we'll use throughout this course.

Let's learn them by example.

Imagine you are running through a park and you see another man running.

And you ask yourself, why is he running?

And you come up with four different explanations.

First, he is in a hurry.

Second, he is doing some sports.

Third, he always runs.

And fourth, he saw a dragon.

Principle 1: use prior knowledge.

From our previous experience we know that dragons do not exist, and so we can exclude the fourth option from further consideration.

Principle 2: choose the answer that best explains the observations.

Imagine you saw that he is not wearing a sports suit. In this case, it's very unlikely that he's doing sports, and so we can exclude option two.

Principle 3: avoid making extra assumptions.

Of the two remaining options, the third one, that he always runs, makes a lot of extra assumptions, and so we should exclude it. This principle is also known as Occam's razor.

And finally, we are left with only one case: that he is in a hurry.

To conclude, we've seen three principles: use prior knowledge, choose the answer that best explains the observations, and avoid making extra assumptions.

1:27

Before we continue, let's review some basics of probability theory.

We define probability in the following way. Imagine you have some source of randomness, for example, a die, and you repeat an experiment multiple times. As the number of experiments goes to infinity, the probability of an event is the fraction of times that event occurred.

For example, for a fair die you would expect the event that you threw a five to have a frequency of about one-sixth, and the event that you threw an odd number to have a frequency of around one-half.
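This frequency definition can be sketched in a short simulation (the sample size and seed here are arbitrary):

```python
import random

# Simulate many rolls of a fair die and measure how often each event occurs.
# By the frequency definition, these ratios approach the true probabilities.
random.seed(0)
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]

freq_five = sum(r == 5 for r in rolls) / n     # approaches 1/6
freq_odd = sum(r % 2 == 1 for r in rolls) / n  # approaches 1/2
```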

We will consider two types of random variables, depending on which values they can take: discrete and continuous.

A discrete random variable can take either a finite number of values, as with a die, or a countably infinite number, for example, when you count how many times a certain event happened.

An example of a continuous random variable would be tomorrow's temperature.

The most convenient way to define a discrete distribution is the probability mass function. It assigns to each point a number: the probability of that point.

For example, in this case we get one point with probability 0.2, another with probability 0.5, a third with probability 0.3, and all other points with probability 0.

Also note that these probabilities sum up to 1.
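A probability mass function like the one on the slide can be sketched as a dictionary; the particular points and probabilities below are just illustrative:

```python
# A hypothetical probability mass function over three points.
pmf = {1: 0.2, 3: 0.5, 5: 0.3}

def p(x):
    # Every point outside the support has probability 0.
    return pmf.get(x, 0.0)

total = sum(pmf.values())  # must equal 1 for a valid distribution
```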

The most convenient way to define a continuous distribution is the probability density function. It assigns a non-negative value to each point.

To compute the probability that a point falls into some range, for example from a to b, you integrate this function over the given range, as shown on the slide.
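As a sketch of this idea, here is a hypothetical uniform density on [0, 2] and a simple midpoint-rule integral standing in for the integration on the slide:

```python
def density(x):
    # Uniform density on [0, 2]: non-negative everywhere, integrates to 1.
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def prob(a, b, steps=10_000):
    # P(a <= X <= b): numerically integrate the density over [a, b].
    h = (b - a) / steps
    return sum(density(a + (i + 0.5) * h) for i in range(steps)) * h

p_mid = prob(0.5, 1.5)  # exact answer for this density is 0.5
```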

We will also need the notion of independence.

Two random variables are considered independent if their joint probability, that is, the probability of X and Y, equals the product of their marginals: the probability of X times the probability of Y.

Let's see an example.

Imagine that you have a deck of 52 cards and you randomly take 2 cards from it. Let the first random variable be the picture drawn on the first card and the second be the picture drawn on the second card. These random variables are dependent, since it is impossible to take the same card twice.

Another example is throwing two coins independently. Here the probability that the first coin lands heads up and the second lands tails up equals the product of the two probabilities, and so these random variables are independent.
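Both examples can be checked with exact arithmetic. As an assumption, the card event here is specialized to "the card is an ace" to make the numbers concrete:

```python
from fractions import Fraction

# Two fair coins thrown independently: the joint probability factorizes.
p_heads = Fraction(1, 2)
p_heads_then_tails = Fraction(1, 4)
coins_independent = p_heads_then_tails == p_heads * p_heads  # True

# Two cards from a 52-card deck without replacement: dependent.
p_first_ace = Fraction(4, 52)
p_second_ace = Fraction(4, 52)  # the marginal of the second card is also 4/52
p_both_aces = Fraction(4, 52) * Fraction(3, 51)
cards_dependent = p_both_aces != p_first_ace * p_second_ace  # True
```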

4:09

The last thing we'll need is conditional probability.

We want to answer the question: what is the probability of X given that some event Y happened? It is given by the formula on the slide: the probability of X given Y equals the joint probability P(X, Y) over the marginal probability P(Y).

Let's consider an example.

Imagine you are a student and you want to pass some course. It has two exams in it: a midterm and a final. The probability that a student passes the midterm is 0.4, and the probability that a student passes both the midterm and the final is 0.25.

If you want to find the probability that you will pass the final, given that you already passed the midterm, you can apply the formula from the previous slide. This gives you 0.25 / 0.4 = 0.625, or 62.5%.
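The exam example is a one-line application of the formula:

```python
# P(final | midterm) = P(midterm and final) / P(midterm)
p_midterm = 0.4
p_midterm_and_final = 0.25

p_final_given_midterm = p_midterm_and_final / p_midterm  # 0.625
```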

We'll need two tricks for dealing with formulas.

The first is called the chain rule. We can derive it from the definition of conditional probability: the joint probability of X and Y equals the probability of X given Y times the probability of Y.

By induction, we can prove the same formula for three variables: the probability of X, Y, and Z equals the probability of X given Y and Z, times the probability of Y given Z, times the probability of Z.

In a similar way, we can obtain the formula for an arbitrary number of variables: the joint probability is the product, over all variables, of the probability of the current variable given all the previous ones.
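The two-variable chain rule can be verified exactly on a toy joint distribution (the numbers below are arbitrary but sum to 1):

```python
from fractions import Fraction

# A toy joint distribution P(X, Y) over two binary variables.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def p_y(y):
    return sum(v for (_, yy), v in joint.items() if yy == y)

def p_x_given_y(x, y):
    # Definition of conditional probability: P(X | Y) = P(X, Y) / P(Y).
    return joint[(x, y)] / p_y(y)

# Chain rule: P(X, Y) = P(X | Y) * P(Y) holds at every point.
chain_rule_holds = all(p_x_given_y(x, y) * p_y(y) == v
                       for (x, y), v in joint.items())
```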

The second trick is called the sum rule. If you want to find the marginal distribution p(X) and you know only the joint probability p(X, Y), you can integrate out the random variable Y, as given in the formula.
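For a discrete joint the integral becomes a sum; here is a sketch on an arbitrary toy distribution:

```python
from fractions import Fraction

# A toy joint distribution P(X, Y); the numbers are illustrative.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def p_x(x):
    # Sum rule: marginalize P(X, Y) by summing over all values of Y.
    return sum(v for (xx, _), v in joint.items() if xx == x)

marginal = {x: p_x(x) for x in (0, 1)}
```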

And finally, the most important formula of this course: Bayes' theorem.

We want to find the probability of theta given X, where theta are the parameters of our model. For example, if we have a neural network, theta are its parameters. X are the observations, for example, the images we are dealing with.

From the definition of conditional probability, this is the ratio between the joint probability and the marginal probability P(X). Applying the chain rule, we get the following formula: the probability of X given theta, times the probability of theta, over the probability of X.

This formula is so important that each of its components has its own name.

The probability of theta is called the prior; it expresses the prior knowledge we have about the parameters. For example, you might know that some parameters are distributed around 0.

The probability of X given theta is called the likelihood; it shows how well the parameters explain our data.

The result, the probability of theta given X, is called the posterior; it is the probability of the parameters after we have observed the data.

And finally, the term in the denominator is called the evidence.
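All four named components can be seen in a tiny discrete sketch. As an assumption, theta here is the unknown bias of a coin with two hypothetical candidate values, and we observe a single flip that lands heads:

```python
# Prior P(theta): our belief about the coin's bias before seeing data.
prior = {0.5: 0.7, 0.8: 0.3}

def likelihood(heads, theta):
    # Likelihood P(X | theta): how well each bias explains the observation.
    return theta if heads else 1.0 - theta

x = True  # observed: one flip, heads

# Evidence P(X): sum rule over all candidate parameter values.
evidence = sum(likelihood(x, t) * p for t, p in prior.items())

# Posterior P(theta | X): Bayes' theorem, likelihood * prior / evidence.
posterior = {t: likelihood(x, t) * p / evidence for t, p in prior.items()}
```

Observing heads shifts belief toward the heads-biased coin: its posterior probability is larger than its prior.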

Â [MUSIC]
