In the previous video,

you saw the basic blocks of implementing a deep neural network.

A forward propagation step for each layer,

and a corresponding backward propagation step.

Let's see how you can actually implement these steps.

We'll start with forward propagation.

Recall that what this will do is input a[l-1] and output a[l],

and the cache z[l].

And we just said that an implementational point of view,

maybe where cache w[l] and b[l] as well,

just to make the functions come a bit easier in the problem exercise.

And so, the equations for this should already look familiar.

The way to implement a forward function is just this equals w[l] * a[l-1] + b[l],

and then, a[l] equals deactivation function applied to z.

And if you want to vectorize implementation,

then it's just that times a[l-1] + b,

adding b, being a hyper-broadcasting,

and a[l] = g applied element-wise to z.

And you remember, on the diagram for the forward step,

remember we had this chain of boxes going forward,

so you initialize that with feeding an a[0],

which is equal to X.

So, you initialized this.

Really, what is the input to the first one, right?

It's really a[0] which is the input features to either for one training sample,

if you're doing one example at a time,

or A[0], the entire training set,

if you are processing the entire training set.

So that's the input to the first four functions in the chain,

and then just repeating this allows you to

compute forward propagation from left to right.

Next, let's talk about the backward propagation step.

Here, your goal is to input da[l],

and output da[l-1] and dW[l] and db.

Let me just right out the steps you need to compute these things: dz[l] = da[l],

element-wise product with g[l]` z[l],

and then, to compute the derivatives

dW[l] = dz[l] * a[l - 1].

I didn't explicitly put that in the cache but it turns out,

you need this as well.

And then, db[l] = dz[l], and finally,

da[l-1] = w[l]_transpose * dz[l], okay?

And, I don't want to go through the detailed derivation for this,

but it turns out that if you take this definition for da and plug it in here,

then you get the same formula as we had in the previous week,

for how you compute dz[l] as a function of the previous dz[l], in fact, well,

If I just plug that in here,

you end up that dz[l] = w[l+1]_transpose dz[l+1] * g[l]` z[l].

I know this looks like a lot of algebra,

You can actually double check for yourself that

this is the equation we have written down for

back propagation last week when we

are doing a neural network with just a single hidden layer.

And as reminder, this time, this element-wise product,

and so all you need is those four equations to implement your backward function.

And then finally, I'll just write out a vectorized version.

So the first line becomes dz[l] = dA[l],

element-wise product with g[l]` of z[l].

Maybe no surprise there.

dW[l] becomes 1/m, dz[l] * a[l-1]_transpose and then,

db[l] becomes 1/m np.sum dz[l],

then, axis = 1, keepdims = true.

We talked about the use of np.sum in the previous week to compute db.

And then finally, dA[l-1] is W[l]_transpose * dz[l].

So this allows you to input this quantity, da, over here,

and output dW[l], db[l],

the derivatives you need,

as well as dA[l-1], right? As follows.

So that's how you implement the backward function.

So just to summarize,

take the input X,

you might have the first layer,

maybe has a ReLU activation function.

Then go to the second layer,

maybe uses another ReLU activation function,

goes to the third layer,

maybe has a Sigmoid activation function if you're doing binary classification,

and this outputs y_hat.

And then, using y_hat,

you can compute the loss,

and this allows you to start your backward iteration.

I'll draw the arrows first, okay?

So I don't have to change pens too much.

Where you will then have back-prop compute the derivatives, to compute

dW[3], db[3], dW[2], db[2], dW[1], db[1],

and along the way you would be computing,

I guess, the cache would transfer z[1], z[2], z[3],

and here you pause backward da[2] and da[1].

This could compute da[0],

but we won't use that.

So you can just discard that, right?

And so, this is how you implement forward-prop and back-prop for

a three layer neural network.

Now, there's just one last detail that I didn't

talk about which is for the forward recursion,

we will initialize it with the input data X.

How about the backward recursion?

Well, it turns out that da[l],

when you're using logistic regression,

when you're doing binary classification,

is equal to y/a + 1-y/1-a.

So it turns out that the derivative for the loss function,

respect to the output,

with respect to y_hat, can be shown to be what it is.

If you're familiar with calculus,

If you look up the loss function l,

and take the riveters, respect to y_hat or respect to a,

you can show that you get that formula.

So this is the formula you should use for da for the final layer, capital L.

And of course, if you were to have a vectorized implementation,

then you initialize the backward recursion,

not with this but with dA for the layer l,

which is going to be the same thing for the different examples,

over a, for the first training example, + 1-y,

for the first training example,

over 1-a, for the first training example,

...down to the end training example, so 1-a[m].

So that's how you implement the vectorized version.

That's how you initialize the vectorized version of back propagation.

So you've now seen the basic building blocks of

both forward propagation as well as back propagation.

Now, if you implement these equations,

you will get a correct implementation of

forward-prop and back-prop to get you the derivatives you need.

You might be thinking, while there was a lot of equation,

I'm slightly confused, I'm not quite sure I see how this works.

And if you're feeling that way, my advice is,

when you get to this week's programming assignment,

you will be able to implement these for yourself,

and they will be much more concrete.

And I know there is lot of equations,

and maybe some equations didn't make complete sense,

but if you work through the calculus,

and the linear algebra, which is not easy,

so feel free to try,

but that's actually one of the more difficult derivations in machine learning.

It turns out the equations roll down,

or just the calculus equations for computing the derivatives specially in back-prop.

But once again, if this feels a little bit abstract,

a little bit mysterious to you,

my advice is, when you've done the primary exercise,

it will feel a bit more concrete to you.

Although I have to say, even today,

when I implement a learning algorithm, sometimes,

even I'm surprised when

my learning algorithms implementation works and it's because a lot of

complexity of machine learning comes from the data rather than from the lines of code.

So sometimes, you feel like,

you implement a few lines of code,

not quite sure what it did,

but this almost magically works,

because a lot of magic is actually not in the piece of code you write,

which is often not too long.

It's not exactly simple,

but it's not ten thousand,

a hundred thousand lines of code,

but your feeding so much data that sometimes,

even though I've worked in machine learning for a long time,

sometimes, it still surprises me a bit when

my learning algorithm works because a lot of complexity of your learning algorithm

comes from the data rather than

necessarily from your writing thousands and thousands of lines of code.

All right. So, that's how you implement deep neural networks.

And again, this will become more concrete when you done the primary exercise.

Before moving on, in the next video,

I want to discuss hyper parameters and parameters.

It turns out that when you're training deep nets,

being able to organize your hyper parameters well

will help you be more efficient in developing your networks.

In the next video, let's talk about exactly what that means.