In the last video, you learned about the sliding windows object detection algorithm using a convnet, but we saw that it was too slow. In this video, you'll learn how to implement that algorithm convolutionally. Let's see what this means.

To build up towards the convolutional implementation of sliding windows, let's first see how you can turn fully connected layers in a neural network into convolutional layers. We'll do that first on this slide, and then on the next slide we'll use the ideas from this slide to show you the convolutional implementation.

So let's say that your object detection algorithm inputs 14 by 14 by 3 images. This is quite small, but just for illustrative purposes. And let's say it then uses 5 by 5 filters, and let's say it uses 16 of them to map it from 14 by 14 by 3 to 10 by 10 by 16. It then does a 2 by 2 max pooling to reduce that to 5 by 5 by 16. Then it has a fully connected layer with 400 units, then another fully connected layer, and then finally outputs Y using a softmax unit.
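The layer sizes quoted above follow from the usual "valid" convolution size formula. Here is a small sketch (plain Python, using the slide's numbers) that traces those shapes; `conv_out` is a hypothetical helper, not part of any library:

```python
def conv_out(n, f, stride=1):
    # Output side length of a "valid" convolution or pooling: (n - f) // stride + 1
    return (n - f) // stride + 1

n = conv_out(14, 5)            # 16 filters of 5x5: 14x14x3 -> 10x10x16
n = conv_out(n, 2, stride=2)   # 2x2 max pool:      10x10x16 -> 5x5x16
print(n)  # 5: the 5x5x16 volume that feeds the fully connected layers
```

The same formula will reappear when we feed larger images into this network later in the video.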

In order to make the change we'll need in a second, I'm going to change this picture a little bit: instead, I'm going to view Y as four numbers, corresponding to the class probabilities of the four classes that the softmax unit classifies amongst. And the four classes could be pedestrian, car, motorcycle, and background, or something else.

Now, what I'd like to do is show how these layers can be turned into convolutional layers. So, I'll draw the convnet the same as before for the first few layers.

And now, one way of implementing this next layer, the fully connected layer, is as a 5 by 5 filter, and let's use 400 of these 5 by 5 filters. So if you take a 5 by 5 by 16 image and convolve it with a 5 by 5 filter, remember, a 5 by 5 filter is implemented as 5 by 5 by 16, because our convention is that the filter looks across all 16 channels. So this 16 and this 16 must match, and so the output will be 1 by 1.

And if you have 400 of these 5 by 5 by 16 filters, then the output dimension is going to be 1 by 1 by 400. So rather than viewing these 400 as just a set of nodes, we're going to view this as a 1 by 1 by 400 volume. Mathematically, this is the same as a fully connected layer, because each of these 400 nodes has a filter of dimension 5 by 5 by 16, so each of those 400 values is some arbitrary linear function of the 5 by 5 by 16 activations from the previous layer.
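That equivalence can be checked numerically. The sketch below (plain Python, random illustrative values) convolves a 5 by 5 by 16 volume with one 5 by 5 by 16 filter at its single valid position, and compares that to the fully connected view of the same computation, a dot product over the flattened inputs:

```python
import random

# A 5x5x16 input volume and one 5x5x16 filter (illustrative random values).
H, W, C = 5, 5, 16
vol  = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
filt = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

# Convolution view: the filter covers the whole input, so there is exactly
# one valid position and the output is a single number.
conv_value = sum(vol[i][j][c] * filt[i][j][c]
                 for i in range(H) for j in range(W) for c in range(C))

# Fully connected view: flatten the volume and the filter's weights,
# then take a dot product (one of the 400 units' pre-activations).
flat_vol  = [vol[i][j][c]  for i in range(H) for j in range(W) for c in range(C)]
flat_filt = [filt[i][j][c] for i in range(H) for j in range(W) for c in range(C)]
fc_value  = sum(a * b for a, b in zip(flat_vol, flat_filt))

assert abs(conv_value - fc_value) < 1e-9  # the same linear function
```

With 400 such filters you get all 400 units of the fully connected layer, stacked as a 1 by 1 by 400 volume.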

Next, to implement the following layer, we're going to implement a 1 by 1 convolution. If you have 400 1 by 1 filters, then the next layer will again be 1 by 1 by 400, so that gives you this next fully connected layer. And then finally, we're going to have another 1 by 1 filter, followed by a softmax activation,

so as to give a 1 by 1 by 4 volume to take the place of the four numbers that the network was outputting. So this shows how you can take these fully connected layers and implement them using convolutional layers, so that these sets of units are instead implemented as 1 by 1 by 400 and 1 by 1 by 4 volumes.

After this conversion, let's see how you can have a convolutional implementation of sliding windows object detection. The presentation on this slide is based on the OverFeat paper, referenced at the bottom, by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun.

Let's say that your sliding windows convnet inputs 14 by 14 by 3 images, and again, I'm just using small numbers like the 14 by 14 image in this slide mainly to make the numbers and illustrations simpler. So as before, you have a neural network as follows that eventually outputs a 1 by 1 by 4 volume, which is the output of your softmax.

Again, to simplify the drawing here: the 14 by 14 by 3 input is technically a volume, as are the 10 by 10 by 16 and the later volumes. But to simplify the drawing for this slide, I'm just going to draw the front face of each volume. So instead of drawing the 1 by 1 by 400 volume, I'm just going to draw the 1 by 1 front face of it, and so on. So I've just dropped the third dimension of these drawings, just for this slide.

So let's say that your convnet inputs 14 by 14 images, or 14 by 14 by 3 images, and your test image is 16 by 16 by 3. So I've now added that yellow stripe to the border of this image.

In the original sliding windows algorithm, you might want to input the blue region into a convnet and run that once to generate a classification, 0 or 1, and then slide the window down a bit. Let's say it uses a stride of two pixels, so then you might slide the window to the right by two pixels to input this green rectangle into the convnet, run the whole convnet, and get another label, 0 or 1. Then you might input this orange region into the convnet and run it one more time to get another label, and then do it a fourth and final time with this lower right purple square.

So to run sliding windows on this 16 by 16 by 3 image, which is a pretty small image, you run this convnet four times in order to get four labels. But it turns out a lot of the computation done by these four convnets is highly duplicative. So what the convolutional implementation of sliding windows does is allow these four passes of the convnet to share a lot of computation.

Specifically, here's what you can do. You can take the convnet and run it with the same parameters, the same 16 5 by 5 filters, on the 16 by 16 by 3 image. Now you have a 12 by 12 by 16 output volume. Then do the max pool, same as before; now you have a 6 by 6 by 16 volume. Run that through your same 400 5 by 5 filters to get a 2 by 2 by 400 volume.

So now, instead of a 1 by 1 by 400 volume, we have a 2 by 2 by 400 volume. Running it through the 1 by 1 filters gives you another 2 by 2 by 400 volume instead of 1 by 1 by 400. Do that one more time, and now you're left with a 2 by 2 by 4 output volume instead of 1 by 1 by 4.

It turns out that this blue 1 by 1 by 4 subset gives you the result of running the convnet on the upper left hand 14 by 14 region of the image. This upper right 1 by 1 by 4 volume gives you the upper right result. The lower left gives you the result of running the convnet on the lower left 14 by 14 region, and the lower right 1 by 1 by 4 volume gives you the same result as running the convnet on the lower right 14 by 14 region.
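You can confirm those dimensions with the same "valid" size formula as before; `conv_out` is the same hypothetical helper, applied here to the 16 by 16 test image:

```python
def conv_out(n, f, stride=1):
    # "valid" convolution / pooling output side length
    return (n - f) // stride + 1

n = 16
n = conv_out(n, 5)            # 16 5x5 filters:        16x16x3  -> 12x12x16
n = conv_out(n, 2, stride=2)  # 2x2 max pool:          12x12x16 -> 6x6x16
n = conv_out(n, 5)            # 400 5x5x16 filters:    6x6x16   -> 2x2x400
# The 1x1 convolutions preserve the 2x2 spatial size, so the final output
# is 2x2x4: one softmax result per 14x14 window of the 16x16 image.
print(n)  # 2
```

Each cell of that 2 by 2 grid corresponds to one of the four window positions (upper left, upper right, lower left, lower right).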

And if you step through all the steps of the calculation, taking the green example: if you had cropped out just this region and passed it through the convnet on top, then the first layer's activations would have been exactly this region, the next layer's activations after max pooling would have been exactly this region, and the next layers would have been as follows.

So what this process, this convolutional implementation, does is the following: instead of forcing you to run forward propagation on four subsets of the input image independently, it combines all four into one forward computation and shares a lot of the computation in the regions of the image that are common to all four of the 14 by 14 patches we saw here.

Now let's just go through a bigger example. Let's say you now want to run sliding windows on a 28 by 28 by 3 image. It turns out, if you run forward propagation the same way, then you end up with an 8 by 8 by 4 output. This corresponds to running sliding windows with that 14 by 14 region: first on the upper left region, giving you the output corresponding to the upper left hand corner, then using a stride of two to shift the window over, one position at a time, across eight positions, which gives you this first row. Then, as you go down the image as well, that gives you all of these 8 by 8 by 4 outputs. Because of the max pooling of two, this corresponds to running your neural network with a stride of two on the original image.
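The 8 by 8 grid size follows from the same window-counting arithmetic as the convolution formula. A quick sketch (plain Python; `num_windows` is an illustrative helper):

```python
def num_windows(image_size, window_size, stride):
    # How many window positions fit along one side of the image.
    return (image_size - window_size) // stride + 1

# 28x28 image, 14x14 windows, effective stride 2 (induced by the 2x2 max pool):
grid = num_windows(28, 14, 2)
print(grid)  # 8 -> the 8x8 grid of predictions, i.e. the 8x8x4 output volume
```

Note that the stride of the sliding window is fixed by the pooling in the network, not chosen freely: one 2 by 2 max pool means each output cell moves the window by two input pixels.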

So just to recap: to implement sliding windows previously, what you would do is crop out a region, let's say this is 14 by 14, and run that through your convnet, then do that for the next region over, then the next 14 by 14 region, then the next one, then the next one, and so on, until hopefully one of them recognizes the car.

But now, instead of doing it sequentially, with this convolutional implementation that you saw in the previous slide, you can take the entire image, maybe 28 by 28, and convolutionally make all the predictions at the same time with one forward pass through this big convnet, and hopefully have it recognize the position of the car. So that's how you implement sliding windows convolutionally, and it makes the whole thing much more efficient.

Now, this algorithm still has one weakness, which is that the position of the bounding boxes is not going to be too accurate. In the next video, let's see how you can fix that problem.
