If you're building a computer vision application, rather than
training the weights from scratch, from random initialization,
you often make much faster progress if you download
weights that someone else has already trained on
a network architecture, use those as pre-training,
and transfer that to a new task that you might be interested in.
The computer vision research community has been pretty good at posting lots of
data sets on the Internet, so if you hear of things like ImageNet, or MS COCO,
or PASCAL,
these are the names of different data sets that people have posted
online, and a lot of computer vision researchers have trained their algorithms on them.
Sometimes this training takes several weeks and might require many GPUs,
and the fact that someone else has done
this and gone through the painful hyperparameter search process
means that you can often download open source weights that took someone else many weeks or
months to figure out, and use that as
a very good initialization for your own neural network.
And use transfer learning to sort of transfer knowledge from some of
these very large public data sets to your own problem.
Let's take a deeper look at how to do this.
Let's start with an example.
Let's say you're building a cat detector to recognize your own pet cat.
According to the internet,
Tigger is a common cat name and Misty is another common cat name.
Let's say your cats are called Tigger and Misty, and there's also the neither case.
You have a classification problem with three classes:
is this picture Tigger,
or is it Misty, or is it neither?
And I'm ignoring the case of both of your cats appearing in the picture.
Now, you probably don't have a lot of pictures of Tigger
or Misty so your training set will be small.
What can you do?
I recommend you go online and download some open source implementation of
a neural network and download not just the code but also the weights.
There are a lot of networks you can download that have been trained on, for example,
the ImageNet data set, which has a thousand different classes, so the network
might have a softmax unit that outputs one of a thousand possible classes.
What you can do is then get rid of the softmax layer and create
your own softmax unit that outputs Tigger or Misty or neither.
In terms of the network,
I'd encourage you to think of all of these layers as
frozen so you freeze the parameters in
all of these layers of the network and you would then just
train the parameters associated with your softmax layer,
that is, the softmax layer with three possible outputs:
Tigger, Misty, or neither.
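As a concrete sketch of this setup, here is a minimal toy version in Python with NumPy. The frozen "pre-trained" layer below uses random placeholder weights, not real downloaded ImageNet weights, and the three-class data set is fake; the point is only that the frozen part is applied in the forward pass while gradient descent updates the new three-way softmax head alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for downloaded pre-trained layers: these weights are random
# placeholders, not weights someone else actually trained.
W_frozen = rng.standard_normal((64, 10)) / np.sqrt(64)

def frozen_layers(X):
    # Frozen part of the network: used in the forward pass, never updated.
    return np.maximum(X @ W_frozen, 0.0)  # ReLU activations

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Tiny fake data set: 30 "images" flattened to 64 numbers each,
# with labels 0 = Tigger, 1 = Misty, 2 = neither.
X = rng.standard_normal((30, 64))
y = rng.integers(0, 3, size=30)
Y = np.eye(3)[y]  # one-hot labels

W_head = np.zeros((10, 3))  # the new 3-way softmax layer
for _ in range(200):
    A = frozen_layers(X)                     # forward through frozen layers
    P = softmax(A @ W_head)
    W_head -= 0.1 * A.T @ (P - Y) / len(X)   # update ONLY the head

P = softmax(frozen_layers(X) @ W_head)
loss = -np.mean(np.log(P[np.arange(30), y]))  # cross-entropy after training
```

Since only `W_head` ever appears on the left of an update, the frozen weights are bit-for-bit unchanged after training, which is exactly what freezing means.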
By using someone else's pre-trained weights,
you might get pretty good performance on this even with a small data set.
Fortunately, a lot of deep learning frameworks
support this mode of operation, and in fact,
depending on the framework, it might have something like trainableParameter = 0
that you can set for some of these early layers.
In others they just say
don't train those weights, or sometimes you have a parameter
like freeze = 1. These are
different ways that different deep learning programming frameworks let you
specify whether or not to train the weights associated with a particular layer.
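As one illustration of such a flag, in PyTorch the per-parameter switch is called `requires_grad`. The small network below is a hypothetical stand-in for a real downloaded model, but the freezing mechanics are the same: turn off gradients for the body, then give the optimizer only the new head's parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a downloaded pre-trained network; in practice
# you would load a real model together with its trained weights.
body = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
                     nn.Linear(32, 16), nn.ReLU())
for p in body.parameters():
    p.requires_grad = False  # PyTorch's version of "freeze = 1"

head = nn.Linear(16, 3)      # new output layer: Tigger, Misty, neither
opt = torch.optim.SGD(head.parameters(), lr=0.1)  # optimizer sees only the head

x = torch.randn(8, 64)               # fake batch of 8 "images"
y = torch.randint(0, 3, (8,))
before = [p.clone() for p in body.parameters()]

loss = nn.functional.cross_entropy(head(body(x)), y)
opt.zero_grad()
loss.backward()
opt.step()  # only the head's weights move; the frozen body is untouched
```

Other frameworks spell the same idea differently, for example a per-layer `trainable` attribute, but the effect is identical: the frozen layers keep the downloaded values while only the new layers learn.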
In this case, you would train
only the softmax layer's weights and freeze all of the earlier layers' weights.
One other neat trick that may help for some implementations
is that, because all of these early layers are frozen,
there is some fixed function, which doesn't change because you're not training it,
that takes the input image X and
maps it to some set of activations in that layer.
One trick that could speed up training is to just pre-compute that layer,
the features or activations from that layer, and save them to disk.
What you're doing is using this fixed function,
the first part of the neural network,
to take any input image X and compute some feature vector for it, and then you're
training a shallow softmax model from this feature vector to make a prediction.
One step that could help your computation is to just pre-compute that layer's activations
for all the examples in your training set and save them to
disk, and then just train the softmax classifier right on top of that.
The advantage of the pre-compute,
save-to-disk method is that you don't need to
recompute those activations every time
you take a pass, or epoch, through your training set.
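Here is a minimal NumPy sketch of that caching trick, again with random placeholder weights standing in for the real frozen layers: the frozen activations are computed once and saved with `np.save`, and the training loop then reads the cached features instead of re-running the frozen layers on every epoch.

```python
import numpy as np
import os, tempfile

rng = np.random.default_rng(1)

# Frozen front of the network (placeholder weights standing in for the
# fixed, pre-trained layers).
W_frozen = rng.standard_normal((64, 10)) / np.sqrt(64)

def frozen_layers(X):
    return np.maximum(X @ W_frozen, 0.0)  # ReLU activations

X_train = rng.standard_normal((100, 64))  # fake flattened training images
y_train = rng.integers(0, 3, size=100)

# Pre-compute the frozen activations ONCE and save them to disk.
cache = os.path.join(tempfile.mkdtemp(), "activations.npy")
np.save(cache, frozen_layers(X_train))

# Every epoch now reads the cached features; the expensive frozen
# layers are never run again during training.
A = np.load(cache)
Y = np.eye(3)[y_train]
W_head = np.zeros((10, 3))
for epoch in range(100):
    Z = A @ W_head
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    W_head -= 0.1 * A.T @ (P - Y) / len(A)  # train only the shallow head

Z = A @ W_head
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
loss = -np.mean(np.log(P[np.arange(100), y_train]))
```

The trade-off is disk space for compute: for a deep frozen body, skipping its forward pass on every epoch can dominate the total training time.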
This is what you do if you have a pretty small training set for your task.
But what if you have a larger training set?
One rule of thumb is: if you have
a larger labeled data set, so maybe you just have a ton of pictures of Tigger and
Misty, as well as pictures of neither of them,
one thing you could do is freeze fewer layers.
Maybe you freeze just these earlier layers and then train these later layers.
Although if the output layer has different classes, then you need to have
your own output unit anyway: Tigger, Misty, or neither.
There are a couple of ways to do this.
You could take the last few layers' weights
and just use those as initialization and do gradient descent
from there, or you can also blow away these last few layers and
just use your own new hidden units and your own final softmax output.
Either of these methods could be worth trying.
But maybe one pattern is that if you have more data,
the number of layers you freeze could be smaller, and
the number of layers you train on top could be greater.
And the idea is that if you have more data, maybe you have
enough data not just to train a single softmax unit, but to train
some modestly sized neural network comprising
the last few layers of the final network that you end up using.
Finally, if you have a lot of data,
one thing you might do is take this open source network and its weights and use
the whole thing just as initialization and train the whole network.
Although again, if the original had a thousand-way softmax output and you have just three outputs,
you need your own softmax output layer
for the labels you care about.
But the more labeled data you have for
your task, or the more pictures you have of Tigger, Misty, and neither,
the more layers you could train, and in the extreme case,
you could use the weights you download just as
initialization, so they would replace
random initialization, and then do gradient descent,
training and updating all the weights in all the layers of the network.
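To make this spectrum concrete, here is a toy NumPy sketch with an `n_frozen` knob. With `n_frozen = 2`, both layers of the tiny stand-in network stay frozen and only the new head trains (the small-data regime); with `n_frozen = 0`, the downloaded weights serve purely as initialization and every layer is fine-tuned (the large-data regime). The "pre-trained" weights here are random placeholders, and the network is deliberately tiny.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Downloaded" weights for a tiny 2-hidden-layer stand-in network
# (random placeholders for real open-source pre-trained weights).
pretrained = [rng.standard_normal((64, 32)) / np.sqrt(64),
              rng.standard_normal((32, 16)) / np.sqrt(32)]

X = rng.standard_normal((50, 64))   # fake flattened training images
y = rng.integers(0, 3, size=50)
Y = np.eye(3)[y]                    # one-hot: Tigger / Misty / neither

def fine_tune(n_frozen, steps=100, lr=0.1):
    """Initialize from the pre-trained weights, freeze the first
    n_frozen layers, and train the rest plus a new 3-way head."""
    layers = [W.copy() for W in pretrained]  # pre-trained initialization
    head = np.zeros((16, 3))
    for _ in range(steps):
        # Forward pass, remembering activations for backprop.
        acts = [X]
        for W in layers:
            acts.append(np.maximum(acts[-1] @ W, 0.0))  # ReLU layers
        Z = acts[-1] @ head
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        # Backward pass from the softmax output.
        d = (P - Y) / len(X)
        g_head = acts[-1].T @ d
        d = (d @ head.T) * (acts[-1] > 0)
        head -= lr * g_head
        for i in reversed(range(len(layers))):
            g = acts[i].T @ d
            if i > 0:
                d = (d @ layers[i].T) * (acts[i] > 0)
            if i >= n_frozen:        # more data -> freeze fewer layers
                layers[i] -= lr * g
    return layers, head

head_only, _ = fine_tune(n_frozen=2)  # small data: train only the head
full, _ = fine_tune(n_frozen=0)       # lots of data: fine-tune everything
```

In the `n_frozen = 2` run the layer weights come back exactly equal to the pre-trained values, while in the `n_frozen = 0` run they drift away from the initialization as every layer is updated.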
That's transfer learning for training ConvNets.
In practice, because the open data sets on the Internet are so big, and the weights you
can download, which someone else has spent weeks training, have learned from so much data,
you find that for a lot of computer vision applications,
you just do much better if you download
someone else's open source weights and use that as initialization for your problem.
In all the different disciplines,
in all the different applications of deep learning,
I think that computer vision is one where transfer learning is
something that you should almost always do, unless
you have an exceptionally large data set to train everything from scratch yourself.
Transfer learning is very much worth seriously considering unless you have
an exceptionally large data set and a very large computation budget
to train everything from scratch by yourself.