In this video, we will learn to extract the semantic features
used as digital signatures for retrieval of similar images.
We will start with somewhat less sophisticated approaches and trace
the development of semantic features up to more recent results.
To search for a target image,
one first needs to compute a semantic representation from raw pixel values.
Virtually every approach to computing features for
image classification and recognition has been applied to image retrieval too.
Starting from basic color histograms,
the research then went on to using gradient histograms,
such as HOG or SIFT.
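As a minimal illustration of the histogram-based approach (a hedged sketch using NumPy only; the image arrays are placeholders), retrieval then amounts to comparing per-channel color histograms:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenated per-channel histograms of an RGB image (H, W, 3), L1-normalized."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(np.float32)
    return h / max(h.sum(), 1e-8)

def rank_by_color(query_image, collection):
    """Indices of collection images sorted by Euclidean distance to the query histogram."""
    q = color_histogram(query_image)
    dists = [np.linalg.norm(q - color_histogram(img)) for img in collection]
    return np.argsort(dists)
```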
Not surprisingly, features extracted from
deep convolutional neural networks have recently
gained attention as an effective image representation.
One of the first content-based image retrieval systems
was the QBIC system developed at IBM in 1995.
It worked in two modes: image search using
color histograms and image search using object masks specified by the user.
The second mode was possible
thanks to the existence of a hand-labeled database of objects,
where object attributes such as dimensions,
area, or the number of objects were marked.
QBIC was also implemented to search through the
10,000 images of paintings from the Hermitage Museum in Saint Petersburg.
Histograms of oriented gradients (HOG) are features used for image classification.
They are computed by splitting the image into blocks,
computing the gradients in each block and aggregating these into histograms.
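As a concrete sketch of this block-wise computation, one may delegate it to scikit-image's `hog` helper (the parameter values below are illustrative, not taken from any particular paper):

```python
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_descriptor(image):
    """HOG: gradient-orientation histograms per cell, normalized per block, concatenated."""
    gray = rgb2gray(image)               # gradients are usually computed on intensity
    return hog(gray,
               orientations=9,           # number of orientation bins per histogram
               pixels_per_cell=(8, 8),   # cell size for each local histogram
               cells_per_block=(2, 2),   # blocks over which histograms are normalized
               feature_vector=True)      # flatten into a single descriptor vector
```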
As the histogram is computed for a fixed image resolution,
such a descriptor is not ideal for a content-based image search,
where objects of interest may be present in the image at different scales.
Therefore, to describe the image as a whole,
the GIST descriptor was introduced.
It computes gradients in the image at a variety of scales,
obtained via an image pyramid or Gaussian smoothing of different strengths.
For each scale, histograms of gradients are
concatenated to form a descriptor. To describe color information,
one may use color histograms or simple averaging of colors within blocks.
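The sketch below is a simplified approximation of this idea rather than the reference GIST implementation: it smooths a grayscale image at several Gaussian scales, computes gradient-orientation histograms over a coarse grid at each scale, and concatenates everything (all parameter values are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def gist_like_descriptor(gray, sigmas=(1, 2, 4, 8), grid=4, bins=8):
    """Concatenate per-block gradient-orientation histograms computed at several smoothing scales."""
    H, W = gray.shape
    parts = []
    for sigma in sigmas:
        smooth = gaussian_filter(gray.astype(np.float32), sigma)
        gx, gy = sobel(smooth, axis=1), sobel(smooth, axis=0)
        mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx)
        for i in range(grid):
            for j in range(grid):
                ys = slice(i * H // grid, (i + 1) * H // grid)
                xs = slice(j * W // grid, (j + 1) * W // grid)
                hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                       range=(-np.pi, np.pi), weights=mag[ys, xs])
                parts.append(hist)
    d = np.concatenate(parts).astype(np.float32)
    return d / max(np.linalg.norm(d), 1e-8)
```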
The intermediate image representations learned by
deep convolutional neural networks may be used
to solve a variety of tasks in computer vision.
One of the most straightforward applications is,
in fact, retrieval of similar images.
Let us consider the feature activations induced by an image at
the last 4096-dimensional hidden layer.
If two images produce feature activation vectors with a small Euclidean separation,
we can say that the higher levels of the neural network consider them to be similar.
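In other words, once every image is represented by such an activation vector, retrieval reduces to nearest-neighbor search in Euclidean space. A minimal sketch, assuming the 4096-dimensional codes have already been extracted:

```python
import numpy as np

def retrieve(query_code, database_codes, top_k=9):
    """Rank database images by Euclidean distance between feature activation vectors."""
    dists = np.linalg.norm(database_codes - query_code[None, :], axis=1)
    return np.argsort(dists)[:top_k]   # indices of the most similar images

# query_code has shape (4096,), database_codes has shape (N, 4096).
```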
The figure on the slide shows nine images from the ImageNet collection used as
queries, along with the images that are most similar to each of them according to this measure.
Notice that, at the pixel level,
the retrieved training images are generally not close in
Euclidean distance to the query images in the first column.
For example, the retrieved ships and elephants appear in a variety of poses.
Moreover, the elephant class may have never been present in the original training data.
The crude image search algorithm
that uses deep convolutional neural networks has the following form.
Let us fix some layer in the network
whose activations are to be used as semantic image features.
For instance, we may use the output of layers five,
six, or seven in the AlexNet network prior to the ReLU transform.
Naturally, each of these high-dimensional vectors represents
a deep descriptor, or a neural code, of the input image.
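Here is a hedged sketch of the extraction step with torchvision's pretrained AlexNet; hooking `classifier[1]`, the first 4096-dimensional fully connected layer, is one way to read "layer six prior to the ReLU transform" (the weights API may differ between torchvision versions):

```python
import torch
from torchvision import models, transforms

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

captured = {}
def grab_pre_relu(module, inputs, output):
    # Clone: the following ReLU is applied in-place and would overwrite this tensor.
    captured["code"] = output.detach().clone()

# classifier[1] is the first fully connected layer; the hook captures its pre-ReLU output.
model.classifier[1].register_forward_hook(grab_pre_relu)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def neural_code(pil_image):
    """Return the 4096-dimensional neural code of a PIL image."""
    with torch.no_grad():
        model(preprocess(pil_image).unsqueeze(0))
    return captured["code"].squeeze(0)
```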
To add spatial pooling, for each image
we extract multiple sub-patches of
different sizes at different locations whose union covers the whole image.
For each extracted sub-patch,
we compute its CNN representation,
and we compute distances between
each query sub-patch and the reference image sub-patches.
The distance between the reference and the query image is then computed as
the average distance of each query sub-patch to its best-matching sub-patch of the reference image.
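A small sketch of the patch-level matching just described, assuming the per-patch neural codes have already been extracted (`query_patch_codes` and `reference_patch_codes` are hypothetical arrays of shape (number of patches, code dimension)):

```python
import numpy as np

def patch_based_distance(query_patch_codes, reference_patch_codes):
    """Average, over query sub-patches, of the distance to the best-matching reference sub-patch."""
    diffs = query_patch_codes[:, None, :] - reference_patch_codes[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)   # (n_query_patches, n_reference_patches)
    return pairwise.min(axis=1).mean()          # closest reference patch for each query patch
```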
Note that all the methods on this slide,
except the convolutional neural network,
have their representations trained
on datasets similar to those they report the results on,
while the CNN is trained on a different dataset.
Yet the results depicted in the table are competitive,
and it can be seen that CNN features,
when compared to low-memory-footprint methods,
produce consistently higher retrieval results.
The natural question arising when using
a convolutional neural network as a feature extractor is:
activations of which layer should we employ as the features for image retrieval?
In the benchmarks shown on that slide,
the whole image query is submitted against the dataset of natural images,
and the goal is to retrieve images that were taken at the same location on Earth.
Correct answers are outlined in green.
The research suggests that layers earlier in the network,
for instance the last convolutional layer,
are better suited for retrieval of similarly textured scenes such as foamy waterfalls,
presumably because of their reliance on relatively low-level texture features
rather than high-level concepts.
On the other hand, representations deeper in the network,
such as activations of the fully connected layers,
give better retrieval results when searching for objects appearing at a variety of
scales, presumably because of their reliance on higher-level concepts.
In this example, a door handle in the shape of
a lion's head is the object whose images we are looking for.
Note that only activations from the deeper layers
are abstract enough to retrieve such an object.
The neural codes in the
described methods are high-dimensional, for example,
4096 dimensions for the sixth layer of the AlexNet network,
albeit less high-dimensional than other state-of-the-art holistic descriptors.
A question of their effective compression therefore arises.
If we extract neural codes on some collection and perform PCA,
it turns out to work surprisingly well.
Thus the neural codes can be compressed to
256 or even 128 dimensions almost without any quality loss.
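A hedged sketch of this compression step with scikit-learn (the collection of codes and the target dimensionality are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_codes(codes, n_components=128):
    """Fit PCA on a collection of neural codes and return compressed, L2-normalized codes."""
    pca = PCA(n_components=n_components)
    compressed = pca.fit_transform(codes)       # (N, 4096) -> (N, n_components)
    norms = np.linalg.norm(compressed, axis=1, keepdims=True)
    return compressed / np.maximum(norms, 1e-8), pca

# The returned pca object can later project query codes into the same compressed space
# via pca.transform(query_code[None, :]).
```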
An even better representation for image retrieval is based
on the aggregation of raw deep convolutional features,
without any kind of embedding,
such as a linear or non-linear mapping; these may
be called sum-pooled convolutional features.
Each deep convolutional feature f,
computed from image I, is associated with spatial coordinates x and y,
corresponding to the spatial position of the feature in the map.
The descriptor is then obtained by summing these features of the last convolutional layer over all spatial positions.
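A minimal sketch of this sum-pooling, assuming the last convolutional feature map has already been computed with shape (channels, height, width); further post-processing such as whitening used in practice is omitted here:

```python
import numpy as np

def sum_pooled_descriptor(conv_feature_map):
    """Sum the deep convolutional features f(x, y) over all spatial positions (x, y)."""
    descriptor = conv_feature_map.sum(axis=(1, 2))                # (C, H, W) -> (C,)
    return descriptor / max(np.linalg.norm(descriptor), 1e-8)     # L2-normalize for retrieval
```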
To summarize, historically,
many handcrafted image descriptors were employed for the purposes of retrieval.
Activations that are induced in
convolutional neural networks reflect
similarity of images at different levels of abstraction.
We can compress neural codes for efficient retrieval of similar images,
and that would work even better than
compressing state-of-the-art handcrafted descriptors.