Congratulations on getting to the end of the clustering and retrieval course.

We've covered a lot of ground in this course, including a lot of really advanced material.

But through this course, you've learned some very practical algorithms for

performing clustering and retrieval, and you should be able to actually

go out there, implement these methods, and deploy them on real world problems.

But before we wrap up, let's spend a little time reflecting on what we've

learned in this course and look ahead at what's left in this specialization.

Okay, well, what have you learned in this course?

To start with, in module one, we looked at the idea of retrieval and

cast that as a problem of performing nearest neighbor search.

And the first algorithm we looked at was something that we called one nearest

neighbor search, where we just look at all of our data points, and look for

the data point that's most similar to the query point.

And we explored this specifically in the context of a document retrieval task,

where there's a whole bunch of documents out there.

There's an entire corpus of documents that we have.

And we have some article that's our query article.

It might be an article that a person is currently reading and is interested in.

And we want to search over all the other articles to find the closest article.

So we presented some algorithmic details behind

performing one nearest neighbor search.
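The brute-force scan described above can be sketched in a few lines. This is a minimal illustration, assuming toy numeric vectors and plain Euclidean distance, not the exact implementation from the course:

```python
import numpy as np

def one_nearest_neighbor(query, corpus):
    """Brute-force 1-NN: scan every point, keep the closest one."""
    best_index, best_dist = None, float("inf")
    for i, doc in enumerate(corpus):
        dist = np.linalg.norm(query - doc)  # Euclidean distance to this point
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

# Toy corpus of three 2-D "documents" and a query point
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
idx, dist = one_nearest_neighbor(query, corpus)
# idx == 0: the query exactly matches the first point
```

The cost is one distance computation per data point, which is exactly why the course later covers faster approximate methods.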

And then we presented a very straightforward extension of one

nearest neighbor search, which is k nearest neighbor search, where we simply return the k most similar articles or, generically, data points, to a given query point, instead of just the single nearest neighbor.
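Extending the brute-force scan to return k neighbors is a small change. Again a toy sketch with numeric vectors and Euclidean distance, just to make the idea concrete:

```python
import numpy as np

def k_nearest_neighbors(query, corpus, k):
    """Brute-force k-NN: indices of the k closest points, nearest first."""
    dists = np.linalg.norm(corpus - query, axis=1)  # distance to every point
    return np.argsort(dists)[:k]                    # k smallest distances

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
neighbors = k_nearest_neighbors(query, corpus, k=2)
# the two closest points are index 0 (exact match) and index 2
```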

But in both one nearest neighbor and k nearest neighbor search,

there are two really critical elements to the performance of the method.

The first is how we think about representing our data.

So in this case, in our case study that we're considering, the question is,

how are we going to represent our documents?

And then the second critical element is, how are we going to measure

the similarity, or the distance between two data points?

So again, in our case, two documents.

Well, in this course, we took a deep dive into these two

critical components of nearest neighbor search.

And to begin with, we talked about a document representation based on TF-IDF,

term frequency-inverse document frequency,

where the term frequency is simply counting words in the document.

So we look at our entire vocabulary and form a vector whose length is exactly the size of the vocabulary.

And in each index of this vector,

we simply count how many times we saw a given word.
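As a concrete sketch of this count-vector idea, using a tiny made-up vocabulary and document purely for illustration:

```python
def term_frequency(doc_words, vocabulary):
    """One count per vocabulary word, in vocabulary order."""
    return [doc_words.count(word) for word in vocabulary]

# Hypothetical four-word vocabulary and a short document
vocabulary = ["the", "cat", "sat", "mat"]
tf = term_frequency(["the", "cat", "sat", "the"], vocabulary)
# tf == [2, 1, 1, 0]: "the" appears twice, "mat" never
```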

But we talked about the fact that this can

bias our nearest neighbor search towards emphasizing very common words.

And so that's typically not desired.

So to account for this, we introduced the inverse document frequency term, which down-weights words that are used very commonly throughout the corpus.

So our TF-IDF representation simply multiplies, element by element, our TF vector with this IDF vector, and this trades off between local frequency and global rarity.
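That element-wise product can be sketched as follows. Note the idf formula here, log(N / df), is one common variant and is assumed for illustration; the exact form in the course may differ slightly:

```python
import math

def tf_idf(tf, doc_freq, num_docs):
    """Element-wise product of term counts and idf weights."""
    # idf = log(N / df): a word appearing in every document gets weight zero
    return [count * math.log(num_docs / df) for count, df in zip(tf, doc_freq)]

# Toy example: word A appears in 1 of 10 docs, word B in all 10
weights = tf_idf([3, 5], [1, 10], num_docs=10)
# the globally rare word keeps a large weight; the common one drops to zero
```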

We then turned to how we're going to compute the distance between two different articles.

And the first approach that we talked about was scaled Euclidean distance, which is just a straightforward variant of Euclidean distance.

And instead of having equal weights on every component of the vector,

we can specify a set of weights.

So for example, for our documents, we might have a representation that separates

out the title from the abstract from the main body from the conclusion.

And then based on this representation across these different components,

we could put weights that emphasize the words appearing in the title and the abstract more heavily than those in the main body.
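A minimal sketch of this weighted distance, with a hypothetical two-component representation (one number for the title, one for the body) chosen only to illustrate the up-weighting:

```python
import numpy as np

def scaled_euclidean(x, y, weights):
    """Euclidean distance with a separate weight on each component."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(np.sum(np.asarray(weights) * diff**2)))

# Hypothetical representation: [title feature, body feature].
# Weighting the title 4x makes title mismatches count more.
d_weighted = scaled_euclidean([1.0, 2.0], [0.0, 0.0], weights=[4.0, 1.0])
d_plain = scaled_euclidean([1.0, 2.0], [0.0, 0.0], weights=[1.0, 1.0])
```

With all weights equal to one, this reduces to ordinary Euclidean distance, which is the sense in which it's "just a straightforward variant."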