[SOUND] This
lecture is about the basic measures for evaluation of text retrieval systems.
In this lecture, we're going to discuss how we design basic
measures to quantitatively compare two retrieval systems.
This is a slide that you have seen earlier in the lecture where we talked
about the Cranfield evaluation methodology.
We can have a test collection that consists of queries, documents, and relevance judgments.
We can then run two systems on this data set to generate their retrieval
results, so that we can compare their performance.
And we raised the question of which set of results is better.
Is system A better or is system B better?
So let's now talk about how to accurately quantify their performance.
Suppose we have a total of 10 relevant documents in the collection for
this query.
Now, the relevance judgments shown on the right tell us which documents are relevant.
And we have only seen three relevant documents there, among the retrieved documents.
But we can imagine there are other relevant documents judged for this query.
So now, intuitively, we said that system
A is better because it did not have much noise.
And in particular we have seen that among the three results,
two of them are relevant but in system B,
we have five results and only three of them are relevant.
So intuitively it looks like system A is more accurate.
And this intuition can be captured by a measure called precision,
where we simply compute to what extent all the retrieved results are relevant.
If you have 100% precision,
that would mean that all the retrieved documents are relevant.
So in this case system A has a precision of two out of
three, and system B has three out of five, and
this shows that system A is better by precision.
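To make this concrete, here is a minimal sketch in Python of how precision could be computed; the function name, document IDs, and relevance judgments in the example are hypothetical, chosen only to mirror the two-out-of-three and three-out-of-five numbers above.

    def precision(retrieved, relevant):
        # precision = (# relevant documents retrieved) / (# documents retrieved)
        if not retrieved:
            return 0.0
        hits = sum(1 for doc in retrieved if doc in relevant)
        return hits / len(retrieved)

    relevant = {"d1", "d2", "d4", "d7"}          # hypothetical relevance judgments
    system_a = ["d1", "d2", "d9"]                # 2 of 3 retrieved documents are relevant
    system_b = ["d1", "d2", "d4", "d8", "d9"]    # 3 of 5 retrieved documents are relevant
    print(precision(system_a, relevant))         # 0.666... for system A
    print(precision(system_b, relevant))         # 0.6 for system B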
But we also talked about how system B might be preferred by some users
who would like to retrieve as many relevant documents as possible.
So in that case we'll have to compare the number of relevant documents that they
retrieve, and there's another measure called recall.
This measure captures the completeness of coverage of relevant documents
in your retrieval result.
So we just assume that there are ten relevant documents in the collection.
And here we've got two of them, in system A.
So the recall is 2 out of 10.
Whereas system B has retrieved three of them, so its recall is 3 out of 10.
Now we can see that by recall, system B is better.
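Continuing the same illustrative sketch from above, recall divides by the total number of relevant documents in the collection rather than by the number of documents retrieved:

    def recall(retrieved, relevant):
        # recall = (# relevant documents retrieved) / (total # relevant documents)
        if not relevant:
            return 0.0
        hits = sum(1 for doc in retrieved if doc in relevant)
        return hits / len(relevant)

    # With 10 relevant documents in the collection, as in the lecture's example,
    # system A retrieving 2 of them has recall 2/10 = 0.2, and
    # system B retrieving 3 of them has recall 3/10 = 0.3.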
And these two measures turn out to be the very basic measures for
evaluating search engines.
And they are very important because they are also widely used in many
other task evaluation problems.
For example, if you look at applications of machine learning,
you tend to see precision and recall numbers being reported for all kinds of tasks.