0:56

How can we use precision-recall to evaluate a ranked list?

Well, naturally, we have to look at the precision-recall at different cut-offs.

Because in the end, the approximation of the relevant document set given by a ranked list is determined by where the user stops browsing.

Right? If we assume the user sequentially browses the list of results, the user would stop at some point, and that point would determine the set.

And then that's the most important cut-off that we have to consider when we compute the precision-recall.

Without knowing where exactly the user would stop, we have to consider all the positions where the user could stop.

So, let's look at these positions.

Look at this slide: what if the user stops at the first document?

What's the precision-recall at this point?

What do you think?

1:56

Well, it's easy to see that this document is relevant. So, the precision is one out of one.

We have got one document, and it's relevant.

What about the recall?

Well, note that we're assuming there are ten relevant documents for this query in the collection, so it's one out of ten.

2:49

Now, suppose the user stops at the third document, where we have seen two relevant documents. What's the precision? Well, it's two out of three, right?

And the recall is computed the same way: two out of ten.

So, when would we see another point where the recall would be different?

Now, if you look down the list, well, it won't happen until we have seen another relevant document.

In this case that's D5; at that point, the recall is increased to three out of ten, and the precision is three out of five.

3:15

So, you can see, if we keep doing this, we can also get to D8.

And then we will have a precision of four out of eight, because there are eight documents, and four of them are relevant.

And the recall is four out of ten.
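The cut-off computations above can be sketched in a few lines of Python. This is just an illustration; the relevance pattern (relevant documents at ranks 1, 3, 5, and 8, with ten relevant documents for the query in the whole collection) follows the lecture's example.

```python
rels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # 1 = relevant at that rank (D1, D3, D5, D8)
total_relevant = 10  # relevant documents for this query in the whole collection

precision_at = []  # precision at cut-off k (stored at index k-1)
recall_at = []     # recall at cut-off k
relevant_seen = 0
for k, rel in enumerate(rels, start=1):
    relevant_seen += rel
    precision_at.append(relevant_seen / k)   # relevant seen / documents seen
    recall_at.append(relevant_seen / total_relevant)

print(precision_at[0], recall_at[0])  # 1.0 0.1 (stop at D1: 1/1, 1/10)
print(precision_at[4], recall_at[4])  # 0.6 0.3 (stop at D5: 3/5, 3/10)
```

Each pass through the loop corresponds to one place where the user could stop browsing, which is exactly the set of cut-offs discussed above.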

3:29

Now, when can we get a recall of five out of ten?

Well, in this list, we don't have it, so we would have to go further down the list.

We don't know where it is.

But, as a convenience, we often assume that the precision is zero

3:47

at all the other levels of recall that are beyond the search results.

So, of course, this is a pessimistic assumption; the actual precision would be higher, but we make this assumption,

4:22

But this is okay for the purpose of comparing two retrieval methods.

And this is for relative comparison, so it's okay if the actual number deviates a little bit from the true number.

As long as the deviation is not biased toward any particular retrieval method, we are okay.

We can still accurately tell which method works better.

And this is an important point to keep in mind.

When you compare different algorithms, the key is to avoid any bias toward either method.

And as long as you can avoid that, it's okay for you to transform these measures in any way that preserves the order.

5:09

Okay, so, as we just talked about, we can get a lot of precision-recall numbers at different positions.

So, now, you can imagine, we can plot a curve.

And this just shows that on the x-axis we show the recall.

5:23

And on the y-axis, we show the precision.

So, the recall axis is marked as 0.1, 0.2, 0.3, and so on up to 1.0.

Right? So, these are the different levels of recall.

And the y-axis also has different values, for precision.

5:45

So, we plot these precision-recall numbers that we have got as points on this picture.

Now, we can go further and link these points to form a curve.

As you'll see, we assumed all the other precisions at the higher levels of recall to be zero.

And that's why they are down here; they are all zero.

And the actual curve would probably be something like this, but, as we just discussed, it doesn't matter that much for comparing two methods.
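The plotted points, including the pessimistic zeros at the unreached recall levels, can be generated with a short sketch like this (again assuming the lecture's relevance pattern):

```python
# Build (recall, precision) points for the curve: one point per relevant
# document retrieved, then pad the unreached recall levels with the
# pessimistic zero-precision assumption.
rels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # relevant at ranks 1, 3, 5, 8
total_relevant = 10

points = []
seen = 0
for rank, rel in enumerate(rels, start=1):
    if rel:
        seen += 1
        points.append((seen / total_relevant, seen / rank))

# Pessimistic assumption: zero precision at every recall level
# beyond what the result list reaches.
for missed in range(seen + 1, total_relevant + 1):
    points.append((missed / total_relevant, 0.0))

print(points)
```

Linking these points in order of increasing recall gives exactly the curve described above, with the tail lying flat on the x-axis.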

6:58

At the same level of recall here, you can see the precision of system A is better than that of system B.

So, there's no question.

Now, here, you can imagine, what would the curve look like for an ideal search system?

Well, it has to have perfect precision at all the recall points, so it has to be this line.

That would be the ideal system.

In general, the higher the curve is, the better, right?

The problem is that we might see a case like this.

This actually happens often: the two curves cross each other.

7:38

Now, this is a real problem that you actually might face.

Suppose you build a search engine, and you have an old algorithm, shown here in blue, as system B.

And you have come up with a new idea.

And you test it.

And the results are shown in red, as curve A.

8:05

Or, more practically, do you have to replace the algorithm that you're already using in your search engine with the new algorithm?

So, should we use method A to replace method B?

This is going to be a real decision that you have to make.

If you make the replacement, the search engine would behave like system A here, whereas if you don't, it would be like system B.

So, what do you do?

8:36

Now, if you want to spend more time thinking about this, pause the video.

It's actually very useful to think about it.

As I said, it's a real decision that you have to make if you are building your own search engine, or if you're working for a company that cares about search.

8:52

Now, if you have thought about this for a moment, you might realize that, well, in this case, it's hard to say.

Some users might like system A; some users might like system B.

So, what's the difference here?

Well, the difference is just that, in the low recall region, system B is better.

There's a higher precision.

But in the high recall region, system A is better.

9:20

Now, that also means it depends on whether the user cares about high recall, or low recall but high precision.

You can imagine someone who just wants to check out what's happening today and find something relevant in the news.

9:47

On the other hand, think about a case where a user is, say, researching a problem.

You want to find out whether your idea has been studied before.

In that case, you emphasize high recall.

So, you want to see as many relevant documents as possible.

Therefore, you might favor system A.

So, that means, which one is better?

That actually depends on the users, and more precisely, the users' task.

10:29

You have to look at the overall picture.

Yet, as I said, when you have a practical decision to make, whether to replace one algorithm with another, then you may have to actually come up with a single number to quantify each method.

Or, when we compare many different methods in research, ideally we have one number to compare them with, so that we can easily make a lot of comparisons.

So, for all these reasons, it is desirable to have one single number to measure each method.

So, how do we do that?

And that needs a number to summarize the whole range.

So, here again is the precision-recall curve, right?

And one way to summarize this whole ranked list, or this whole curve, is to look at the area underneath the curve.

11:26

This particular way of measuring has been very popular, and has been used since a long time ago for text retrieval. And it's called the average precision.

Basically, we're going to take a look at every different recall point.

11:47

And then look at the precision.

So, we know this is one precision.

And this is another, with a different recall.

Now, this one we don't count, because the recall level is the same; instead, we're going to look at this number, and that's the precision at a different recall level, et cetera.

So, we have all of these added up.

These are the precisions at the different points, corresponding to retrieving the first relevant document, the second, and then the third, et cetera.

Now, we missed many relevant documents, so in all of those cases, we just assume that they have zero precision.

13:02

Right, so, if we divided this by four, it would actually not be very good.

In fact, you would be favoring a system that retrieves very few relevant documents, as in that case the denominator would be very small.

So, that would not be a good measure.

So, note that this denominator is ten, the total number of relevant documents.

And this basically computes the area underneath the curve.

And this is the standard method used for evaluating a ranked list.
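As a sketch, the average precision for the lecture's example (relevant documents at ranks 1, 3, 5, and 8, out of ten relevant documents in the collection) can be computed like this:

```python
def average_precision(rels, total_relevant):
    """Sum the precision at each rank where a relevant document is
    retrieved, then divide by the TOTAL number of relevant documents
    (not just the number retrieved), so that missed relevant
    documents count as zero-precision terms."""
    relevant_seen = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            relevant_seen += 1
            precision_sum += relevant_seen / rank
    return precision_sum / total_relevant

# Relevant at ranks 1, 3, 5, 8; ten relevant documents overall.
ap = average_precision([1, 0, 1, 0, 1, 0, 0, 1, 0, 0], 10)
# ap = (1/1 + 2/3 + 3/5 + 4/8) / 10, roughly 0.277
```

Dividing by ten rather than four is exactly the point made above: the six missed relevant documents contribute zeros, so a system cannot score well just by retrieving very few documents.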

13:41

Note that it actually combines recall and precision.

First, we have precision numbers here; but secondly, we also consider recall, because if we missed many relevant documents, there would be many zeros here.

All right, so it combines precision and recall.

And furthermore, you can see this measure is sensitive to a small change in the position of a relevant document.

Let's say, if I move this relevant document up a little bit, it would increase the average precision.

Whereas, if I move any relevant document down, let's say I move this relevant document down, then it would decrease the average precision.

So, this is very good, because it's very sensitive to the ranking of every relevant document.

It can tell small differences between two ranked lists.

And that is what we want; sometimes one algorithm only works slightly better than another, and we want to see this difference.

In contrast, let's look at the precision at ten documents.

If we look at this whole set, well, what's the precision, what do you think?

Well, it's easy to see that it's four out of ten, right?

So, that precision is very meaningful, because it tells us what the user would see.

So, that's pretty useful, right?

So, it's a meaningful measure from a user's perspective.

But, if we use this measure to compare systems, it wouldn't be good, because it wouldn't be sensitive to where these four relevant documents are ranked.

If I move them around, the precision at ten is still the same.

Right. So, this is not a good measure for comparing different algorithms.

In contrast, the average precision is a much better measure.

It can tell the difference between ranked lists in subtle ways.
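A small sketch makes this contrast concrete: two hypothetical rankings with the same four relevant documents in the top ten have identical precision at ten, but different average precision, because one ranks the relevant documents higher.

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant):
    """Sum of the precision at each relevant document's rank,
    divided by the total number of relevant documents."""
    seen, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            seen += 1
            total += seen / rank
    return total / total_relevant

list_a = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # relevant at ranks 1, 3, 5, 8
list_b = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]  # same count, ranked lower

# Precision at ten cannot tell the two lists apart...
print(precision_at_k(list_a, 10), precision_at_k(list_b, 10))  # 0.4 0.4
# ...but average precision can: list_a ranks relevant documents higher.
print(average_precision(list_a, 10) > average_precision(list_b, 10))  # True
```

This is exactly the sensitivity argument above: moving a relevant document up or down changes average precision, while precision at ten stays frozen at 0.4.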
