0:02

Now we're going to take a look at what happens when we have significant numbers of duplicate keys, which is not at all unusual in practical applications. In fact, often the purpose of a sort is to bring items with equal keys together, as in the example I gave where we had records with cities and times: there's a lot of detailed data, and maybe the whole goal of the sort is to group the records by city so we can ship the data for each city out to that city. There are plenty of other examples like that in data processing, where we might want to remove duplicates from a mailing list, or sort all the job applicants by the college they attended. So sorts are often run on huge files with huge numbers of duplicate keys, and it's worthwhile to take a careful look at what the implications are. The typical characteristics: a huge file, but a small number of different key values. So let's look at our algorithms in

that situation. Mergesort doesn't much care what the key values look like: we can show that Mergesort always uses between 1/2 N log N and N log N compares. Quicksort is another story: up until the 1990s, the most widely used implementation took quadratic time for files with large numbers of equal keys, and that was actually discovered by a user of the library. It was the standard Quicksort that was in almost all the textbooks: if you did not stop the partitioning scans on equal keys, it would run in quadratic time. So we want to do better than that. The mistake happens if we put all the items equal to the partitioning item on one side, which is a natural way to implement it. The consequence is that if all the keys are equal, partitioning doesn't really accomplish anything: it just peels off one key, so a file of size n leaves a subfile of size n-1, then n-2, and so forth, and the result is a quadratic-time algorithm. Our implementation stops the partitioning scans on items equal to the partitioning item, so when all the keys are equal, it divides the file exactly in the middle, and in that case you get about N log N compares. But actually, when you

think about it, why not just put all the items equal to the partitioning item in place? That's really the desirable way to look at it, so let's take a look at that option. The goal is so-called three-way partitioning. We want to get the array into three parts, so now we have two pointers into the middle: one is the boundary between the keys that are less than the partitioning element and those that are equal to it; the other is the boundary between the keys that are equal to the partitioning element and those that are greater. In the middle are all the equal keys, and that's the arrangement we'd like to reach. Until the 1990s, the conventional wisdom among people implementing systems was that this wasn't worth doing. But it's actually a problem that Edsger Dijkstra had proposed back in the 1970s as an example programming problem. He was really interested in analyzing the correctness of programs, and in showing how you could convince yourself that a program like this operates as expected. In the 1990s, we figured out that this really was going to be an effective way to sort; that came out of taking a look at the qsort that a user found was broken, and now this method is incorporated into plenty of system sorts. So let's take a look at how it works with a demo. It's a bit more complicated than standard Quicksort partitioning,

because there's more to do. So now we have our i pointer, which sits at the left end of the stuff we haven't seen yet, and we have two other pointers that maintain these boundaries: everything to the right of gt is known to be greater than the partitioning element, everything to the left of lt is known to be less, and everything between lt and i is known to be equal. All the method does is maintain this invariant, so let's do an example or two and see how that works. So we increment i and then figure out what to do. Now it's pointing to a key less than the partitioning element.

5:15

So what we want to do is exchange it with lt. Now we have one key that's definitely less than the partitioning element, and lt is pointing at the partitioning element.

So now we increment both lt and i; that's the first case. Now we have a second item, b, that's less than the partitioning element, so again we exchange with lt and increment them both. We're moving the items smaller than the partitioning element to the left of lt and keeping lt pointing at the partitioning element. Now, what about when we get one that's greater than the partitioning element? In that case, we exchange it with the one over at the right, at gt, and decrement gt.

So now we have one item over at the right that we know is greater than the partitioning element. Notice that we didn't increment i, because the element z that came over from the right hasn't been compared to the partitioning element yet. But we did decrement gt, so we made progress. Now i is pointing to a bigger one, so we exchange it with the one at gt and decrement gt again.

And again, y is bigger, so exchange it and decrement gt. Now we have the c; that one's smaller, so that's the first case: we move it to the left of lt and increment both lt and i. W is a bigger one, so it goes over to the right. Now we have i pointing to an element that's equal to the partitioning element. What are we supposed to do then? Well, to maintain the invariant, we just need to increment i. And again it's pointing to one that's equal to the partitioning element, so increment i.

And now one more time: it's pointing to one that's greater, so we exchange that with gt and decrement gt, and then i is pointing to the one that was there, and that one's smaller, so we exchange it with lt and increment both i and lt. Now the pointers have crossed: with i and gt crossed, there's nothing left that we haven't examined, so our partitioning is complete. Here's a

trace of Dijkstra's 3-way partitioning for his problem, which is when there are just three different values in the file. In our problem, we treat keys equal to the partitioning element as one value, those less than it as another, and those greater than it as a third. This trace illustrates how we always make some progress and eventually get the file sorted. The implementation is amazingly simple. The whole partitioning process for three-way partitioning, in a modern programming language like Java, simply maintains the invariant described in the demo. If we find one that's less, we exchange a[i] with a[lt] and increment them both. If it's greater, we exchange a[i] with a[gt] and decrement gt. Otherwise, we increment i.
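The three cases just described can be sketched in Java roughly as follows (a minimal sketch; the class and method names are mine, not the lecture's code):

```java
import java.util.Arrays;

// Sketch of Dijkstra 3-way partitioning (names are mine).
// Invariant during the loop:
//   a[lo..lt-1] < v   a[lt..i-1] == v   a[i..gt] not yet seen   a[gt+1..hi] > v
public class ThreeWay {

    public static void sort(int[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1;
        int v = a[lo];                              // partitioning item
        while (i <= gt) {
            if      (a[i] < v) swap(a, lt++, i++);  // less: grow the left region
            else if (a[i] > v) swap(a, i, gt--);    // greater: send to the right
            else               i++;                 // equal: leave it in place
        }
        sort(a, lo, lt - 1);                        // keys less than v
        sort(a, gt + 1, hi);                        // keys greater than v
        // keys equal to v are already in their final position
    }

    private static void swap(int[] a, int x, int y) {
        int t = a[x]; a[x] = a[y]; a[y] = t;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 3, 2, 3, 1, 2, 3};
        sort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a));     // [1, 1, 2, 2, 3, 3, 3, 3]
    }
}
```

Note that the equal keys are never touched again once the scan passes them, which is why a big run of duplicates costs only one pass.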

It could hardly be simpler; in fact, it's simpler in many ways than the standard implementation of Quicksort. There's an argument for just using this implementation of Quicksort and forgetting about the others, because it performs so well in so many practical situations. Here's a visual trace of 3-way Quicksort for a situation with plenty of equal keys. When there are a lot of equal keys, there are going to be places where one of them is chosen as the partitioning element, and then a big chunk of the array gets handled just in the partitioning process.
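To make "a big chunk of the array gets handled just in the partitioning process" concrete, here is a rough compare-counting experiment (the harness and names are mine, assuming the same partitioning scheme): on an all-equal array, a single linear pass finishes the whole sort.

```java
// Rough experiment (names are mine): count compares made by 3-way
// quicksort on an array where every key is equal. The first pass
// classifies every item as "equal", both recursive calls get empty
// subfiles, and the total is n - 1 compares.
public class EqualKeysDemo {
    static long compares = 0;

    static int cmp(int x, int y) { compares++; return Integer.compare(x, y); }

    static void sort(int[] a, int lo, int hi) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1, v = a[lo];
        while (i <= gt) {
            int c = cmp(a[i], v);
            if      (c < 0) { int t = a[lt]; a[lt] = a[i]; a[i] = t; lt++; i++; }
            else if (c > 0) { int t = a[i]; a[i] = a[gt]; a[gt] = t; gt--; }
            else            i++;
        }
        sort(a, lo, lt - 1);
        sort(a, gt + 1, hi);
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] a = new int[n];
        java.util.Arrays.fill(a, 7);     // a million duplicate keys
        sort(a, 0, n - 1);
        System.out.println(compares);    // prints 999999, i.e. n - 1
    }
}
```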

Â 9:20

Now, there are actually some deeper reasons why this method is important. One thing to realize is that the lower bound we talked about before depended on the keys being distinct: the worst case for the lower bound is when the keys are all distinct. If we have a situation where there are a lot of equal keys, that model is wrong. It's not too difficult to get a similar lower bound for the model where we know there are equal keys. For example, there's a rather complicated formula, but not too bad, saying in essence that if you know the i-th key value occurs x_i times, you can write down a lower bound on the number of compares that are going to be required in the worst case. What's interesting about three-way partitioning is that the number of compares it uses matches this lower bound to within a constant factor. So it's entropy-optimal, and what that means is that whatever the distribution of equal keys, this method is going to use a number of compares proportional to the best you could possibly do. The proof of this fact is quite beyond the scope of this course, but it's still an important fact. The bottom line is that if you randomize the order and use three-way partitioning, then there are lots of applications where your sort routine is going to be linear, not N log N, so it will be much faster than Mergesort; and that holds for a really broad class of applications. So taking a careful look at equal keys is something that can lead us to very efficient Quicksort implementations.
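As a footnote, the "rather complicated formula" alluded to above is, in standard notation, the multinomial lower bound (reconstructed here from the standard information-theoretic analysis, not quoted from the lecture slide):

```latex
% If the i-th of k distinct key values occurs x_i times in a file of
% size N = x_1 + x_2 + ... + x_k, any compare-based sort needs at least
\lg\frac{N!}{x_1!\,x_2!\,\cdots\,x_k!}
  \;\sim\; -\sum_{i=1}^{k} x_i \lg\frac{x_i}{N}
% compares in the worst case: ~ N lg N when all keys are distinct,
% and linear in N when there is only a constant number of distinct values.
```

Three-way Quicksort, after a random shuffle, meets this bound to within a constant factor, which is the entropy-optimality claim above.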