Now, this is a little bit complicated trace,
But if you look at it more carefully after the lecture you'll see that it's pretty
simple, setup for what we need to do. So, here's our input and we sort on the
first character. If we sort on the first character,
In this example, we have lots of s's. So the subfile for the s's, it's already
sorted on the first character. This d is the digit that we're currently
working on. So then, we sort that on, it's, the rest of it on its first
character. So it's the second character and all the
words that start with s. And then there's a lot of them that start
with e, So we move on to the third character for
those, And then there's some that start with a
and so forth. So, recursively,
Every time that we move on one character, we have to keep going until one thing we
have to do, is if there's two keys that are equal, we have to examine every
character in the key. We never find out that the two keys are
equal until the end. And if we reach the end of a string, we
can just assume that goes before any character value.
And with those two things, then you, you can trace through the recursion to see how
MSD string sort of works. So and again, in this case, it sorts by
the first character, A and B have only size once, so you don't have to do
anything. So then most of this slide is showing what
happens in the keys that start with S, and then at the end, there's the two keys that
start with T, and they have both h's and both e's, we have to go through the whole
thing. So that's an example of MSD string sort.
Now, if strings are variable length like they were in that example, we just treat
them as if they had an extra character at the end that's smaller than any character.
So they'll appear before they're supposed to appear alphabetically, before strings
that have the same starting characters that are longer.
And you can do that by overloading, you know, implementing charAt to just return
minus one, if, if we're past the character that we're looking at.
So, that's the easy part of the implementation.
One thing to point out, the in the C programming language, the representation
of strings puts an extra character that's zero at the end and those string has the
zero character. So, actually not to do anything at all.
But when that change the code for MSD string sort is also really an extremely
simple modification or extension to key index counting.
So we've got our sort and it takes its input array and then, output buffer,
And then we have to take the deliminators of the subfile we're going to sort, low
and high just like we've done for other recursive sorts.
We go ahead and we do the key index counting, again, to take care of, we have
to take care of the fact that we have an extra character that kind of the end of
string character. And also, pull out the d character of our
string just like we did for LSD sort. And we just go through the whole thing and
sort according to the given character. And then, what we do next is just do a
recursive call for every entry in the counter array where we, we just sort the
part of the array from count of r to lo plus count of r or lo plus count of r plus
one. Really just one line of curve for each
subarray. And then we move to the right one
character. So, again, this is very little code to get
a very useful and powerful sorting method, That's MSD string sorting.
One thing to notice is that, with the, extra output buffer,
We can use that even with the recursive calls,
But not the counter array, We have to keep the counter array around.
So it's part, it's gotta be local to the recursive procedure,
Cuz when we call for the next character, we're going to need, a new counter array
for that character but have to save the old one to do the next subarray for, the
calling, method. Well that turns out to be important
because there's a potential for big problems with MSD sort when we start
looking at the analysis. And the thing to notice is that if you
have a tiny array, say an array of size two, it doesn't matter how small your
array is, you need to have a counter array for each potential character in the
alphabet. So for ASCII, just to, to initialize the
counter and to set it to zero, just create it, it's going to be a hundred times
slower, then just sorting the thing or just copying it back, even if you're using
Unicode, it's 32,000 times slower. And what's worse is it's a recursive program,
so there's going to be lots of small subarrays.
Sorry. This feature or characteristic of MSD
string sort actually [INAUDIBLE], we're having a lot of applications that were
using it when people switched from ASCII to Unicode a while back.
All of a sudden programs that were really efficient sorts,
All of a sudden, became hundreds of times slower, with the switch to Unicode,
Because these big arrays in the recursive, these big arrays in the recursive
procedure. It was a, a serious performance problem,
So we definitely have to watch out for that, with, MSD string sort.
Now there is a good solution to avoid this danger and it's the same solution we've
used before. If you've got a small subarray and the
sort is going to be slower just cut off the insertion sort.
The other thing you can do is and save some more time, is to just have insertion
sort start at the character that we're currently working on.
Cuz we know things are, equal, to the left and we're just looking for the right.
And, that's easy to implement just by changing the implementation of the compare
function, to take into account which character we're at.
Notice it's quicker, And, and it happens to be quicker in Java
to just pull out the substrings and use compare to, than to go in and get the
charAt that's just a feature of the implementation.
So switching to, cutting off the insertion sort for small subarrays is definitely a
good idea for MSD string sort. And so, what about the performance of this
method? Well the key characteristic of, of MSD
string sorting, it examines just the characters that it needs to examine in
order to get the C key sorted. So, that means that its performance is
going to be really dependent on the data. Now, in the, worst case for the algorithm,
it has to examine all the data, in this case all the characters in all the
strings, and that's, for example, when they're all equal, it's going to look at
all the characters. If you have some duplicate keys, it might
have to examine all the characters in those duplicate keys, but there's plenty
of other strings that it doesn't examine all the characters. so,
That, depending on, and end of the keys are random in some way, if they, or
they're approximated by random keys. Then it's not going to examine very many
characters at all, Actually and this is a typical case, say,
for account numbers or some situation library call numbers or some situation
like that, In a case like that.
Msd string sort will examine only a small fraction of the data,
A small constant fraction of the data, And it's always going to be sublinear.
Now, it's also possible that sorts that do comparisons can be sublinear, but MSD
string sort is, you, you know, good and that it really only examines the
characters that need to be examined in order to get the sort done.
So this gives us another line on, on the table,
And the key here is that if the data is approximated by random log base r of N is
going to be pretty well approximated by a constant in the real world.
So this is going to be a fast method. The one danger is that, you have to worry
out [inaudible], worry about using too much extra space, with those big counter
arrays. But that's a important and useful,
practical sorting method. So now let's look at, MSD string sort
versus Quicksort for strings. There, it, they are similar in many ways.
They're both recursive methods that, partition a file up.
So one of the problems with MSD string sort is that, It tends to, when it's doing
the counting, it's kind of making random accesses to memory when it, particularly
when it's distributing the keys out. So on modern systems that have caches not
everything is in the fastest memory at, at the same time and programs that move from
one piece of data to another right next are, are going to be much more efficient.
And quicksorts like that, an MSD quicksort is not.