0:00

.finally Finally we're going to look at suffix arrays and string processing using

Â this data structure that has really played a very important role in, string

Â processing applications that would not otherwise be possible.

Â To get a feeling for the idea, we're going to look at a really old idea that you're

Â actually familiar with called keyword in context search.

Â And the idea is that you're given a text, a huge X and what you want to do is

Â pre-process it to enable fast substring search.

Â That is, you want a client to be able to give a query string and then you want to

Â give all occurrences of that query string in context.

Â So, if you look for the word, search, it will give all the occurrences of where the

Â word search occurs in context. Or better thing is another one.

Â And that's a, a very common operation. One, you're certainly familiar with it

Â from web searching, in your browser. And there's many other applications.

Â This is a pretty old idea that dates back to the late 50's, early 60's people have

Â always wanted to do this. And there's an easy way to look at it

Â called suffix sorting. The idea is you take your input string,

Â and then, form the suffixes. remember when I talked about at the beginning with

Â JAVA's string data type, you can get this done in linear time because each suffix is

Â basically a pointer back into the input string.

Â So the suffix, remember, the I suffix just start the character I and take the rest of

Â the string. So, what this does is, it gives, sort

Â keys, that, contain, kind of the, the, Pieces of the string itself in context and

Â all we do is just, run a sort on that suffix.

Â And what that sort does is, it brings if you do a search, it brings the things that

Â you're searching for, close together. And, you can use, once you've done that

Â suffix sort, you can use a binary search to find all occurrences of the string out

Â of there. That's the, basic idea of a keyword and

Â context searching. You suffix sort the text, then do binary

Â search to find the query that you're looking for.

Â And then you can, scan for wherever the binary search ends up.

Â And so, this is all the places where, the word search occurs, in the text of Tale of

Â Two Cities, And then, you can use this index, to,

Â print out the context of whatever's needed.

Â It's a fine and effective way, for solving this important practical problem.

Â And that's interesting, but I want to talk about another really important problem

Â that, it has, hugely important applications. This is the longest repeated

Â substring problem. So you're given a string of characters,

Â find the longest repeated substring. in this case, this is genomic data, and,

Â there's the example of the longest, repeated substring. And now, in scientific

Â data, the long repeated substring is often something that scientists are looking for.

Â And so, and these strings are, are huge. it might be billions of these, characters.

Â And so, it's really important, not only to know that you can find long substrings

Â efficiently, but also, that you can do it for, huge, huge strings.

Â Another example is, cryptanalysis. Where, a long repeat in a file that's

Â supposed to be, encrypted random file indicates that somebody did something

Â wrong. It might be the key to, breaking the code.

Â Another example is data compression. When you've got a file that's got a lot of

Â repeated stuff in it, you might, want to do this operation, to take advantage of,

Â of these long repeats and not keep multiple copies of them.

Â Here's another example, this for, studying or visualizing repetitions in music.

Â In this example, every time there's a, a repeat of the notes then, there's, a, an

Â arch drawn to, visualize the repeat. And it's, the arch is thick if, the number

Â of repeats is long. And it's high if the repeats are far away.

Â And this, tells, is an interesting way to analyze, repetitions, in music.

Â So, how are we going to solve this problem?

Â Very simple to state problem. Given a sequence of N characters, find the

Â longest repeated substring. As with many problems, there's, an easy,

Â brute force algorithm, that, at least gives us an idea of what, what the

Â computation is like. But it's not going to be useful for huge

Â strings. And that's you try all possibilities, all

Â pairs of indices I and J and then just compute the longest common prefix for each

Â pair. The problem with that is that if N is the

Â length of the string, you've got, N squared over two pairs.

Â And for every one of those pairs, you might have to, match them up to length D.

Â It's definitely quadratic time, more than quadratic time in the length of the

Â string. And can't be using that for, you're not

Â going to be able to use that for strings that are billions of characters long.

Â Another way to look at one way to solve the longest repeated substring, is to use

Â a suffix sort. In fact, just we talked about before.

Â So, I would take out our input string. Reform the suffixes.

Â And then sort the suffixes and that brings the long repeated substrings together.

Â So, that's a quite elegant and efficient solution to this problem. just build the

Â suffix array that's a linear time process. Do the sort.

Â We, we just saw we can do that, with n log n character, character compares.

Â And then go through and find the repeated the substrings.

Â So, it's just one passthrough to check for adjacent substrings to see which one's the

Â longest. And this is very easy to code up.

Â Here's the Java code for computing the longest repeated substring.

Â We get the length of our string out. We build our suffix array.

Â Remember that's linear time and space because of Java string implementation

Â allows us to do substring and constant time.

Â Then we go ahead and sort the suffixes, and then find the least common prefix

Â between the, adjacent suffixes in sorted order.

Â Just keep track of the max so that's longest, repeated substring.

Â And if so, for example, this sentence about such a funny, sporty, gamy, jesty,

Â joky, hoky-pokie, lad, is the longest repeated substring in the text of

Â Melville's Moby Dick. And now we run that one to, prove that

Â we've got a linear-rhythmic algorithm because there's a lot of characters in

Â that. And you're not going to find this, without

Â a good algorithm. And, and this is just a humorous approach

Â to what we've, talked about today. If you have, five scientists that are

Â looking for a long substring, in a genome, that, they might encounter, this problem,

Â with a billion nucleotides, and there are plenty of scientists that are not aware

Â of, important algorithms are the ones, like the ones that we are talking about.

Â And the one that uses the good algorithm is definitely more likely to find a cure

Â for cancer. But there is a flaw even in this, that

Â computer scientists discovered as they got into ever more complex algorithms for

Â problems like this, and that is if the, the longest repeated substring is long,

Â there's a problem. So this is just some, experiments, for,

Â finding the longest, repeated substring in various files.

Â From just a program to, the text of Moby Dick.

Â Or a chromosome with 7.1 million characters.

Â And, using, the, brute force method, you get stuck, and too slow, very soon.

Â But using, a suffix sort, you can get these jobs done, even for huge files like

Â the first ten million digits of pi. By the way, if there was a really long

Â repeated substring in the first in the digits of pi, that would be news.

Â Maybe it would help us, understand more, about this number.

Â So lots of people, write programs of this sort.

Â But the big problem is if you have a really long repeated substring, then

Â suffix sort is not going to work. Our fast algorithm, doesn't complete.

Â So what's going on? In the explanation is really simple.

Â 11:48

And so, I want to describe that, briefly. Actually, these two computer scientists,

Â one of them went on the become chief scientist of an internet company.

Â The other one went on to become chief scientist of a company that won the race

Â in sequencing the genome. In both cases, good algorithms for

Â processing long strings are very important.

Â And this is a great algorithm. Now, it's a little too complex to give all

Â the details in a lecture, but I can give a pretty good idea of how it works.

Â So the overview is that, it's kind of like an MSD, sort.

Â But, what it manages to do is double the number of characters that it looks at in

Â each passthrough of the array. And, the idea is that, you, since you're

Â doubling the number of characters that you look at each time.

Â It, it finishes in, log in time that's the size of the, suffix array.

Â If, if you double until you get n, it's log n.

Â And it turns out, what's really ingenious about the algorithm is that, you can do

Â it, do each phase in linear time. So, this is an example of how it works.

Â So, it's, we going to do a suffix sort. And then I'll, I'll talk about the least

Â common prefix as well in a minute. So, sorting the first character, while we

Â just use key, key index counting or something like that.

Â So we know how to do that, in linear time. And then the idea has given that sorted on

Â the first character. We can sort it, easily sort on the first

Â two. Well again, we could use, key index

Â counting. So now, what we can do is, double the

Â number of characters that we, involve each time.

Â So, the next phase of the Manber-Myers algorithm, now gets it sorted on four

Â characters. And, of course, as we, the more characters

Â we have sorted on at the leading part, the smaller the [INAUDIBLE] files in the

Â trading, in the trailing part. So, that's, certainly a feature.

Â And then, in this case, after we get to, eight characters, and all our sub files

Â are of size one, and we're done with the sort.

Â 14:11

Now, the key to the algorithm is, the idea that, once it's going, it uses what's

Â called an index sort, Which essentially means that it can do

Â compares in constant time in the middle of the algorithm.

Â Now, lets just take a look at, at, at how that works.

Â So lets look at, trying to compare string zero with string, nine.

Â And we know that we've already got the thing sorted on, four characters and we

Â want to sort it on eight So, zero and nine are equal on the first four characters.

Â They're in the same subfile. So, now we're going to get it sorted on

Â the other. But the thing is if we add four to our

Â given indices. So, we're at zero, and we add four.

Â That gets us to four, and we're at nine. We add four that gets us to thirteen. we

Â already know that the thing is sorted on those characters.

Â And, we know that, thirteen appears before four in this sorted list.

Â It's already sorted. And we can keep track of that by, using an

Â inverse array which says, for every index, where it appears in the sorted order.

Â So, this says that thirteen appears at position six, and four appears at position

Â seven. That is, thirteen appears before four.

Â So, when we're trying to compare, the, this, string here, that's the zeroth

Â suffix with the ninth suffix, We can go in and again, add four to get

Â four, add four to get thirteen go into the index array and see which one is less and

Â the one that appears first is going to be less.

Â So, that's a constant time, comparison. Thirteen is less than or equal to four

Â because the inverse of thirteen is less than the inverse four, which means that

Â suffixes of nine, if you take eight characters out of nine and eight

Â characters out of zero, it's less. It's a really, simple and of course,

Â Easy to implement operation. So, maintaining the inverse, I can get

Â constant time string compare. So, what does that imply, for the whole

Â suffixsort??uh Well,

Â With a constant time compare, then, I can get, at a minimum, I can get an, an analog

Â N sort, just by using quicksort or mergesort.

Â And then, I get much faster than quadratic performance no matter what the strings

Â are. That's a big key to the success of this

Â algorithm. And actually, if, if you use a version of

Â three-way quicksort, it's been proven that it even gets down to linear time for the

Â sort, No matter what the input is.

Â That's one thing, and then the other thing is when you need to do, the longest common

Â prefix to look for the, longest match after the array is sorted you can also do

Â this constant time, string compare and get the job done.

Â So this is really an in-, ingenious algorithm that, can get, suffix sorting

Â done very fast. even in the presence of, a huge repeat.

Â And I think really underscores, the importance of careful algorithmic

Â thinking, in addressing new computational challenges.

Â So, string sorting, just to summarize, is a really, interesting area of inquiry.

Â For one thing, we can develop linear time sorts for many, many applications.

Â We, we thought, we're doing well when we had, n log n sorts, but actually we can do

Â much better for many applications. And in fact, we can even get to sublinear

Â where we don't even have to examine all the data.

Â In today's world, when we have huge amounts of data, to be able to, get it

Â sorted without even looking, at it at all is really quite a miracle.

Â Three-way string quicksort,,you you can't do better than that in terms of the

Â numbers of characters that you have to, examine.

Â And it's a lot of deep mathematical analysis, behind that result.

Â But I think the other lesson to learn is that algorithms like three-way string

Â quicksort and Manber-Myers, show that we really have a lot to learn still in string

Â processing. And they're not really random.

Â In fact, a lot of times we're looking for non-randomness and we might want to use

Â specialized algorithms. So, at string processes or introduction to

Â string processing, We're going to look at many other string

Â processing examples in the coming lectures.

Â