0:04

Now we're going to look at tries, which is a class of common material objects that

Â not really has only come under study in recent years due to computer applications.

Â They're not found in classical common but

Â they're actually a very interesting rich combinatory analytic properties.

Â So I'll spend some time really talking about what tries are and

Â the application before I talk about the analysis.

Â So when we look at the tries just as a binary tree,

Â where the external nodes can be marked.

Â 0:51

Now there's a rule and that is that you can never have two void

Â nodes that are siblings of each other in a leaf.

Â So that a sibling of a void node is not void,

Â that's the rule, so that's what a trie is.

Â Now that seems kind of arbitrary but you'll see when we

Â look at applications, how these rules play a role.

Â Actually for an exercise, I might give a recursive definition of what a trie is.

Â So the most usual way we think of trie is as representing a set of good strings.

Â Each trie corresponds to a set of bitstring where each nonvoid,

Â external node represents one bitstring.

Â And you get the big string by taking the path from the root to a node,

Â taking 0 when you go left and 1 when you go right.

Â For example, if we choose 00, left, left, right,

Â right, left, we get down to that nonvoid external node.

Â And we say that node represents the bitstring that we got that defines

Â the path that we got there, 00110.

Â 2:09

Now over here, 1010, that nonvoid

Â node represents the bitstream 1010.

Â The path from the root to a node defines the bitstring.

Â So now the void nodes,

Â I have a different interpretation that derives right from this definition.

Â For example, if we go right four times and then left, we come to a void node.

Â What that means is that no string with that prefix

Â is in the set of strings represented by this trie.

Â The trie represent in this case one two, three, four, five,

Â six, seven bitstrings and in the void node represent the prefix of all

Â the bitstrings that aren't represented.

Â So that's what we think of a try corresponds to set of bitstrings or

Â another way to look at it, is it's all worked out here.

Â So this trie represents as I said,

Â seven different bitstrings, and those are shown at the top.

Â Now on the left is the bitstrings from that set that start with 0, and trie on

Â the right represent the bitstrings of that set start with 1 with the 1 script of.

Â And so recursibly going down,

Â that's another way to see the sets of bitstrings that are represented.

Â Now this only works for sets of bitstrings are said to be prefix free.

Â So that is no member of this set of bit strings is

Â a prefix of another one in that set.

Â We can handle that by using void and not void, internal nodes.

Â But in applications, I'm going to talk about that are typical,

Â it's okay to just work with prefix free sets.

Â 4:06

So for example, fixed length,

Â all those bit strings are the same length then it's prefix free.

Â Because they're all different, they're all the same length and they're different.

Â You can't have one being prefixed not another,

Â and that's a typical and useful application of tries.

Â So this is a trie that represents those three strings and

Â lots of applications of tries if you look in our algorithms

Â book you'll find trie code for sorting.

Â For simple tables with string keys and for

Â suffix arrays which I'll refer to in a minute.

Â They play a role in classic data compression algorithms Huffman code.

Â And Lempelâ€“Zivâ€“Welch compression and we'll look at the use of tries to

Â understand decision making collision resolution leader election algorithms.

Â They play a very important role nowadays in network systmes and

Â bioinformatics, Internet search, all kinds of commercial data processing.

Â Very important data structure that's often overlooked that's

Â where I'm taking the time to talk about some of these applications to

Â motivate the analysis because it's not so easy to analyze as we'll see.

Â So here's the basic application which is for symbol tables.

Â Trie represents a set of bitstrings so what we have is,

Â just going from the definition.

Â A search algorithm for

Â determining whether a given bitstring is in the set represented by the trie.

Â And the basic idea is if the leading bit of your key is 0, go to the left,

Â if it's 1, go to the right then use the remainder of the string recursively.

Â If you get to a void external node, it means that the one you're looking for

Â is not in the set represented by the trie.

Â If you get to a nonvoid external node and

Â you're at the end of your bitstring, then you report success, you did find the key.

Â So for example, let's say we're going to search for the bitstring 0011 in this tri.

Â Start with a 0, go to the left.

Â Next one's a 0, go to the left.

Â Next one's a 1, go to the right.

Â Next one's a 1, go to the right.

Â We're at the end of our string and we're on a nonvoid external node.

Â So that string is in the set of business strings represented by our trie.

Â Let's look for 10110.

Â So we start with a 1 and go to the right, 0 go to the left,

Â 1 go to the right, 1 go to the right.

Â We hit a void external node, so

Â that string is not in the set represented by our trie.

Â Very natural search algorithm have the trie represent a set of bitstring.

Â Of course everything can be represented as a bitstring,

Â so this is a natural algorithm for anything represented by in a computer.

Â 7:17

Now what about inserting new keys or

Â a new bitstring into the set represented by the trie.

Â Well what we'll do is we'll

Â insert by searching until we get to a void external node.

Â So if we wind up at a internal node or nonvoid external node that means

Â that we'll have a prefix free violation and we can deal with that in some way.

Â But insert a new key, it's going to wind up at a void external node.

Â So to insert 01110, we go left for the 0,

Â 1 right for the 1, and now we're at a void external node.

Â So now what we want to do is for each remaining bit

Â in our key that we want to insert we want to add a new internal node.

Â And with one void external child and then the other one corresponding to our bit.

Â So in this case our next bit is 1.

Â So we put if the key start at 010 then it is not in the set of strings, but

Â 011 that could be this one.

Â Then we do it again for another one, and then the next bit is 0.

Â So we put the void external node to the right and nonvoid to the left.

Â So that's we would insert 01110 into this trial.

Â 8:36

Now there are variants where you just keep track of the tail in

Â some way with pointers and people, and those are well studied,

Â and there's lots of reasons to do that sort of thing.

Â But the simplest version also is very effective and

Â that's what we'll stick with right now.

Â So that's a search algorithm and an insertion algorithm,

Â that gives the basis for simple table using the trie data structure,

Â which is a very useful algorithm.

Â And then a natural question that probably occurred to many of you is,

Â what about these void external nodes?

Â It seems kind of a waste to have all these void external nodes there in

Â this data structure.

Â And so we're going to want to analyze how many there are to make sure that we

Â understand how much space it trie it takes.

Â But it's a very compact data structure and

Â that analysis is certainly interesting and relevant in practice.

Â So here's another application of tries,

Â what we want to do is we have a given string, S.

Â And just for an example, I'll use a genomic string made up of As,

Â Cs, Ts, and Gs.

Â And these things could be huge, it would be billions of letters.

Â And we want to know is a particular substring in our string for

Â this string, is ACCTA in there?

Â And the answer is yes, starting at 0.

Â What about CCT?

Â Yeah, there's plenty of places where CCT occur in this string.

Â What about TGA?

Â No, there's no occurrence of TGA.

Â So we have a specific string that's huge and

Â we want to be able to quickly do substring search.

Â There's all kinds of application in genomics where it's important

Â to be able to do an operation like this quickly.

Â So searching genomic data, and this is also useful in Internet search.

Â When you do a Google search you not only get to the page that you're looking for,

Â but you get context you get where the substring is in that page.

Â And that uses data structure like this and many other applications.

Â 10:57

So the solution method that I'll talk about

Â is the so-called suffix multiway trie.

Â Generalizing the trie for this problem and so

Â the idea is if you have given string well what were

Â going to do is work with all the suffix of the string.

Â So the original string, ACCTA GGCCT,

Â we leave off the A, then we have CCTA, and so forth.

Â So if the string is of size N, then we have N strings for all the suffixes.

Â And we're going to treat those as different strings,

Â and just insert them into the trie, that's called a suffix trie.

Â You now notice that's prefix free, none of these is a prefix

Â of another because they're all different lengths.

Â It's a prefix for each set, and we have them

Â all end with a character that isn't found anywhere else in the trie.

Â So then the idea is that every internal node of this

Â trie corresponds to some substring of our original string.

Â So to answer the question, is x a substring of x?

Â We use the characters of our query to traverse the trie.

Â So for example, if we're looking for

Â ACCTA then [COUGH] when we get to a nonvoid external

Â node that's tells us that that one is in there and

Â tells us what position it is so we built the trie AC.

Â Now if we see something that starts with AC, then we look starting

Â with position 0 and continue our search and this CCTA is there.

Â If we encounter a void node then so

Â for example, if CCT is there because

Â we found that in the internal node.

Â But something like TGA, we end up at a void node and router there.

Â If we reach the end of the string then we're fine,

Â if we hit a void node we're fine, where a node that is not there.

Â So this is a very simple algorithm to answer this is x a substring of s

Â question, and a fine application of tries.

Â And again the number of void nodes and what does it mean to be a random trie and

Â how long is the search and so forth.

Â All of these kinds of questions are going to be important and

Â relevant in practice.

Â Here's another application of tries.

Â Tries is a model for an algorithm, so called leader election algorithm.

Â This is important in distributed systems, so

Â the idea is you have a group of individuals.

Â And they would need to elect a leader, and

Â what they're going to do is each flip a coin.

Â So it's than distributed, there could be a large number of them,

Â they'll each flip a coin and we'll count 1s as winners and 0s as losers.

Â So everybody that gets a one when they flip a coin survives for

Â the next round and the ones that get zero are gone.

Â 15:24

So obviously,

Â we are going to be interested in what's the chance of failure.

Â Well it's the probability that the right most path in a random trie

Â ends in a void node.

Â And what do I mean by a random try?

Â Well it turns out that the model that associates with this, and

Â also works for symbol tables is, it's a trie that you get by inserting

Â infinite length random bitstrings into an initially empty trie.

Â So there's three diverse applications of the trie data structure for

Â a simple table, for substring search and for distributed leader election.

Â And all of this is to motivate studying this in a combinatorial structure.

Â