0:14

So again, it's the basis for understanding the performance of lots of large-scale and

important applications that are part of our infrastructure right now.

One of the big questions is, how much space does a trie take?

Well, it's the total number of external nodes.

And if you subtract the number of strings you're representing,

you get the number of extra nodes.

That number of void nodes is really the extra space that you're using.

What about the expected search cost?

Well, as with BSTs, that's the external path length:

the distance from the root to all the external nodes, the average distance.

The average external path length of this trie is 3.92.

That's the search cost.

Now you notice that even if half your nodes are void,

that's only going to add half to this extra cost.

So that's part of the reason that tries are effective.

You can have plenty of void nodes;

it's not going to affect the search cost all that much.

And then leader election, that's the length of the rightmost path,

as I talked about before.

It turns out we can analyze all of these.

I'm going to talk about external path length, and the others will be exercises.

And again, the easiest model, the usual model,

is to think of the trie as built by inserting N infinite random bitstrings.

Eventually they're going to differ.

And where they differ, that's where you're going to get a non-void node.

That node might represent an infinite tail, but we don't care.

The bitstrings distinguish themselves.

Eventually the trie represents them,

distinguishing them on their leading bits.

And from an analysis standpoint, it makes for a reasonable model.

2:17

Because we can assume at every node that we randomly go to the left or

to the right.

And this model fits real data perfectly well, where it's been studied.

The starting point, if you think back to the analysis that we did for

binary search trees, is very similar.

It's just that we have a trie of size N,

with k nodes on the left and N - k on the right, for some value of k.

The difference is, well, there are two differences.

One is the probability that there are k nodes on the left,

3:00

which in the case of binary search trees is just 1/N.

And the other difference is that you might have no nodes on the left and

N nodes on the right.

That's just a technical thing.

It comes directly from the definition of tries, but

it makes the recurrence just a tiny bit trickier.

3:21

So that's the recurrence down at the bottom.

When you have N random strings,

what's the probability that k of them start with a zero bit?

It's just N choose k divided by 2 to the N.

Then for the external path length of a trie, you look at the root.

That adds N to the external path length, because there are N nodes down below it.

And then, with probability N choose k over 2 to the N, there are k on the left,

and we add the external path lengths C of k and C of N - k,

with the appropriate starting conditions.

And again, it's just a little tricky to realize that there's a C of N on

the right-hand side when k = 0.
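Written out (my reconstruction from the spoken description; the slide itself isn't visible here), the recurrence for the expected external path length C_N is:

```latex
C_N \;=\; N \;+\; \frac{1}{2^{N}}\sum_{0\le k\le N}\binom{N}{k}\bigl(C_k + C_{N-k}\bigr)
    \;=\; N \;+\; \frac{1}{2^{\,N-1}}\sum_{0\le k\le N}\binom{N}{k}\,C_k ,
\qquad N \ge 2,\quad C_0 = C_1 = 0 .
```

The k = 0 term (equivalently k = N in the symmetric form) carries C_N itself, which is the little technicality just mentioned.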

So for BSTs, the probability that there are k on the left is 1/N.

We looked at random Catalan trees,

where that probability is a ratio of Catalan numbers,

and we had the Catalan distribution.

And for tries, it's a binomial distribution.

So that's the recurrence, and

we can use it to calculate small values or the distribution.

But that's the one that we want to solve.
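As a sanity check, the recurrence can be unwound numerically; here's a Python sketch (my own, not from the lecture). Since C_N also appears on the right-hand side, we solve for it at each step:

```python
from math import comb, log2

def trie_epl(nmax):
    """Expected external path length C_N of a trie on N random bitstrings,
    from the recurrence
        C_N = N + 2^(1-N) * sum_k binom(N,k) * C_k   (N >= 2),
    with C_0 = C_1 = 0.  The k = N term contains C_N, so we solve for it."""
    C = [0.0, 0.0]
    for N in range(2, nmax + 1):
        s = sum(comb(N, k) * C[k] for k in range(N))      # the k < N terms
        C.append((N + s / 2 ** (N - 1)) / (1 - 2 ** (1 - N)))
    return C

C = trie_epl(100)
print(C[2], C[3], C[100] / 100 - log2(100))   # C_2 = 4, C_3 = 8
```

The last number, C_N/N minus lg N, settles near a constant as N grows, which previews the asymptotic result we're heading toward.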

And again, just by way of comparison to the distributions that we looked at for

random Catalan trees and for binary search trees:

for binary search trees, the distribution is totally flat, and it's

equally likely for any node to be at the root.

For Catalan trees, remember, we found that they were very likely

to have only a small number of nodes on one side or the other.

So the distribution was very skewed, and the trees very skinny.

For AVL trees, we're trying to make them more balanced.

5:15

And the distribution shows this

amazing pattern that we don't yet have characterized analytically.

And for tries, the probability that the root splits at k is binomial.

And that's really the characteristic that we're looking for,

the one that tells us that these trees are going to be pretty well balanced.

Because N over 2 is the most likely split,

and within about square root of N of N over 2 is where the divisions are mostly

going to fall, which means the trees are going to be pretty well balanced.

But let's look at that analytically.

Now, I'm not going to cover every detail in great detail,

but I think you'll see how we go from one step to the next.

And you can check the details offline.

So this is the basic recurrence.

We're going to use an exponential generating function for

this, and the reason is the N choose k term.

If you divide the equation by N factorial,

then you get C of N over N factorial on the left.

Multiply by z to the N and sum, and you get C(z),

the exponential generating function.

And then the N choose k part of it is a convolution

6:41

of C of k over k factorial and 1 over k factorial.

And without very much calculation at all, you can verify this formula for

C(z): divide by N factorial,

multiply by z to the N, and sum both sides.

It pretty quickly comes down to this formula.

And you can also actually get this directly through the symbolic method,

with an appropriate operation having to do with the way the trie is defined.

So it's a very simple formula for the generating function:

a convolution of e to the z over 2 and C of z over 2.

That's the generating function equation.
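In symbols, as I work it out from the recurrence (the z(e^z - 1) term accounts for the N on the right and the C_0 = C_1 = 0 initial conditions; a reconstruction, since the slide isn't visible here):

```latex
C(z) \;=\; \sum_{N\ge 0} C_N\,\frac{z^{N}}{N!}
\;=\; z\,\bigl(e^{z}-1\bigr) \;+\; 2\,e^{z/2}\,C\!\left(\tfrac{z}{2}\right).
```

The 2 e^{z/2} C(z/2) term is the convolution mentioned above.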

It can't be characterized as the simplest of generating

function equations; we're not going to be able to solve it

explicitly, because we have that z over 2 in the argument.

But one way to deal with it is to just iterate.

So we apply the same equation to C of z over 2:

just plug the equation into itself.

Then the z e to the z term picks up z times (e to the z minus e to the z over 2),

and we're left with 4 e to the 3z over 4 times C of z over 4.

We iterate, and the argument gets smaller,

and smaller, and we simplify.

So we're just collecting terms, but we still have an equation that we can iterate.

And in another step, you can see the pattern:

what we're going to wind up with is z e to the z in every case,

minus exponentials with arguments z/2, 3z/4, 7z/8, and so forth, approaching z.

So we get a sum of terms of the form z times (e to the z minus e to the z/2),

z times (e to the z minus e to the 3z/4), and so on.

Strictly, you have to prove that the remainder term really goes away as you iterate to infinity.

But anyway, that's the iteration, and it gives us

an explicit formula for the generating function for the average external path length.

That gives us the average external path length in a trie.
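Collecting the iteration into a single formula (my reconstruction of the pattern just described):

```latex
C(z) \;=\; z\sum_{j\ge 0}\Bigl(e^{z}-e^{(1-2^{-j})z}\Bigr)
\;=\; z\bigl(e^{z}-1\bigr)+z\bigl(e^{z}-e^{z/2}\bigr)+z\bigl(e^{z}-e^{3z/4}\bigr)+\cdots
```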

9:04

Okay, so we're looking for the coefficient of z to the N in that;

well, N factorial times the coefficient of z to the N.

If you go ahead and expand e to the z,

multiply by N factorial, and take the coefficient of z to the N,

then we get this difference: 1 minus (1 - 1/2 to the j) to the N - 1 power.

Again, that's a fairly simple formula:

an explicit formula for the average external path length in a trie.

It still does have an infinite sum, and

we're going to have to work with it a little bit.
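The explicit formula is easy to evaluate directly; here's a Python sketch (the `epl` helper is my own, truncating the infinite sum, whose tail terms are about (N-1)/2^j):

```python
from math import log2

def epl(N, terms=200):
    """Explicit formula for the expected external path length of a trie:
       C_N = N * sum_{j>=0} [ 1 - (1 - 2^-j)^(N-1) ].
    Truncated at 200 terms, far more than needed for these N."""
    return N * sum(1.0 - (1.0 - 0.5 ** j) ** (N - 1) for j in range(terms))

for N in (10, 100, 1000, 10000):
    print(N, epl(N) / N - log2(N))   # hovers near a constant as N grows
```

Printing C_N/N minus lg N for growing N shows it settling near a constant, which previews the result on the coming slides.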

I'm working through this with elementary analysis

just to make sure that people see what underlies the solution.

Because we get to a kind of surprising result, and

it's good to see what's underlying it, what the structure is.

So that's what we're going to do.

To characterize this fully, analytically, requires complex-analytic methods that

we'll talk about in part two of the course.

10:15

So what we're going to see is, on the next slide,

we're going to use the exp-log approximation.

This term becomes 1 minus e to the -N over 2 to the j.

And that's not a difficult calculation using exp-log.

On the next slide, we'll show that the result is pretty close to N log base 2 of N.

10:43

So that's the external path length in a trie.

This step is not difficult.

I didn't mean to skip through it so quickly, but

it's not difficult at all to show with exp-log.

But now, what I'm interested in is looking at this sum:

the sum over j greater than or equal to 0 of 1 minus e to the -N over 2 to the j.

11:06

So how are we going to characterize that?

It's a really interesting function to take a look at.

What I'm going to try to do with this analysis is

isolate the periodic terms.

There's an oscillation in here, and

I want to try to isolate the part that oscillates.

The way that's going to happen is, we're going to take the function

log base 2 of N, and we're going to work with the integer part of log base 2 of N.

And then there's a fluctuation every time you come to an integer:

as you go from one integer to the next, there's a fluctuation in the function.

It's as if you're working with the real function log x, but

you're only picking off the integer part of it.

So let's see how it works.

All we're going to do is take that infinite sum and

break it into two parts:

the part where j is less than the floor of log N, and

the part where j is greater than or equal to the floor of log N.

So we just split the sum into two parts at that one point.
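In symbols (writing lg for log base 2), the split just described is:

```latex
\sum_{j\ge 0}\Bigl(1-e^{-N/2^{j}}\Bigr)
\;=\; \lfloor \lg N \rfloor
\;-\; \sum_{0\le j<\lfloor \lg N\rfloor} e^{-N/2^{j}}
\;+\; \sum_{j\ge\lfloor \lg N\rfloor}\Bigl(1-e^{-N/2^{j}}\Bigr).
```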

And the key thing to notice about this is that when j gets bigger than log base 2 of N,

then 2 to the j is bigger than N.

So we have e to the minus N over something huge,

which is a number very close to 1, and the term

1 minus that goes away.

When j is small, say j is 2 or something,

we have e to the -N over 4, which is tiny,

so the term is just about 1.

So basically, we're going to have about log N terms that are very close to 1,

all the rest are going to be very close to 0,

and there's just a little bit left in the middle.

So let's look at how that goes.

In this case, we just split off the terms that are close to 1, so we get floor of log N,

minus the sum of e to the -N over 2 to the j;

that's just the first sum split into two parts.

And then this last sum is the same, so no change there.

13:13

So now, what we're going to do is

let this j go as far negative as it wants.

It doesn't matter, because if j goes far negative,

that term is exponentially small.

So it doesn't matter:

we're off by an exponentially small amount.

That's going to allow us to combine the two sums in that part.

So change j to j plus floor of log N in this sum.

13:49

And in the same way, change j to j minus floor of log N in that sum.

And now we have some floor-of-log-N terms coming into these sums.

But then, there's N over 2 to the floor of log N.

Well, that's like taking the function log N and subtracting off the integer part:

all that's left is the fractional part.

So what comes out of this is the fractional part of log N.

The floor of log N is the biggest integer not exceeding log N,

so what's left is the function log N minus floor of log N.

As N increases, this function, the fractional part of log N,

goes from 0 to 1 and back again.

So it's the fractional part of log N:

it's an oscillating function.

But these other functions, also, are just functions of the fractional part of log N.

Again, I skipped through just a tiny bit of the subtracting of floor of log N from log N,

but you'll see that it works out.

So now I have these three parts that are all

functions of the fractional part of log N.

That is, they oscillate.

So this is my external path length, divided by N.

And right here, this is a proof that it's asymptotic to N log base 2 of N,

because the other terms are all between 0 and 1.

But what's really interesting about this is

what we see if we look at these functions. This is the proof that I just went through:

if you have that recurrence, then that leads to this result.

But let's look at these three terms in just a little more detail.

15:42

Now, the second one is this sum of e to the minus 2 to the power (fractional part of log N, minus j).

That one also oscillates as N increases;

it oscillates, and that's the approximate magnitude of it.

And then this third one, it also oscillates.

So we have these three oscillating terms that we're adding together to

give us the difference between log N and our function.

And what's amazing is that if you add these all together,

you get a smooth oscillating curve that's very tiny in magnitude:

about 10 to the -6 in magnitude.
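Both the periodicity and its tiny magnitude can be seen numerically; here's a sketch (the helper `f` is my own, truncating the infinite sum at 200 terms). Doubling N shifts the sum over by exactly one term, so the difference from lg N repeats; in between powers of 2, it moves only far out in the decimal places.

```python
from math import exp, log2

def f(N):
    """sum_{j>=0} (1 - e^{-N/2^j}) minus lg N, truncated at 200 terms.
    This is the quantity whose oscillation we're isolating."""
    return sum(1.0 - exp(-N / 2 ** j) for j in range(200)) - log2(N)

# Doubling N shifts the sum by one term, so f is essentially periodic
# in lg N; between powers of 2 the variation is minuscule.
for N in (100, 141, 200, 800):
    print(N, f(N))
```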

16:27

Not something you would notice if you didn't do the math.

And we're going to see in part two,

we'll look a little bit into how to do the math.

But certainly, it's quite a surprise to see these functions, so

different in character, adding up to this tiny continuous function.

16:52

So this is a proof that C of N over N, minus log N, is this strange oscillating function

that you wouldn't see unless you looked out to six decimal places.

And the question is: is there a reason why a recurrence like that,

which seems such a natural recurrence involving an integer cost,

should have this strange periodic behavior?

And the answer is yes.

We're going to see,

when we get to complex-analytic techniques and the Mellin transform,

that there is a perfectly valid explanation for such periodicity.

And not only that, it's bound to appear in the analysis of lots of algorithms

that are based on properties of bitstrings, like tries.

17:37

So applying that result and taking a look at the distribution that we get with

tries, you can see that it's even more tightly concentrated around a particular shape,

basically divided at the center,

even more so than BSTs.

And so, just to quote the results that we get from this kind of analysis:

the extra space is about 44 percent, and that's the external nodes that are void.

18:07

The expected search cost is about log base 2 of N.

That's the number of bits you have to examine.

And that's a very acceptable search cost, and competitive.

And for leader election, well, that'll be an exercise.

So that's our discussion of trie parameters.