0:03

Next we're going to talk about the analysis of path length in binary trees

Â and this is an important problem to study because,

Â that's the quantity that we use to predict performance in the analysis

Â of the binary search algorithm that I talked about before.

Â So, here's the definition that I gave before for

Â a binary tree in the level and the depth.

Â When we talk about path length, what we what to compute is

Â the average length of a path to a node that's in the tree, say an internal node.

Â So to do that we compute the total internal path link and

Â then dividing it by n gives that average cost.

Â 0:52

And so to compute the total, you just count the number of nodes at each level.

Â So level one there's 2 nodes, the level 2 there's 4,

Â level 3 there's 3, 4 there's 1, 5 there's 1, and you add that all up.

Â That's called the internal path length of a binary tree,

Â it's the sum over all, nodes of

Â length of path from the root to that node, or the sum of k times the number at k.

Â 1:31

And also, that's a quantity of interest in studying the binary search algorithm.

Â So these things, there's relationships among these that are easy to prove.

Â It's all recursive and there's a lot of notation to be able to talk about

Â these things so let's look at some simple relationships among them.

Â So we're talking about binary trees and we have a size function.

Â And just a different size function for external nodes.

Â And then left and right subtrees we'll indicate with tL and tR.

Â IPL stands for the internal path length.

Â XPL stands for the external path length.

Â So that's all the notation.

Â And again, these are simple quantities to define, as they did on the previous slide.

Â So here's some recursive relationships.

Â So one thing is,

Â the number of internal nodes in a tree is the number of internal nodes in the left,

Â plus the number of internal nodes on the right, plus one for the root.

Â 2:49

Internal path length, so this is an interesting one.

Â If you want the internal path length of the whole tree you take the internal path

Â length on the left and the internal path length on the right and add those together

Â and then what you are is you're off by one on every single node except the root.

Â So if you add t minus one then you're adding this link to all the nodes.

Â And so that's a recursive relationship for the external path length and

Â for internal path length and for external path length it's similar.

Â External path length on the left, external path length on the right but

Â you're off by one for every node because you didn't take into account the nodes

Â that connect the subtrees to the root.

Â So those are simple, but very useful, recursive relationships.

Â And then from those relationships, you can prove easy things,

Â like the number of external nodes is equal to the number of internal nodes plus 1,

Â and that's just by induction.

Â 3:57

So just plug in from this formula here.

Â The number of external nodes is the sum of the external nodes in the two parts.

Â By the inductive hypothesis, that's equal to tl plus 1 plus tr plus 1.

Â And then using the recursive relationship for

Â internal notes, we're left with internal nodes plus 1.

Â 4:28

plus 2 times the number of internal nodes in the tree.

Â And again that's inductive right from the simple recursive relationships

Â that we talked about.

Â So, and again, there's a lot of ways to prove those things, too.

Â Just using these as an exercise for getting used to the idea of path length,

Â and its relationship with a path length and

Â the sub trees, to the path length in the whole trees.

Â That is critical to the analysis that we are going to do.

Â 4:57

Okay, so here's our first problem.

Â We have a random binary tree so that's a binary tree where all tree shapes

Â are equally likely with probability one over, catalan numbers one over N plus.

Â What's the average path length of a random binary tree?

Â That's some kind of description of the tree shapes that we saw.

Â How far are all the nodes from the root and

Â how far is an average node from the root?

Â That's sort of what we're saying with that.

Â So, these are the quantities that we're going to work with.

Â So Q and K is the number of trees with N nodes and internal path link of K.

Â TN is the number of trees that's the cattlen numbers and

Â to accumulate costs is the total internal path link on all trees.

Â So that we get the average by dividing the cumulated cost by the number of trees.

Â That's the way that we compute average values of parameters.

Â So this is for two, so in this case, both of the trees

Â with two nodes have internal path length of 1.

Â So the total accumulated mature path length is 1,

Â the average internal path length is 1, that's still not so interesting.

Â Now this is a little more interesting.

Â So there's five trees with three nodes.

Â Four of them have internal path length, three.

Â 0 to the root, one to one below.

Â And two more to the next one.

Â So that's three.

Â Four of them have that structure.

Â One of them has internal path length.

Â Two the nodes are both just one from the root.

Â So Q32 equals one.

Â Is one tree size two has an internal path length to one and

Â there's four trees to size three that have internal path length three.

Â And so, that's a bug, it's t3=5, so

Â the average internal path length is 14 over 5, it's 2.8.

Â So that's the figures for three and here's the corresponding figures for four.

Â Total path lengths in all of these trees is 74 and there's 14 of them.

Â So the average expected internal path length

Â is 5.2 in this trees at size 4.

Â And then you can take that and say that the average distance to an internal node,

Â divide that by 4 again to get about how far a node is from the root.

Â So that's the quantities that we're after.

Â It's always good to do small cases like this to

Â make sure that you have a good fix on the exact quantities that you're analyzing.

Â 7:50

Expected path link of a random binary tree.

Â So, here's the analytic common [INAUDIBLE] deviation from this,

Â and these are the things we can defined so far.

Â So the counting GF is just the Catalans and

Â if that's all the information we know about the Catalan numbers.

Â The cumulative cost generating function is sum over for

Â all trees internal path length times z to the tree size.

Â 8:20

And that's also equal to the sum on N, sum on k and the path length k and so forth.

Â But from now on we're just going to go with cumulative counting.

Â And we know that the average is going to be the coefficient of z to the n and

Â the Q divided by the coefficient of z to the n and the T.

Â Now the coefficient of z to the n and

Â the Q is the total internal path length of all trees the size n.

Â If you divide that by the number of trees the size n then you get the average

Â internal path length.

Â 9:31

And so from that definition we can

Â immediately write down this decomposition.

Â This first one's for the empty tree and this other one is for the roof.

Â But if we have that function, we look at this decomposition, ipl(t),

Â is exactly equal to that and the size of T is exactly equal to that.

Â And so we have that double sum for the cumulative cost.

Â And that decomposition or

Â looking the other way that construction now we can rearrange terms.

Â And you could see for example ipl(tl)z to the phi l times E to tr, that's Q(z)T(z).

Â So that's one of the terms that we find in there.

Â And the other ones where these tl's come into effect we get T'(z)T(z).

Â This formula has those four terms, and

Â using these two equations it immediately reduces to that simple equation.

Â Relating the counting generating function in the cumulative generating function.

Â The recursive equation that relates those two generating functions.

Â Extremely simple argument.

Â It's actually possible to develop this symbolically and

Â we'll talk about derivations like that in part two.

Â But it's worthwhile having the explicit decomposition in front of us to see

Â how directly we get to the equation that matters, which is the equation relating

Â the generating functions that we can then solve and analyze.

Â 11:33

And then we can just.

Â T(z), remember that's the Catalan, and we know what T(n) is.

Â And T prime is just the derivative of that, which has two terms in it.

Â But there is one thing that simplifies calculations a lot.

Â 1- 2zT(z) If you multiply this by 2z it's just the square root of 1- 4z so

Â that takes care of the denominator.

Â So there is still a little bit of algebra multiplying these two things together but

Â there's lots of cancellations because the square root of 1- 4z times the square root

Â of 1- 4z is just 1- 4z, so I'm not going to pretend.

Â I'm going to admit some of the algebra.

Â But the final result is pretty simple.

Â So that's an explicit equation for the cumulative counting generating function.

Â And what do we need?

Â The coefficient of z to the N in that is the internal

Â path length of all the trees of size N.

Â 12:43

So the take all those binary trees and add up all

Â their internal path lengths and you get four to the vn surprisingly simple.

Â It's not exactly four to vn but it it's four to vn and

Â that's why we want to do precise before, because if you take that and

Â divide it by the Catalan, all you're left with is N squared of pi N.

Â 13:06

We divided two huge quantities, but

Â since we were working with precise results, we get a precise answer.

Â Average internal path link of a random, binary tree,

Â is asymptotic to n square root of pi n.

Â And that's a very accurate estimate [COUGH] for

Â the average internal path link.

Â And we could do it, we could get it exactly, but doing it asymptotically is

Â very compelling, because it's a simple an surprising and unexpected result.

Â 13:42

So, now let's look at the same thing for random minor researches.

Â So, that one explains, or starts to explain the shape

Â of a Catalan tree the nodes go down square root of n.

Â Which in a 10,000 node tree goes down 100

Â a pretty good size depth in that's average so it goes down a few hundred.

Â What about binary search trees?

Â 14:13

Well here's the same calculation for binary search trees.

Â But now we're counting permutations that result in a binary

Â search tree with N nodes and internal path length k.

Â And now it's a little easy because the counting factor is N factorial,

Â we know that.

Â But still the cumulated cost is the same way.

Â It's just that we have to count the number of permutations.

Â So, in this case, some of the trees have internal path length four,

Â five, and six have before, but the weighting is different.

Â So all of these twelve permutations give four.

Â There's only four of them that give five that's these four, and

Â the remaining eight give six.

Â So the total internal path length of the trees that you

Â get from all of the 24 permutations 74.

Â If you divide that by 24, you get 4.833, which is less than what you get for

Â entries because the balanced ones are weighted much more likely.

Â 15:38

So start as usual, but now we have permutations.

Â And so our cost is the length of the permutation.

Â But now, when we say, ipl, we mean, its the ipl of a permutation.

Â Which doesn't make sense, except if you say that the permutation,

Â what we want is the internal path link to the BST,

Â you've got from that permutation by inserting it into a initially empty tree.

Â And again, the number of permutations of size N is N factorial, and

Â that saves us a step in the calculation, as you'll see.

Â Accumulated cost is the total ipl of binary search trees built from all

Â permutations.

Â 16:37

Our cumulative cost, EGF, it's the same symbols as before,

Â except now we have permutations, not trees, and ipl has that different meaning.

Â And so to get the expected internal path length of a binary search tree built

Â from a random permutation, we're going to take N factorial coefficiency,

Â the N in that, divided by N factorial, and those N factorials cancel out.

Â So it's just almost as treating it as an ordinary generating

Â function gives us the divide by N factorial for free.

Â It's just a little calculation trick because we're using exponential generating

Â functions, which means that we divide by N factorial.

Â 17:22

But our probability space is permutations, so we want to divide by N factorial,

Â so it serves both purposes and saves us the step of dividing.

Â We just look for the coefficient of z to the N and C(z).

Â So, next, we have to look at developing a generating function equation for

Â that C(z), using the recursive definition of the binary tree structure,

Â according to the way that we construct trees.

Â 18:11

And then, what we're going to do is use this relationship that we talked about

Â before that gives us all the permutations that now lead to the same tree.

Â And so, that decomposition from a permutation

Â P to a permutation that ipl(p) is,

Â those are all the permutations that give rise to that.

Â And ipl(p), it's got that recursive relationship as before.

Â We can express everything there in terms of pL and

Â pR because of this decomposition that we already talked about.

Â And this one's even simpler than the other one.

Â There's just one trick that often works with exponential generating functions.

Â This (|pL| + |pR| + 1) factorial causes in the denominator,

Â causes a little inconvenience in simplifying this formula.

Â But we have z to that same quantity.

Â So what we do is differentiate to cancel out that one factor.

Â Then you'll see there's a pL + pR factorial on the bottom.

Â And then we can mix that with the binomial coefficient.

Â 19:32

So this is a trick that often works with exponential generating functions.

Â Differentiate, then cancel the pL +

Â pL in the binomial coefficient, and you're left with a pL factorial, pR factorial.

Â And now you've got a very simple double sum

Â that completely decomposes and separates.

Â 19:55

The [COUGH] p of zb/p factorial is just P(z), and

Â P'(z) is the one that takes care of pL/pL factorial,

Â it's (pL-1) factorial, so use this equation.

Â And then again, we get a very simple differential equation now for

Â the cumulative cost exponential generating function for path length in BSTs.

Â 20:23

And decomposition's important.

Â It's gotta be precise, it's gotta be correct.

Â But once you have that decomposition, you automatically get a relationship,

Â that the exponential generating function for

Â the accumulative cost in the counting EGF that you have to satisfy.

Â [COUGH] And again, solving for C(x) gives,

Â it's just simplifying that, substituting in for

Â P(z) gives a very simple equation.

Â So recognize that equation, we saw that in the first lecture actually.

Â That's the equation that came up with solving the quick sort recurrence.

Â I guess it came up in the generating function lecture,

Â pretty much the same equation.

Â 21:15

So, just with that observation,

Â we solve that because it's a first order differential equation

Â that has a simple integration factor and was not difficult to solve.

Â So now we can summarize the analysis of

Â expected length in a BST built from a random permutation on this slide.

Â 21:52

Differentiate to simplify.

Â Substitute to get further simplification.

Â And that is a generating function that has the solution 2/(1-

Â z) squared, log of 1 / 1-z- 2z / (1-z) squared.

Â And those are elementary series that now we can expand to find the coefficients,

Â as we did before, to show that the average internal path length in

Â a binary search tree is asymptotic to 2 and natural log n.

Â So a log n as a factor, if you divide by n,

Â log n to the average node in a binary search tree,

Â square root of n in a binary catalan tree.

Â Since the same equation comes up and

Â people who have had algorithm courses know that there's

Â a bijection between quicksort and binary search trees that explains this.

Â We could have analyzed binary search trees just by taking

Â advantage of this bijection.

Â So that is in quicksort, you have the first entry in the permutation is

Â the partitioning element and the smaller ones and larger ones are mixed.

Â And then after the partitioning, you do them independently.

Â And that's pretty much the same in a binary search tree.

Â The first one is the root, and then you do the left and the right independently.

Â So, you can show that the average number it compares for quicksort is exactly

Â the average external path length to BST built from a random permutation.

Â And that's an interesting bijection to know, to take advantage of.

Â Now this same approach works for a lot of parameters of trees, and there's exercises

Â that compute the number of leaves of trees and other things like that in the text.

Â 23:55

For finding the height of trees, it's much more intricate.

Â It's a different approach because it doesn't break down really as well in that,

Â but that's an interesting problem, because it's a natural thing to

Â want to know what's the furthest node from the root in a tree.

Â That tells us more information about the shape of the tree.

Â So the height derivation is described in the text.

Â And actually for both binary search trees and trees,

Â the height was an open problem for quite a while.

Â So just to summarize what we know about the shapes of

Â these two different tree structures, we looked at random binary trees and

Â binary search trees built from random permutation.

Â Those are typical shapes of those trees.

Â The average path length, that's the average

Â distance to a node in a random tree, for binary trees, it's square root of pi N.

Â For BSTs from random permutation, it's 2 ln N.

Â And again, in the book, there's some description of,

Â although both of these derivations are quite intricate,

Â the height now is known for random binary trees to be twice that average.

Â 25:11

And for random BSTs, it's a little more than twice.

Â It's a constant times ln N, where the constant is about 4.31.

Â And, again, there's lots of things that you might want to study about trees, and

Â I've only really talked in detail about path length.

Â