The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

From the course by Stanford University

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

248 ratings

Stanford University

248 ratings

Course 3 of 4 in the Specialization Algorithms

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

From the lesson

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

So, welcome to this optional sequence of videos on state of the art implementations of the union to find data structure. Now as advanced in optional material, let me make a few comments before we get started.

So the first comment is I'm going to assume that you're in a particularly motivated mood. Now, no one's forcing you to take these algorithm courses, so I'm always assuming that you're very motivated individual, but that's going to hold even more true than usual in these advanced videos. And the way that plays out is while I'm going to hold myself to the same exacting standards of clarity that I usually do, I'm going to ask a little bit more of you, the viewer. I'm going to perhaps dot a few less Is, cross a few less Ts than I usually do. So you may find yourself periodically needing to pause and think through some of the details that I'm glossing over. The second comment is I'm going to give is sort of short shrift to applications of the union find data structure. That's not because there aren't any, there's plenty of applications of this data structure but for these videos we're really just going to immerse ourselves in the beauty and the depth of ideas behind the design of the data structures and especially the analysis of their performance.

The final comment is to keep in mind that this is some seriously next level material. It is totally normal that the first time you see this stuff, you find it confusing. You find it difficult to understand. So confusion should not discourage you. It does not represent any intellectual failing on your part. Rather, keep in mind, it represents an opportunity to get even smarter. So, with those comments in mind, let's go ahead and proceed to a different approach to the Union-find data structure via lazy unions.

So let's have one slide with a quick review of what we already know about the Union-find data structure. Recall we discussed this in the context of a fast implementation of Kruskal's minimum spanning tree algorithm The raison d'etre of a union-find data structure is to maintain a partition of a universe of objects. So in the context of Kruskal's algorithm, the objects were the vertices, and the groups we wanted to maintain were the connected components with respect to the edges we committed ourselves to so far.

The data structure should support two operations. No prizes for guessing the names of those two operations. First is the find operation. This is given an object from the universe return the name of that object's group.

So for example if X was in the middle of this square that I put on the right representing the universe capital X we would return a C3. The name of the group that contains X.

So in the context of Kruskal's algorithm, we use find operation to check for cycles. How did we know whether adding a new edge would create a cycle with respect to the edges we've already chosen? Well, it would be if and only if the end points of that edge were already in the same connected component. That is, if and only if the two find operations returned the same answer.

In the Union Operation you're given out one object but you call them x and y, and then the responsibility of the operation is to merge the groups that contain x and y. So for example in the picture y was in C4, x was as before in C3. Then your job is to fuse the groups C3 and C4 into a single group.

In the context of Kruskal's algorithm, we needed this operation when we added a new edge. This fused together two of the existing connecting components into one. So that was exactly the union operation.

So we all ready went through one implementation of union and find. That was order to get a blazingly fast implementation of Kruskal's algorithm. So let's just review how that works.

With each group, we associated a linked structure, so each object had one pointer associated with it. And the invariant was in a given group, there was some representative leader object. And everybody in that group pointed to the leader of that group.

So for example, on the right I'm showing you a group with three objects, x, y, and z, and x would be the leader of this group. All three of the objects point directly to x, and that would just be the output of the find operation, the leader of the group. We use that as the group name.

Now the part which is cool and obvious about this approach is our find operations take constant time. All you do is return the leader of the given object. Now the tricky part was analyzing the cost of Union operations. So the problem here is that to maintain the invariant that every object of a group points to its leader. When you fuse two groups, you have to update a bunch of the objects' leaders. We'd only have one leader for the one new group.

The simple, but totally crucial, optimization that we discussed was when two groups merge you update the leader pointers of all objects in the smaller group to point to the leader of the bigger group. That is the new fused group inherits the leader from the bigger of its constituent parts. If we do that optimization, it's still the case that a single union might take linear time fade event time but a sequence of n unions takes only big o of n log in time. And that's because each object endures at most a logarithmic number of leader updates, cause every time it's leader pointer gets updated the population of the group that it inhabits doubles.

So in this sequence of videos we are going to discuss a different approach to implementing the union find data structure.

All right, well how are we going to do that? Well, I think if we look at a picture, it'll be clear what the approach is going to be.

So let's look at a simple example. There's just six objects in the world and currently they're in two groups. One, two, and three are in one group with leader one. Four, five, and six are in the second group with leader four. So, with our previous implementation of union find and the two groups have equal size, so we pick one arbitrarily to represent the new leader. Let's say four is going to be the new leader. And now we update objects one, two, and three so that they point directly to their new leader, four.

And so the new idea is very simple let's just sort of as a short hand for re-wiring all of one, two and three to point to four, let's just re-wire one pointer to point to four and then it's understood that two and three as descendants of one also now have the new leader four as well.

So as a result, we do again get a directed tree afterwards, but we don't get one with just depth of one. We don't get as shallow, pushy a tree. We get a deeper one that has two levels below the root.

So, let me give you another way of thinking about this in terms of an array representation. And here I also want to make a point that why conceptually it's going to be very useful to think of these union find data structure in terms of these directed trees. You're actually implementing this data structure this is not how you'd do it. You wouldn't bother with actually pointers between the different objects. You'd just have a simply array indexed by the nodes. And in the entry corresponding to a node i you would just store the name of i's parent. For instance if we reframe this exact same example in an array representation before we do the union what do we got? Well objects one, two, and three all have parent one. Objects four, five, and six all have parent four. So in the array we'd have a one. In the first three entries and a four in the second three entries.

So in the old solution we have to update the parent pointers of objects one two and three, they're all going to point directly to four, so that gives us an array entirely of fours. Whereas in the new solution the only node whose parent pointer gets updated is that of object one, it gets updated from itself to four.

So that's how union works so in simple example. How does it work in general? Well, in general, you're given two objects, each belongs to some group. You can think of these two groups conceptually as directed trees. If you follow all the parent corners, all parent corners lead to the root vertex. We identify those root vertices with the leaders of these two groups. Now when you have to merge the two trees together, how do you do it? Well, you pick one of the groups and you look at its roots currently it points to itself. You can change its parent pointer to point to the other groups leader. That is you install one of the two roots as a new child of the other trees root.

So I hope the pros and cons of this alternative approach to the union find data structure are intuitively clear. The win, the pro, comes in the union operation, which is now very simple. Now you might be tempted to just say union takes constant time, because all you do is update one pointer, It's actually not so simple, right? Because remember you're given two objects, x and y, and in general you're not linking x and y together. They might be somewhere deep in a tree. You're linking roots together. So what is true is that the union operation reduces two two indications of find. You have to find x's root. R sub 1, you have to y's root r sub 2, and then you just do constant time linking either r1 to r 2 or vice versa. Now the issue with this lazy union approach is that it's not at all obvious, and in fact it's not going to be true, that find operation takes constant time. Remember, previously when we actually did this hard work of updating all these leader pointers every time we had a union it guaranteed that whenever there is a find boom, we just look in the field, we return the leader pointer, we're done. Here, parent pointers do not point directly to roots rather you have to traverse in general a sequence of parent pointers from a given object x to go all the way up to the root of the correspondence tree.

So because of these trade offs. These pros and cons, it's really not obvious at all whether this lazy union approach to the union find data structure is a good idea. Whether it's going to have any pay offs. So that's going to take really some quite subtle analysis which is the main point of the lectures to come. It's also going to take a couple of optimizations and the next video will start with the first one. You might be wondering, okay, so when you do a union and you have to install one root under the other, how do you make that choice? So the right answer is something called union by rank. That's coming up next.

Coursera provides universal access to the world’s best education, partnering with top universities and organizations to offer courses online.