The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

From the course by Stanford University

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

487 ratings

From the lesson

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

So, welcome to this optional sequence of videos on state of the art implementations

of the union to find data structure.

Now as advanced in optional material,

let me make a few comments before we get started.

So the first comment is I'm going to assume that you're in a particularly

motivated mood.

Now, no one's forcing you to take these algorithm courses, so

I'm always assuming that you're very motivated individual,

but that's going to hold even more true than usual in these advanced videos.

And the way that plays out is while I'm going to hold myself to the same exacting

standards of clarity that I usually do, I'm going to ask a little bit more of you,

the viewer.

I'm going to perhaps dot a few less Is, cross a few less Ts than I usually do.

So you may find yourself periodically needing to pause and

think through some of the details that I'm glossing over.

The second comment is I'm going to give is sort of short shrift to applications

of the union find data structure.

That's not because there aren't any, there's plenty of applications of this

data structure but for these videos we're really just going to immerse ourselves in

the beauty and the depth of ideas behind the design of the data structures and

especially the analysis of their performance.

The final comment is to keep in mind that this is some seriously

next level material.

It is totally normal that the first time you see this stuff, you find it confusing.

You find it difficult to understand.

So confusion should not discourage you.

It does not represent any intellectual failing on your part.

Rather, keep in mind, it represents an opportunity to get even smarter.

So, with those comments in mind, let's go ahead and

proceed to a different approach to the Union-find data structure via lazy unions.

So let's have one slide with a quick review of what we already know about

the Union-find data structure.

Recall we discussed this in the context of a fast implementation of

Kruskal's minimum spanning tree algorithm The raison d'etre of a union-find

data structure is to maintain a partition of a universe of objects.

So in the context of Kruskal's algorithm, the objects were the vertices, and

the groups we wanted to maintain were the connected components with respect to

the edges we committed ourselves to so far.

The data structure should support two operations.

No prizes for guessing the names of those two operations.

First is the find operation.

This is given an object from the universe return the name of that object's group.

So for example if X was in the middle of this square that I put

on the right representing the universe capital X we would return a C3.

The name of the group that contains X.

So in the context of Kruskal's algorithm, we use find operation to check for cycles.

How did we know whether adding a new edge would create a cycle with respect to

the edges we've already chosen?

Well, it would be if and

only if the end points of that edge were already in the same connected component.

That is, if and only if the two find operations returned the same answer.

In the Union Operation you're given out one object but you call them x and y,

and then the responsibility of the operation is to merge the groups

that contain x and y.

So for example in the picture y was in C4, x was as before in C3.

Then your job is to fuse the groups C3 and C4 into a single group.

In the context of Kruskal's algorithm,

we needed this operation when we added a new edge.

This fused together two of the existing connecting components into one.

So that was exactly the union operation.

So we all ready went through one implementation of union and find.

That was order to get a blazingly fast implementation of Kruskal's algorithm.

So let's just review how that works.

With each group, we associated a linked structure, so

each object had one pointer associated with it.

And the invariant was in a given group,

there was some representative leader object.

And everybody in that group pointed to the leader of that group.

So for example, on the right I'm showing you a group with three objects, x, y, and

z, and x would be the leader of this group.

All three of the objects point directly to x, and

that would just be the output of the find operation, the leader of the group.

We use that as the group name.

Now the part which is cool and

obvious about this approach is our find operations take constant time.

All you do is return the leader of the given object.

Now the tricky part was analyzing the cost of Union operations.

So the problem here is that to maintain the invariant that

every object of a group points to its leader.

When you fuse two groups, you have to update a bunch of the objects' leaders.

We'd only have one leader for the one new group.

The simple, but totally crucial, optimization that we discussed was

when two groups merge you update the leader pointers of all objects in

the smaller group to point to the leader of the bigger group.

That is the new fused group inherits the leader

from the bigger of its constituent parts.

If we do that optimization, it's still the case that a single union might take linear

time fade event time but a sequence of n unions takes only big o of n log in time.

And that's because each object endures at most a logarithmic number of leader

updates, cause every time it's leader pointer gets updated the population

of the group that it inhabits doubles.

So in this sequence of videos we are going to discuss a different approach to

implementing the union find data structure.

And let's just go all in and

try to get away with updating only a single pointer in each union.

All right, well how are we going to do that?

Well, I think if we look at a picture,

it'll be clear what the approach is going to be.

So let's look at a simple example.

There's just six objects in the world and currently they're in two groups.

One, two, and three are in one group with leader one.

Four, five, and six are in the second group with leader four.

So, with our previous implementation of union find and the two groups have equal

size, so we pick one arbitrarily to represent the new leader.

Let's say four is going to be the new leader.

And now we update objects one, two, and three so

that they point directly to their new leader, four.

And so the new idea is very simple let's just sort of as a short hand for

re-wiring all of one, two and three to point to four,

let's just re-wire one pointer to point to four and then it's understood that two and

three as descendants of one also now have the new leader four as well.

So as a result, we do again get a directed tree afterwards,

but we don't get one with just depth of one.

We don't get as shallow, pushy a tree.

We get a deeper one that has two levels below the root.

So, let me give you another way of thinking about this in terms of an array

representation.

And here I also want to make a point that why conceptually it's going

to be very useful to think of these union find data structure in terms of these

directed trees.

You're actually implementing this data structure this is not how you'd do it.

You wouldn't bother with actually pointers between the different objects.

You'd just have a simply array indexed by the nodes.

And in the entry corresponding to a node i you would just store the name

of i's parent.

For instance if we reframe this exact same example in an array representation

before we do the union what do we got?

Well objects one, two, and three all have parent one.

Objects four, five, and six all have parent four.

So in the array we'd have a one.

In the first three entries and a four in the second three entries.

So in the old solution we have to update the parent pointers of objects one two and

three, they're all going to point directly to four, so

that gives us an array entirely of fours.

Whereas in the new solution the only node whose parent

pointer gets updated is that of object one, it gets updated from itself to four.

And in a particular objects two and

three still point to object one not directly to object four.

So that's how union works so in simple example.

How does it work in general?

Well, in general, you're given two objects, each belongs to some group.

You can think of these two groups conceptually as directed trees.

If you follow all the parent corners, all parent corners lead to the root vertex.

We identify those root vertices with the leaders of these two groups.

Now when you have to merge the two trees together, how do you do it?

Well, you pick one of the groups and

you look at its roots currently it points to itself.

You can change its parent pointer to point to the other groups leader.

That is you install one of the two roots as a new child of the other trees root.

So I hope the pros and cons of this alternative approach to the union find

data structure are intuitively clear.

The win, the pro, comes in the union operation, which is now very simple.

Now you might be tempted to just say union takes constant time,

because all you do is update one pointer, It's actually not so simple, right?

Because remember you're given two objects, x and y, and

in general you're not linking x and y together.

They might be somewhere deep in a tree.

You're linking roots together.

So what is true is that the union operation reduces two

two indications of find.

You have to find x's root.

R sub 1, you have to y's root r sub 2,

and then you just do constant time linking either r1 to r 2 or vice versa.

Now the issue with this lazy union approach is that it's not at all obvious,

and in fact it's not going to be true, that find operation takes constant time.

Remember, previously when we actually did this hard work of updating all these

leader pointers every time we had a union it guaranteed that whenever there is

a find boom, we just look in the field, we return the leader pointer, we're done.

Here, parent pointers do not point directly to roots rather

you have to traverse in general a sequence of parent pointers from a given object x

to go all the way up to the root of the correspondence tree.

So because of these trade offs.

These pros and cons, it's really not obvious at all whether this

lazy union approach to the union find data structure is a good idea.

Whether it's going to have any pay offs.

So that's going to take really some quite subtle analysis which is the main

point of the lectures to come.

It's also going to take a couple of optimizations and

the next video will start with the first one.

You might be wondering, okay, so when you do a union and

you have to install one root under the other, how do you make that choice?

So the right answer is something called union by rank.

That's coming up next.

Coursera provides universal access to the world’s best education,
partnering with top universities and organizations to offer courses online.