0:19

So, now we're going to extend the API to talk about two-dimensional keys. You can think of a two-dimensional key as a point in two-dimensional geometric space. We're going to talk about insertion and search, then deletion, and then range search and range count. So we want to be able to insert and delete points, and we want to be able to find all keys that lie within a two-dimensional range, that's a rectangle, as I mentioned at the beginning, or count the number of keys that lie in a two-dimensional range.

So again, the geometric interpretation is that the keys are points in the plane, and a 2d range is a rectangle aligned with the horizontal and vertical axes. And we want to be able to find or count the points in a given rectangle.

And this one has many, many applications, and we'll talk about some of them later on. And even if it's not points in the plane, just databases: you might ask for all the people with incomes between 1 million and 10 million who are between 40 and 50 years of age, and this kind of data structure would be useful for that kind of situation too.

So how are we going to solve this problem and implement this API? We build a data structure containing the points that can efficiently support range searching and range counting.

2:04

Well, one easy way to do it is to just think about dividing space into a grid of squares. So we'll pick a parameter M and divide space into an M-by-M grid of squares. Then we process all the points and make a list of the points that are contained in each square. We can use a two-dimensional array to directly index the relevant squares. For insert, you just take (x, y), figure out which square it belongs to from its coordinates, and then look in the two-dimensional array and add the point to the list for the corresponding square. And then for range search, just find the squares that intersect the query rectangle and process the points in those squares.
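To make the grid method concrete, here's a minimal sketch in Python. It assumes coordinates in [0, 1); the class and method names are illustrative, not the lecture's code.

```python
class Grid:
    def __init__(self, m):
        self.m = m
        # m-by-m array of lists, directly indexed by square
        self.cells = [[[] for _ in range(m)] for _ in range(m)]

    def _square(self, x, y):
        # map a coordinate in [0, 1) to a grid index in 0..m-1
        return (min(int(x * self.m), self.m - 1),
                min(int(y * self.m), self.m - 1))

    def insert(self, x, y):
        # add the point to the list for its square
        i, j = self._square(x, y)
        self.cells[i][j].append((x, y))

    def range_search(self, xlo, ylo, xhi, yhi):
        # examine only the squares that intersect the query rectangle
        ilo, jlo = self._square(xlo, ylo)
        ihi, jhi = self._square(xhi, yhi)
        found = []
        for i in range(ilo, ihi + 1):
            for j in range(jlo, jhi + 1):
                for (x, y) in self.cells[i][j]:
                    if xlo <= x <= xhi and ylo <= y <= yhi:
                        found.append((x, y))
        return found
```

The space-time trade-off discussed next comes from the choice of the parameter m here.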

2:55

And, depending on the value of the parameter M, you have a space-time trade-off. The amount of space required is M squared for the grid, plus N, since you have to have a linked-list node or whatever for each point. But then the time gets divided by M squared,

3:17

because your N points are spread out over the M squared different squares. And so on average, you examine N over M squared points per square. So you don't want to make M too big, it'd be too much space. And you don't want to make M too small, it'd be too much time.

3:38

So we want to choose the square size that best balances these two needs. And it's easy to see that you should choose M to be about the square root of N. Then your space is within a constant factor of N, and your time is constant per point. So if the points are randomly distributed, this is ideal: it takes linear time to initialize the data structure, and insert and search take constant time per point in the range. And this is absolutely a fine method that is not that difficult to implement, in the case that the points are evenly distributed.

4:22

Unfortunately, it's usually the case in geometric data that the points are not evenly distributed. It's a well-known phenomenon known as clustering: the points aren't going to be evenly distributed over the whole space. In the case of the grid implementation, they might all fall in the same square. And so even though the average list length is short, some lists can be very long; this is like what we encountered with hashing. If you take all the points in one square and none in all the rest, your average is still N over M squared, but they're all in that one long list, and you're going to have a slower algorithm if it's based on this. So we need a data structure that more gracefully adapts to the distribution of the data. And again, it's well known that most geometric data has this kind of problem.

So for example, here's some data, which is cities in the US. It's got 13,000 points, but if you try to use the grid implementation for this, you find that half the squares would be empty and half the points are in just 10% of the squares. So the clustering in the data is going to make the implementation inefficient. We need to adapt to the data. And this is very, very typical in geometric data, particularly in higher-dimensional data, as we'll see in a minute.

So people have developed all different kinds of methods for adapting in this way. And what we're going to look at is one of the most widely used, which is basically to use a tree to represent a recursive subdivision of the plane, of two-dimensional space. It's going to be recursive, and it's going to be based on the points, the way in which we divide into halfplanes. This is one of many different algorithms that have been studied for this, but again, it's a simple one and widely used.

So for example, if you've played the game Doom or used a flight simulator, these types of graphical simulations and animations are made possible through the use of space-partitioning trees like 2d trees and quadtrees. And also in all different types of scientific data processing,

7:05

these things are extremely important. Whenever you're processing geometric data, you're doing some kind of geometric search. Where's the closest thing? How am I going to find the closest thing efficiently? What things are nearby, and so forth. So rest assured, these types of algorithms lie at the heart of any program you use that involves a lot of geometric data. So those are just two examples.

7:36

So let's look at how it works now. So a 2d tree, again, is going to be a data structure based on a bunch of points that's going to facilitate efficient processing of those points. Just as we do for symbol tables, where we take keys, now we're going to take points, and we're going to build a data structure based on those points. And the idea is to build a tree that corresponds to recursively partitioning the plane.

So, starting with our first point, we're going to divide the plane into two parts based on a vertical line through that point. So now in the tree on the right there, all the points that fall to the left of that vertical line are going to be on the left, and all the points that fall to the right of it are going to be on the right. But when we get to the next point, we'll switch, and we'll partition on a horizontal line.

8:51

Similarly, if our third point comes on the left, again, we'll partition according to the horizontal line through that point on the left. So if we go left and then left, that means all the points to the left of 1 and above 3; so the region in the upper left is represented by that node in the tree. Again, when we go one level below, we switch back to vertical. We alternate between horizontal and vertical partitioning of the plane. So it's a regular binary search tree, but it's got this interpretation based on the geometric data, where we switch which key we use for the comparison, the x coordinate or the y coordinate, at each level. And that corresponds to this partitioning of the plane.

So now 5 comes in. That's to the left of 4, because 4 was partitioned on a vertical line, and 5 is going to partition on a horizontal one. And this is a simple and completely well-defined partitioning of the plane corresponding to a binary tree. Now the 9th point, well, it's above 2 and to the left of 8, and then corresponds to a horizontal partitioning. The 10th point is to the right of 1, it's below 2, so we go to the left, and it's to the right of 7, so we go to the right.

So that's a way to build a binary tree corresponding to a partitioning of the plane. And it's really the same as a binary search tree; it's just that we alternate which coordinate we use as the key. At the even levels, we think of a vertical line, and the left subtree is all the points to the left and the right subtree is all the points to the right. At odd levels, we use a horizontal line, and the left subtree is all points below and the right subtree is all points above.

10:57

And the code for this is the same as for binary search trees; the only difference is which coordinate we use for the comparison. So that's the 2d tree implementation.
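As a sketch, here's what that insertion code might look like in Python. The node layout and names are illustrative assumptions; the only difference from an ordinary BST insert is which coordinate is compared at each level.

```python
class Node:
    def __init__(self, point):
        self.point = point   # (x, y)
        self.left = None     # left of / below the splitting line
        self.right = None    # right of / above the splitting line

def insert(node, point, vertical=True):
    # ordinary BST insertion, except the comparison alternates:
    # x coordinate at even levels (vertical splits), y at odd levels
    if node is None:
        return Node(point)
    cmp_index = 0 if vertical else 1
    if point[cmp_index] < node.point[cmp_index]:
        node.left = insert(node.left, point, not vertical)
    else:
        node.right = insert(node.right, point, not vertical)
    return node
```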

So now what about solving a problem like this: range search for a 2d tree. We have a query like this green rectangle, and what we want to find is all of the points in the data structure that fall within that rectangle. Well, the 2d tree represents our points, and we're going to use the structure and definition of that tree to help us find the points that are in the rectangle.

12:07

All right, so we're going to try to find all the points that are contained in that green query rectangle. So the first thing is to check whether our rectangle contains the point at the root; in this case it doesn't. Since the rectangle is to the left of the splitting line at the root, we only have to search the left subtree. Now we search the left subtree and check whether the rectangle contains point 3. It does not contain point 3, but now which subtree do we search? [COUGH] In this case, the rectangle intersects the splitting line, so we have to search both subtrees, both above and below.

12:49

So first we search the left subtree, that's the one below. Does it contain point 4? No. It's to the left, so we only have to search the left subtree of point 4. And so we search the left subtree and check whether it contains point 5, and it does; that's the one that we return. It also intersects the splitting line, so we have to search both of its subtrees. In this case they're both empty, so we're done with that. But now we have to go back and search the other subtree of point 3,

13:26

and that's the one above. So now we check whether point 6 is in the rectangle; in this case, it's not. The rectangle is to the left, so we're going to have to search the left subtree of point 6, and that one is empty, and now we return and we're done.

So we don't always go down just one branch. If a splitting line hits our rectangle, we have to go down both branches. But still, this is a very efficient algorithm. In particular, if you think about the rectangle being small, it's going to be not that different from a regular search in a binary search tree.
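Here's a sketch of that range search in Python, assuming nodes are stored as dicts with 'p' (the point), 'l', and 'r' fields; this is an illustration, not the lecture's code.

```python
def range_search(node, rect, vertical=True, found=None):
    # rect = (xlo, ylo, xhi, yhi), an axis-aligned query rectangle
    if found is None:
        found = []
    if node is None:
        return found
    x, y = node['p']
    xlo, ylo, xhi, yhi = rect
    if xlo <= x <= xhi and ylo <= y <= yhi:
        found.append((x, y))   # this point lies in the query rectangle
    # search a subtree only if the rectangle reaches that side
    # of the splitting line through this node
    lo, hi, coord = (xlo, xhi, x) if vertical else (ylo, yhi, y)
    if lo < coord:             # rectangle extends left of / below the line
        range_search(node['l'], rect, not vertical, found)
    if hi >= coord:            # rectangle extends right of / above the line
        range_search(node['r'], rect, not vertical, found)
    return found
```

If the rectangle straddles a splitting line, both recursive calls fire, which is the "go down both branches" case from the demo.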

14:03

All right, so what about the analysis of how long this is going to take? Well, in a typical case the rectangle is small, and we're really only going to examine a path in the tree, plus maybe a couple of other nodes along that path. And the running time will be proportional to the number of points returned, plus log N. With geometric data, the worst case can be bad; for example, all the points could be arranged in a circle. All different types of problems might occur, and with some difficulty, it's possible to prove that even if the tree's balanced, you can get a worst case proportional to the square root of N. So the analysis of 2d trees is a bit beyond our scope. But for many practical applications, they're easy to implement and worth using.

Let's look at using 2d trees to solve another problem, the so-called nearest-neighbor search. So now instead of a rectangle, we have a query point, and our goal is to find the closest point to that point. In this case our query point is over here in green, and our algorithm is going to want to return point 5; that's the closest one to the query point. So let's see how that looks in a demo.

15:40

Whenever we're at a node, it represents a point, so we're going to check that point and compute the distance from that point to our query point. And if that distance is less than the best found so far, then we'll keep that point as the champion. So the first point is the closest we've found so far to the query point, so we'll save the distance to point 1, and we'll only worry about the possibility of finding something closer than that. And using that distance, we recursively search any part of the tree that could contain a closer point. [COUGH] And that's what we'll continue to do.

So in this case, the query point is to the left of the splitting line, and we'll always go towards the query point first. As far as we know now, we may have to search both subtrees: there might possibly be a closer point than 1 over on the right, since if you come straight across, there might be a closer point. But we'll go towards the query point first and see if we can find something closer. So now we go to point 3 and compute the distance from that point to the query point. It's closer, so we update 3 to be our new champion. And now we're only going to look in parts of the tree that could give us a point that's closer to our query point than 3. Notice that this already means that when we get back to point 1, we won't search the right subtree, because there could be no point in the right subtree, on the right of that splitting line, that's closer to the query point than 3. And so that idea of getting closer and closer to the query point is going to cut out different parts of the tree as we proceed.

17:41

So, but anyway, starting at point 3, as far as we know, we're going to have to look at both subtrees. Sometimes we look at both subtrees, but as we get closer and closer, we only look at one. So let's look at point 3 now.

17:56

So again, go towards the query point; we'll go to the top first, and that takes us to 6. 6 is not any closer than 3 was, so that's not going to update our champion. [COUGH] And so we'll search 6's left subtree, which is empty, and then its right subtree. But the nearest neighbor can't be there; we don't have to go down the right subtree at 6, because you can't have a point in that rectangle that's closer to the query point than 3. So now we can return from that, and we have to look at the bottom subtree associated with 3. And that takes us to 4, and that one is not closer, so we still have 3 as our current champion.

18:49

So now we'll search the left subtree of 4 first, because the query point is to the left of that splitting line. And that takes us to 5, and 5 is our new champion; that's the closest point that we know about. Could there be a node that's closer to our query point than 5 in the right subtree of 4? First, we have to look at the top subtree associated with 5, and we find that it's empty. And now we're back at 4; do we have to search the right subtree of 4? No, because there can't be a closer point than 5 in the right subtree of 4. So we're done with 4, and we come back to 3. We return from there, and we're now at 1. We're supposed to search the right subtree at 1 next, but we can prune that; the nearest neighbor couldn't be in there. So then we're done, and we found that the nearest neighbor is 5.

And this is going to be very efficient, because as we get closer and closer to the query point, we're cutting out all the subtrees that are far away. And again, in practice, the running time of this algorithm is going to be close to logarithmic. So in typical cases, the running time of nearest-neighbor search in a 2d tree is going to be logarithmic. It is possible to concoct cases where you're going to have to examine all the points, for example, if they're all arranged in a circle and your query point is in the center, or something of that sort. But for typical data, it's very efficient.
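Here's a sketch of that nearest-neighbor search in Python, using the same assumed dict-based node layout as before; the pruning test checks whether the splitting line is closer to the query point than the current champion.

```python
import math

def nearest(node, query, vertical=True, best=None):
    # best is (distance, point) for the champion found so far
    if node is None:
        return best
    x, y = node['p']
    qx, qy = query
    d = math.hypot(x - qx, y - qy)
    if best is None or d < best[0]:
        best = (d, (x, y))   # new champion
    # go towards the query point first
    coord, qcoord = (x, qx) if vertical else (y, qy)
    near, far = ((node['l'], node['r']) if qcoord < coord
                 else (node['r'], node['l']))
    best = nearest(near, query, not vertical, best)
    # search the far side only if a closer point could be there:
    # the splitting line must be closer than the champion
    if abs(qcoord - coord) < best[0]:
        best = nearest(far, query, not vertical, best)
    return best
```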

20:40

Now we're going to look at an application where we simulate a phenomenon in nature, and that is: what kind of patterns do things like starlings and geese, or cranes, or fish, or fireflies make? How do they flock together? We'll look at a simulation that corresponds to that.

>> And then, when the moment is right, they behave in a way that should be impossible.

[MUSIC]

22:47

And it happens every day, right through the winter, just a couple of miles from my doorstep. How good is that?

[MUSIC]

>> So, this is a simple model developed by Craig Reynolds a while ago for simulating this situation, called boids. And the idea is to use three simple rules.

23:40

So, as that example shows, 2d trees are extremely effective in quickly processing huge amounts of geometric data. And what's more, they extend to more dimensions. With a very simple modification, we can take a 2d tree and create a data structure known as a kd tree, which works for K dimensions. The idea is, even if there are K dimensions, we recursively partition on one dimension at a time.

24:10

So that's called a kd tree. And we use the same idea as for 2d trees, but instead of just cycling through horizontal and vertical, we cycle through however many dimensions there are. So if it's in 3-space, we use a plane and do above and below, and then simply cycle through the dimensions. At level i, we put on the left the points whose ith coordinates are less than the node's point, and on the right, the points whose ith coordinates are greater. We cycle through the dimensions: at level i, we use dimension i mod k of the point to do the comparison. The implementation is simple; the only change is the comparison. And we get the same kind of partitioning for three-dimensional data, so we could do boids in three dimensions. Or for databases with a large number of dimensions, you can handle even much higher-dimensional data and find nearest neighbors and do range searching extremely efficiently.
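A sketch of kd-tree insertion in Python for points of any dimension k; the names are illustrative. The only change from the 2d tree is that the comparison coordinate cycles through i mod k.

```python
def kd_insert(node, point, depth=0):
    # nodes represented as dicts: {'p': point, 'l': left, 'r': right}
    if node is None:
        return {'p': point, 'l': None, 'r': None}
    k = len(point)
    i = depth % k   # cycle through the dimensions
    if point[i] < node['p'][i]:
        node['l'] = kd_insert(node['l'], point, depth + 1)
    else:
        node['r'] = kd_insert(node['r'], point, depth + 1)
    return node
```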

It's a very efficient and simple data structure for processing k-dimensional data that's very widely used, and the whole point is that it adapts when the data clusters, which it does, particularly in high dimensions. And one point to make for this class is that this algorithm was discovered by an undergraduate in an algorithms class. That's Jon Bentley, who discovered this while an undergraduate at Stanford. It's a simple idea, but expert scientists were struggling with dealing with huge amounts of geometric data, and Bentley found this way to process it efficiently that's been widely used ever since.

In particular, just as another example, consider the idea of N-body simulation, which is a classic problem in physics where you've got N particles mutually attracted by gravity. Basically, the computation is based on computing the interacting force for each pair of particles. So there'll be a mutual gravitational pull, and

26:39

this is what happens with a large number of particles in such a simulation. And people understand properties of the universe by doing these kinds of calculations and comparing against what's observed in space. But the thing is, if you have N particles and you have to do the calculation for each pair, that's N squared. So the progress of a scientific investigation is going to be limited by how quickly you can do this calculation for a large number of particles. There are a lot of particles out in the universe, and you can't do a quadratic calculation for large N.

So, another undergraduate in an algorithms class discovered this idea for N-body simulation, and that's Andrew Appel. His idea was that if some particle is far away from some cluster of particles, we can treat that cluster as a single aggregate particle, and not do the individual force calculation between our particle and every one of those in the aggregate, but use the center of mass instead. You get a very accurate [COUGH] approximation to the N-body calculation by doing that. And the algorithm that he used is based on 3d trees with the N particles as nodes, storing the center of mass of the subtree in each node. And then, [COUGH] to compute the total force, you traverse the tree, gathering all the information that you need to complete the N-body calculation, but in time much closer to N log N than to N squared.
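As a rough illustration of the aggregation idea (not Appel's actual algorithm or code), here's a sketch that treats a far-away cluster as a single particle at its center of mass. G, the far/near threshold theta, and the (x, y, mass) particle format are assumptions, and force magnitudes are summed as scalars for simplicity, where a real simulation would sum force vectors.

```python
import math

G = 1.0  # gravitational constant, arbitrary units

def gravity(p, q):
    # magnitude of the gravitational force between particles p and q
    (px, py, pm), (qx, qy, qm) = p, q
    r = math.hypot(qx - px, qy - py)
    return G * pm * qm / (r * r)

def force_from_cluster(p, cluster, theta=0.5):
    # center of mass of the cluster
    total = sum(m for _, _, m in cluster)
    cx = sum(x * m for x, _, m in cluster) / total
    cy = sum(y * m for _, y, m in cluster) / total
    size = max(math.hypot(x - cx, y - cy) for x, y, _ in cluster)
    dist = math.hypot(cx - p[0], cy - p[1])
    if size < theta * dist:
        # far away: one aggregate calculation instead of len(cluster) of them
        return gravity(p, (cx, cy, total))
    # nearby: fall back to the exact pairwise sum
    return sum(gravity(p, q) for q in cluster)
```

In the tree-based algorithm, each subtree plays the role of the cluster here, with its center of mass precomputed and stored in the node.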

And that idea, that you don't need to take time proportional to N squared, but with a geometric algorithm like a 3d tree you can get the time down to N log N, enabled all sorts of new scientific investigation. This is an example of the use of algorithms to enable new research.
