So, by now we've spent quite a bit of time talking about the minimum cost
spanning tree problem. There are a number of reasons why we did that.
The first is that it's just a uniquely great problem for the study of
greedy algorithms. You can propose a bunch of different greedy algorithms
and, quite unusually, all of them seem to work. So you get correct greedy
algorithms, but it's quite subtle what's driving their correctness.
You also get a lot of practice arguing about graphs, making exchange
arguments, and so on. The second reason it's been worthwhile spending time
on these algorithms is that it's given us some more practice with data
structures and how to use them to speed up algorithms, namely heaps to
speed up Prim's algorithm and the union-find data structure to speed up
Kruskal's algorithm.
The third reason we're talking about them is that they have applications
in their own right. That's the subject of this video and the next, where
we're going to focus on applications to clustering problems.
So let me begin by just talking about the goal of clustering informally.
And then, I'll let you pin me down to a precise objective function on the next
slide. So in a clustering problem, the input is
n points that we think of as being embedded in space.
And it's actually quite rare that the underlying problem we care about is
intrinsically geometric, that it's really about points in space. Usually,
we're representing something else we care about, maybe web pages, maybe
images, maybe records in a database, as points in space.
And given a bunch of objects, we want to cluster them into, in some sense,
coherent groups. For those of you coming from a machine learning background,
you'll often hear this problem referred to as unsupervised learning, meaning
the data is unlabeled. We're just looking for patterns in data that hasn't
been annotated. This obviously is a fairly wishy-washy
description of a problem, so let's be a little bit more precise.
We're going to assume that part of the input is what we're going to call a
similarity measure, meaning for any two objects, we're going to have a
function giving us a number indicating how similar, or really rather how
dissimilar, they are to each other.
In keeping with the geometric metaphor, we're going to refer to this function as
a distance function. One thing that's cool is we don't need to
impose many assumptions on this distance function.
The one thing we're going to assume is that it's symmetric,
meaning the distance from p to q is the same as the distance from q to p.
So what are some examples? Well, if you really want to go with the
geometric metaphor, if you're representing these points in the space R^m
for some dimension m, you can just use the Euclidean distance or, if you
prefer, some other norm, like say l1 or l-infinity.
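To make that concrete, here's a minimal sketch in Python of what such
distance functions might look like. The function names and the choice of
representing points as plain tuples are illustrative only; any symmetric
function of two objects would do.

```python
import math

def euclidean(p, q):
    # l2 (Euclidean) distance between two points in R^m, given as tuples.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def l1(p, q):
    # l1 distance: the sum of coordinate-wise absolute differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def l_infinity(p, q):
    # l-infinity distance: the largest coordinate-wise absolute difference.
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# All three are symmetric (d(p, q) == d(q, p)), which is the one
# assumption we're imposing on the distance function.
```

For instance, euclidean((0, 0), (3, 4)) evaluates to 5.0, and swapping the
two arguments gives the same answer, as the symmetry assumption requires.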
In many application domains, there are widely accepted similarity or
distance measures. One example, for sequences, as we discussed in the
introductory lectures, is the penalty of the best alignment between two
genome fragments.
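Just to convey the flavor of a sequence distance, here's a sketch using
unit-cost edit distance as a stand-in; real alignment penalties use
domain-specific gap and mismatch costs, so treat this as illustrative only.

```python
def edit_distance(a, b):
    # Unit-cost edit distance between strings a and b, a simplified
    # stand-in for an alignment penalty. prev[j] holds the distance
    # between the current prefix of a and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Note that this, too, is symmetric in its two arguments, so it qualifies as
a distance function in the sense above.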
So now that we have this distance function, what would it mean to have
coherent groups? Well, things which have small distance from each other,
which are similar, should generally be in the same group, and things that
are very dissimilar, that have large distance between them, you would
expect to mostly be in different groups. So how can we evaluate how good a
clustering is, how well it's doing the job of putting nearby points
together and dissimilar points in different groups?
Well, to be honest, there are many ways of approaching and formalizing
that problem. The approach we're going to take is an optimization-based
approach. We're going to posit an objective function on clusterings and
then seek out the clustering that optimizes that objective function.
I want to warn you that's not the only way to approach the problem.
There are other interesting approaches, but optimization is a natural one.
Furthermore, just like in our scheduling application, there's more than one
objective function that people study and that is well-motivated.
So one very popular objective function would be, say, the k-means objective,
which I encourage you to look up and read more about.
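Just so you have something concrete in mind, here's a small Python sketch
of how one might evaluate the standard k-means objective for a given
clustering; representing the clustering as a list of clusters of points is
an illustrative choice of my own, not something we'll rely on later.

```python
def kmeans_objective(clusters):
    # clusters: a list of clusters, each a non-empty list of points in R^m,
    # with points given as equal-length tuples of coordinates.
    # The k-means objective is the total squared distance from each point
    # to the centroid (coordinate-wise mean) of its own cluster; better
    # clusterings have smaller objective values.
    total = 0.0
    for cluster in clusters:
        m = len(cluster[0])
        centroid = [sum(p[i] for p in cluster) / len(cluster) for i in range(m)]
        for p in cluster:
            total += sum((p[i] - centroid[i]) ** 2 for i in range(m))
    return total
```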
For this lecture, I'm just going to adopt one specific objective function.
It's natural enough, but it's by no means the only one or even the primary one.
But it'll serve the purpose of studying a natural greedy algorithm, related
to minimum spanning tree algorithms, which is optimal in a precise sense.