To sort
a fairly large data set in MapReduce,
the input is a series of key/value pairs,
and for the output
you want to emit the set of sorted values.
The Map function that we would write
would take each key/value pair
and it would simply output (value, _)
as the key/value pair.
We don't really care
what the value here is for the output key/value pair.
Essentially this is the identity function;
it does not do any processing.
Similarly, the Reduce also does no processing;
it just outputs the same key/value pair.
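
The two functions just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `map_fn` and `reduce_fn` are hypothetical, and `None` stands in for the "whatever" output value that we don't care about.

```python
from typing import Any, Iterable, Iterator, Tuple


def map_fn(key: Any, value: Any) -> Iterator[Tuple[Any, None]]:
    # Emit the input value as the output key; the output value is a
    # placeholder (None here), since nothing downstream reads it.
    yield (value, None)


def reduce_fn(key: Any, values: Iterable[Any]) -> Iterator[Tuple[Any, Any]]:
    # No processing: emit one pair per incoming value so that
    # duplicate keys are preserved in the output.
    for v in values:
        yield (key, v)
```

Neither function compares or reorders anything; the ordering has to come from somewhere else in the framework.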
You might wonder "How does this work?"
Well, it works because the Map output is already sorted,
for example using Quicksort,
and the Reduce input is also sorted, for example using Mergesort.
So as long as the Reducers are also ordered
with respect to each other
and you assign the file names
for each of the Reducer outputs appropriately,
you will get output that is also sorted
based on the values.
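
As a small illustration of that last step, here is a sketch of collating Reducer outputs by file name. The zero-padded names such as `part-00000` and the helper `collate` are assumptions made for the example; the point is only that if each file is sorted and the files are ordered with respect to each other, reading them in file-name order gives a sorted result.

```python
from typing import Dict, List


def collate(outputs: Dict[str, List[int]]) -> List[int]:
    # Read the Reducer output files in increasing file-name order and
    # append their contents. Zero-padded names sort correctly as strings.
    result: List[int] = []
    for name in sorted(outputs):
        result.extend(outputs[name])
    return result


# Hypothetical Reducer outputs: each list is already sorted, and every key
# in one file is smaller than every key in the next file.
outputs = {
    "part-00001": [1200, 1999],
    "part-00000": [3, 17, 950],
    "part-00002": [2001, 2500],
}
merged = collate(outputs)
```

If the files were not ordered with respect to each other, the same concatenation would not be sorted, which is why the partitioning matters.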
The only concern here is one of partitioning.
You cannot use hash-based partitioning over here.
You want to make sure that the Reducers themselves
are also ordered, so that their
corresponding output files can be given
file names in increasing order
and they can be collated appropriately.
So you want to partition keys across Reducers
based on ranges.
In other words, you want to assign each Reducer
a range of keys,
say Reducer number one gets keys 0 through 1,000,
Reducer number two gets keys 1,001 through 2,000,
and Reducer number three gets keys 2,001 through 3,000,
and so on and so forth.
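
A range-based partitioning function for exactly these three ranges might look like the sketch below. The names `UPPER_BOUNDS` and `range_partition` are hypothetical, the returned index is zero-based (so Reducer number one is index 0), and sending keys above the last bound to the last Reducer is an assumption of this example.

```python
import bisect
from typing import List

# Inclusive upper bound of each Reducer's key range, in increasing order.
UPPER_BOUNDS: List[int] = [1000, 2000, 3000]


def range_partition(key: int, upper_bounds: List[int] = UPPER_BOUNDS) -> int:
    # bisect_left returns the index of the first bound that is >= key,
    # which is the Reducer whose range contains the key.
    idx = bisect.bisect_left(upper_bounds, key)
    # Assumption: keys beyond the last bound go to the last Reducer.
    return min(idx, len(upper_bounds) - 1)
```

Because a larger key never maps to a smaller Reducer index, the Reducers stay ordered with respect to each other, which hash-based partitioning does not guarantee.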
So this again goes into the partitioning function.
In addition,
the values may be non-uniformly distributed,
so there might be many more keys
in the range, say, 2,001 through 3,000,
so you may want to assign more Reducers to that range.
So you may want
to take the data distribution into account
while assigning ranges of keys
to the Reduce tasks.
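
One common way to take the distribution into account is to sample the keys and cut the ranges at evenly spaced ranks of the sample. The sketch below assumes that approach; `choose_boundaries`, `partition`, and the sample data are all hypothetical and are not taken from the lecture.

```python
import bisect
from typing import List, Sequence


def choose_boundaries(sample: Sequence[int], num_reducers: int) -> List[int]:
    # Sort the sample and pick num_reducers - 1 split points at evenly
    # spaced ranks, so each range holds about the same number of keys.
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def partition(key: int, boundaries: List[int]) -> int:
    # Keys below boundaries[0] go to Reducer 0, keys in
    # [boundaries[0], boundaries[1]) go to Reducer 1, and so on.
    return bisect.bisect_right(boundaries, key)


# Hypothetical sample with most keys falling between 2,001 and 3,000.
sample = [100, 900, 2100, 2200, 2300, 2400,
          2500, 2600, 2700, 2800, 2900, 2950]
boundaries = choose_boundaries(sample, num_reducers=4)

# Count how many sampled keys each of the four Reducers would receive.
counts = [0, 0, 0, 0]
for k in sample:
    counts[partition(k, boundaries)] += 1
```

With this sample, three of the four Reducers end up covering parts of the crowded 2,001 through 3,000 range, while the ordering of Reducers by key range is kept.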