Then I select the next break, which is 2,
and that's how my elements break out.
So this might be my light red to my dark red color for example.
And so that's the maximum break method.
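To make the maximum break idea concrete, here is a minimal sketch, assuming we sort the values, measure the gap between each pair of neighbors, and put the breaks in the widest gaps. The function names, the sample values, and the choice to place each break at the midpoint of its gap are assumptions for illustration only.

```python
def maximum_breaks(values, k):
    """Return k-1 break points placed in the widest gaps between sorted values."""
    xs = sorted(values)
    # Gap between each pair of adjacent sorted values, with its position.
    gaps = [(xs[i + 1] - xs[i], i) for i in range(len(xs) - 1)]
    # Keep the k-1 widest gaps; ties go to the earlier gap.
    widest = sorted(gaps, key=lambda g: (-g[0], g[1]))[: k - 1]
    # Place each break at the midpoint of its gap (an illustrative choice).
    return sorted((xs[i] + xs[i + 1]) / 2 for _, i in widest)


def classify(value, breaks):
    """Class index is the number of breaks the value exceeds."""
    return sum(value > b for b in breaks)


values = [1, 3, 4, 9, 10]           # illustrative data
breaks = maximum_breaks(values, 3)  # three classes need two breaks
classes = [classify(v, breaks) for v in values]
print(breaks)   # [2.0, 6.5]
print(classes)  # [0, 1, 1, 2, 2]
```

Class 0 could then be drawn in light red, class 1 in red, and class 2 in dark red.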
Natural breaks are when we examine the data by hand and
try to logically determine breaks within it.
So our goal is to minimize the differences between
data values in the same class and maximize the differences between classes.
So we're trying to take into account natural underlying structures.
This is subjective though.
Since the mapmaker is sort of doing this by hand,
different mapmakers may choose other values.
But you can see that a lot of what is in
our maximum and natural breaks is really an optimization problem.
Here, we're trying to optimize the distance between two different map elements,
and so we can think about maybe even applying some of
our data mining classification methods
to try to split this into a certain number of classes.
And this optimal classification is similar to
natural breaks but we're trying to minimize an objective function.
So we create some measure of error.
We have Jenks-Caspall, Fisher-Jenks for example.
What we're trying to do is maybe calculate the median of each class and the sum of deviations about it,
or the standard deviation, or try to come up with some other optimization criterion.
And this allows us to then create our own formula for this classification.
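As a rough sketch of what minimizing an objective function can look like, the code below splits sorted values into k contiguous classes so that the total within-class sum of squared deviations from the class mean is as small as possible. That error measure is an assumption chosen for illustration (the median-based measure mentioned above would work the same way), and the function names are hypothetical rather than taken from any library.

```python
def ssd(xs):
    """Sum of squared deviations from the mean of xs."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def optimal_classes(values, k):
    """Split sorted values into k contiguous classes with minimal total SSD."""
    xs = sorted(values)
    n = len(xs)
    inf = float("inf")
    # cost[c][j]: smallest error for the first j values split into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    # start[c][j]: index where the last of those c classes begins.
    start = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # the last class is xs[i:j]
                candidate = cost[c - 1][i] + ssd(xs[i:j])
                if candidate < cost[c][j]:
                    cost[c][j], start[c][j] = candidate, i
    # Walk back through the stored split points to recover the classes.
    classes, j = [], n
    for c in range(k, 0, -1):
        i = start[c][j]
        classes.append(xs[i:j])
        j = i
    return classes[::-1], cost[k][n]


groups, error = optimal_classes([1, 3, 4, 9, 10], 3)
print(groups)  # [[1], [3, 4], [9, 10]]
print(error)   # 1.0
```

Swapping `ssd` for a different error measure changes the formula being optimized without changing the search itself.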
And so this is one of the other major classes of classification algorithms, and hopefully
you're starting to see the connection between histograms,
data mining, the classification methods we've shown, and
computer optimization for trying to decide, essentially, how we can make
a choropleth map that has a nice spread between colors and
values to allow us to perceive different regions and areas within the data.
Advantages of optimal classification include that it is a
good empirical method for grouping data and that it can
assist in determining the appropriate number of classes.
But often it's hard to explain to novice users, and we still may
wind up with gaps in the map legend, and we really don't want that to occur.
So these are some of the different map options.
And with optimal mapping,
we can also then even think about adding
different spatial constraints and we can even do this with Jenks natural breaks as well.
So what do I mean by adding different spatial constraints?
Well, let's think about a simple map, maybe something
that looks like this, where these are different counties that I've shown here.
And let's say that I've created a three-class legend,
light red, red, and dark red.
Okay? So I could make this one dark red,
light red, mid red,
mid red, and mid red.
Now remember, these are my classes.
So I could have some values here like 1,
3, 4, 9, and 10.
And somehow, my classifier has put the 10 into this bin,
4 and 9 in this bin,
and the 1 and 3 here.
But 4 is really close to 3,
and if this is value 4,
if I would have made this light red,
I'm going to start seeing a spatial cluster of light red on the map.
But if I don't take into account my spatial constraints,
I could wind up with some classifier that gives me this as my result.
Does that necessarily make sense?
Is that the message I'm trying to show on my map?
And so I can set up different optimizations that also look at neighboring values to
try to determine the role that they might play in the underlying classification.
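One way to picture such an optimization is to add a penalty for every pair of neighboring counties that end up in different classes. The sketch below is only an illustration under stated assumptions: the county names `a` through `e`, the adjacency list, the squared-deviation error measure, and the penalty `weight` are all hypothetical, and the search simply tries every way of placing the breaks.

```python
from itertools import combinations


def ssd(xs):
    """Sum of squared deviations from the mean of xs."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def classify_with_neighbors(values, neighbors, k, weight):
    """Pick k classes minimizing value error plus a penalty on split neighbors.

    values: dict mapping county name to its value.
    neighbors: list of (county, county) pairs that share a border.
    weight: cost added for each neighboring pair placed in different classes.
    """
    names = sorted(values, key=values.get)
    n = len(names)
    best = None
    # Try every placement of k-1 breaks between the sorted counties.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        label, error = {}, 0.0
        for c in range(k):
            members = names[bounds[c]:bounds[c + 1]]
            error += ssd([values[m] for m in members])
            for m in members:
                label[m] = c
        split_pairs = sum(label[a] != label[b] for a, b in neighbors)
        score = error + weight * split_pairs
        if best is None or score < best[0]:
            best = (score, label)
    return best[1]


values = {"a": 1, "b": 3, "c": 4, "d": 9, "e": 10}
# Hypothetical adjacency: a, b, and c touch each other; c touches d; d touches e.
neighbors = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

print(classify_with_neighbors(values, neighbors, 3, weight=0))
print(classify_with_neighbors(values, neighbors, 3, weight=5))
```

With no penalty, the values alone decide the classes. With the penalty turned up, the counties holding 1, 3, and 4 are grouped into one class, which is the light red spatial cluster described above.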
And really, with choropleth maps,
we also need to consider standardizing the statistics.
So, for example, in the alfalfa crops,
I showed the amount of crop being harvested,
but I didn't really normalize.
I did it by number of acres.
But out West, in a county like this,
we have way more acres available.
It's a much bigger county than say a tiny county in Rhode Island.
So should I have been dividing this by population or by area?
It depends on what I'm trying to show.
It depends on if I'm just trying to show magnitude,
if I'm trying to show a percentage so I can say what area of
the country has a higher percentage of alfalfa crop by area.
Here, I'm looking at the percent population residing in urbanized areas.
So if I only showed the population,
then regions like Chicago,
New York, and LA can become dominant.
But here, if I show the percent of population residing in urbanized areas,
then we can start seeing other sorts of trends and regions on the map,
other critical areas where we've seen high population density in urban areas.
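A small sketch of why the denominator matters: the same regions rank differently by raw count than by percentage. The region names and population figures below are made up for illustration.

```python
# Hypothetical regions: (total population, population in urbanized areas).
regions = {
    "Big Metro": (9_000_000, 6_300_000),
    "Mid City": (800_000, 720_000),
    "Rural": (50_000, 5_000),
}


def percent_urban(total, urban):
    """Share of the population living in urbanized areas, in percent."""
    return 100.0 * urban / total


# Ranking by raw urban population lets the largest region dominate.
by_count = sorted(regions, key=lambda r: regions[r][1], reverse=True)
# Ranking by percentage standardizes for the size of each region.
by_percent = sorted(regions, key=lambda r: percent_urban(*regions[r]), reverse=True)

print(by_count)    # ['Big Metro', 'Mid City', 'Rural']
print(by_percent)  # ['Mid City', 'Big Metro', 'Rural']
```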
And we also need to consider what we call the modifiable areal unit problem which is
a source of statistical bias occurring when data is aggregated into districts.
So we need to be really careful in thinking about what we're showing on the map,
whether we should have any underlying denominator,
and what sort of message we are trying to
show with our map regarding different datasets.
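To see the modifiable areal unit problem in miniature, here is a sketch that aggregates the same four cells into districts in two different ways. The grid, the counts, and the function name are hypothetical.

```python
# Hypothetical 2x2 grid of cells: (row, column) -> (events, population).
cells = {
    (0, 0): (90, 100), (0, 1): (10, 100),
    (1, 0): (90, 100), (1, 1): (10, 100),
}


def district_rates(cells, district_of):
    """Aggregate cells into districts and return each district's rate in percent."""
    totals = {}
    for cell, (events, population) in cells.items():
        d = district_of(cell)
        e, p = totals.get(d, (0, 0))
        totals[d] = (e + events, p + population)
    return {d: 100.0 * e / p for d, (e, p) in totals.items()}


# Same data, two different district boundaries.
by_row = district_rates(cells, lambda cell: cell[0])
by_column = district_rates(cells, lambda cell: cell[1])

print(by_row)     # {0: 50.0, 1: 50.0}
print(by_column)  # {0: 90.0, 1: 10.0}
```

Drawn as rows, the map looks uniform. Drawn as columns, it shows a sharp contrast, even though nothing in the underlying data has changed.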
We also have the ecological fallacy to contend with,
where inferences about individuals are based solely upon
aggregate statistics collected from the group to which those individuals belong.
So the ecological fallacy is assuming that individual members of
a group have the average characteristics of the group at large.
So when we're showing things like average values,
we may be misrepresenting what's going on in the underlying data.
And group characteristics do not necessarily apply to individuals within that group.
Imagine that we have some data distribution like this: I've interviewed the group,
and part of the group is very poor and part of the group is very
wealthy, and I'm showing median household income.
Well, the median household income in that area would look like this.
It would be somewhere in the middle between those two groups, so we would show a middle, average region.
This area might be very interesting because of this split in distributions.
And so the group characteristics do not necessarily apply to individuals.
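Here is a tiny sketch of that split distribution, using made-up household incomes. The 25 percent window used to decide what counts as "near" the median is an arbitrary choice for illustration.

```python
from statistics import median

# Hypothetical area: half the households are poor, half are wealthy.
incomes = [18_000, 20_000, 22_000, 24_000,
           180_000, 200_000, 220_000, 240_000]

mid = median(incomes)
# Households whose income falls within 25% of the median.
near_median = [x for x in incomes if abs(x - mid) <= 0.25 * mid]

print(mid)          # 102000.0
print(near_median)  # []
```

The map would color this area by a value of about 102,000, yet no household in the sample earns anything close to it.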
So we have to be really careful in thinking about how we create these choropleth maps.
What are our color scheme choices?
Do we want to show some difference from zero?
So in this example, we have our divergent color scheme.
Should it be sequential?
Should it be qualitative?
And what sort of classification do we want to use?
Here, we talked about equal interval, quantile, maximum breaks,
and optimal breaks, and we can even think of
other classification methods we've learned from data mining to apply to these as well.
And there are tons of different map classifications and work on
which classifications are easiest for users to perceive.
And a lot of times, we're able to use some of
the simpler classifications like equal interval if we have
some underlying normal data distribution
that we may be able to get by transforming the data.
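As a sketch of that last point, the code below applies equal interval classification to some made-up, heavily skewed values, first as they are and then after a log transform. The values, the function names, and the choice of a base-10 log are assumptions for illustration.

```python
import math


def equal_interval_breaks(values, k):
    """Return k-1 breaks that cut the data range into k equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def classify(value, breaks):
    """Class index is the number of breaks the value exceeds."""
    return sum(value > b for b in breaks)


values = [1, 10, 100, 1000, 10000, 100000]  # hypothetical skewed data

# Equal interval on the raw values: nearly everything lands in the first class.
raw_breaks = equal_interval_breaks(values, 3)
raw_classes = [classify(v, raw_breaks) for v in values]

# Equal interval after a log transform: the classes fill evenly.
logs = [math.log10(v) for v in values]
log_breaks = equal_interval_breaks(logs, 3)
log_classes = [classify(v, log_breaks) for v in logs]

print(raw_classes)  # [0, 0, 0, 0, 0, 2]
print(log_classes)  # [0, 0, 1, 1, 2, 2]
```

On the raw values the middle class is empty, which is the kind of legend gap we want to avoid. After the transform, the simple equal interval scheme uses every class.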
So hopefully, from this module,
you're able to understand how we can apply some of the techniques we've been
learning to create our own map classifications,
which allow us to create our choropleth map,
set up the different color bins,
and create our resultant visualization.