So, we get color, we get shape, we get position,

we can even think about adding a texture to these circles if we wanted.

It gets sort of busy.

Notice we have different sizes.

So, we have size in the circle,

and here size corresponds to something like population.

Color corresponded to location in the world.

So, we're combining these different visual variables together to

move past just this 2D representation,

where we had variable X versus variable Y,

it's now multiple variables.
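
To make that combination of visual variables concrete, here is a minimal sketch of mapping one data record onto position, size, and color. The records, field names, and color palette are all made up for illustration, and scaling the circle's area (rather than its radius) with population is an assumption about how you might want to size the marks.

```python
import math

# Hypothetical records: names and numbers are invented for illustration only.
countries = [
    {"name": "A", "income": 1000, "life_exp": 55, "population": 4_000_000, "region": "Africa"},
    {"name": "B", "income": 9000, "life_exp": 72, "population": 16_000_000, "region": "Asia"},
    {"name": "C", "income": 30000, "life_exp": 80, "population": 1_000_000, "region": "Europe"},
]

# Assumed palette: one color per region.
REGION_COLORS = {"Africa": "blue", "Asia": "red", "Europe": "orange"}


def to_mark(rec, max_pop, max_radius=30.0):
    """Map one record onto position (x, y), size (r), and color."""
    # Make the circle's AREA proportional to population,
    # so the radius grows with the square root.
    radius = max_radius * math.sqrt(rec["population"] / max_pop)
    return {
        "x": rec["income"],                      # position: variable X
        "y": rec["life_exp"],                    # position: variable Y
        "r": radius,                             # size: population
        "color": REGION_COLORS[rec["region"]],   # color: location in the world
    }


max_pop = max(c["population"] for c in countries)
marks = [to_mark(c, max_pop) for c in countries]
for m in marks:
    print(m)
```

Each mark now carries four variables at once, and any drawing library could take it from there.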

Then he even adds in animation.

So we can move our data over time and so animation

provides us with yet another variable to see trends and patterns and changes over time.

There's nice work on looking at animated scatterplots by John Stasko.

So, you can take a look, I encourage you to watch Hans Rosling's Gapminder video,

just to see how his nice presentation goes for these multivariate cases.

Now the real question though is,

if I have all of these datasets, which scatterplots should I draw?

If I have all of these variables like income, life expectancy, population,

I can create a ton of different scatterplots for every two variables,

I can make a scatterplot for GDP vs the percentage of trade,

I can make a scatterplot for GDP versus life expectancy,

I can make a scatterplot for GDP versus population.

So, the more variables I have,

the more scatterplots that I can make.
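
Just to show how quickly this grows, here is a small sketch that lists every pair of variables and counts them. The variable names are placeholders I picked for illustration. With n variables there are n(n-1)/2 distinct pairs.

```python
from itertools import combinations
from math import comb

# Placeholder variable names, chosen only for illustration.
variables = ["GDP", "trade_pct", "life_expectancy", "population"]

# Every unordered pair of variables is one candidate scatterplot.
pairs = list(combinations(variables, 2))
for x_name, y_name in pairs:
    print(f"{x_name} vs {y_name}")


def count_pairs(n):
    """Number of distinct two-variable scatterplots for n variables."""
    return comb(n, 2)


print(len(pairs), count_pairs(10), count_pairs(50))
```

So four variables already give six plots, and the count grows quadratically from there.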

So, I want to think about how do I

help people again detect the expected, discover the unexpected?

How do I identify anomalies in these scatterplots?

How do I identify interesting trends?

One way is through what was coined as Scagnostics.

So, Tukey coined this back in the mid 80's,

talking about graph theoretic measures for

detecting structural anomalies in scatterplots.

So, if I draw a scatterplot,

what are the different things that I'm seeing?

So, for example, remember when I drew a scatterplot that had properties like this,

notice this has a clumpiness property.

Is there a way where I can theoretically have some mathematical computation of that?

This may be a really interesting view to say, "Hey,

these particular elements are highly related in these two variables."

So, we can use these graph theoretic measures to help

users pick views to show particular structures of interest.

This was coined by Tukey to help us determine

which relationships between variables we should pick,

if I don't have the time or capacity to show every possible pairwise combination.

Which is the best pairwise combination to show?

Which is the second best?

And so forth. So, Scagnostics gives us

a bunch of different equations that we can start calculating.

So, for example, we can figure out

the minimum convex hull that will enclose all of our points.

So, for example, if I draw some points on the screen,

the minimum convex hull is

the smallest convex polygon that's going to enclose all of these together.

We can measure things like area of this polygon to try to do some measure,

we can have some sort of correlation measures to talk about how correlated the data is,

and other elements like that.
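
As a rough sketch of that first step, here is one way to compute the enclosing convex polygon and its area. I'm using Andrew's monotone chain algorithm and the shoelace formula, which are standard choices but not necessarily what any Scagnostics library does internally, and the points are made up.

```python
def cross(o, a, b):
    """Cross product of OA x OB; positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Made-up points: four corners of a 4-by-3 rectangle plus two interior points.
points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
hull = convex_hull(points)
area = polygon_area(hull)
print(hull, area)
```

The two interior points drop out, and the hull is just the rectangle with area 12.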

There's a whole list of Scagnostics and people have been

working on those sorts of measures for a very long time.

Wilkinson proposed nine Scagnostic measures to characterize scatterplots.

So, outlying, sparse, striated,

skinny, monotonic, skewed, clumpy, convex, and stringy.

All of these have a different equation associated with them but what's nice is

Wilkinson developed a library where we can calculate all of these automatically,

and start using these to try and rank scatterplots.

Again, we're trying to think about importance,

showing people what's important and interesting in their data.
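
Here is a minimal sketch of ranking scatterplots by one such measure. As I understand it, the monotonic measure is the squared Spearman rank correlation, so that is what I compute below; this is a hand-rolled version, not Wilkinson's library, and the data values are invented.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def monotonic(x, y):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    return pearson(ranks(x), ranks(y)) ** 2


# Invented data: life expectancy rises steadily with GDP, population does not.
data = {
    "gdp": [1, 2, 3, 4, 5, 6],
    "life_expectancy": [50, 55, 61, 64, 70, 79],
    "population": [3, 9, 1, 7, 2, 8],
}

# Score every pair of variables, then show the most monotonic plots first.
ranked = sorted(
    combinations(data, 2),
    key=lambda pair: monotonic(data[pair[0]], data[pair[1]]),
    reverse=True,
)
for x_name, y_name in ranked:
    print(x_name, y_name, round(monotonic(data[x_name], data[y_name]), 3))
```

The same pattern works for any of the other measures: swap in a different scoring function and sort.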

We've talked about Shneiderman's visual information-seeking mantra,

where it's overview first, zoom and filter,

then details on demand, and Daniel Keim talked

about a visual analytics mantra that starts with analyze first as opposed to overview first.

So, if you have a large dataset you put it into some sort

of analytical framework whether it's going to be deep learning,

whether it's going to be unsupervised learning through clustering,

whether it's going to be supervised learning through decision trees, things like that.

Whether it's going to be creating a bunch of scatterplots and measuring

how outlying the points are or how skewed the points are.

We can use these measures to then characterize different scatterplots.

What's nice about scatterplots is we

can also create what's called a scatterplot matrix.

So, even if I have a whole lot of variables for a scatterplot,

I can actually go ahead and organize these into a matrix where I can

do each variable versus each other variable.

So for example, I have At Bats,

and every X-axis in this direction is the same.

So, we see we have our At Bats here.

So, every X axis is At Bats across our row.

Across our columns, we get changes in variables.

So, here we get At Bats versus At Bats,

we get Runs versus Runs,

we get Batting Average versus Batting Average.

In this example here,

we've got Batting Average and Runs,

we've got At Bats and Runs.

So, we can start looking and see if there's any interesting trends.

Now the diagonal is always going to be these straight lines.

This is due to the fact that we're plotting the same variable against itself,

so it's perfectly correlated.
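
Here is a small sketch of how the cells of a scatterplot matrix can be built, and why the diagonal comes out as a straight line. The player numbers are made up, and I'm assuming the common layout where each column shares an x variable and each row shares a y variable; the figure on the slide may be arranged differently.

```python
# Made-up stats for five players, for illustration only.
players = {
    "at_bats": [410, 520, 305, 610, 480],
    "runs": [55, 80, 30, 102, 61],
    "batting_avg": [0.251, 0.289, 0.230, 0.312, 0.270],
}
variables = list(players)


def splom_cells(table, names):
    """Return {(row, col): [(x, y), ...]} for every variable pairing.

    Assumed layout: the column picks the x variable, the row picks the y variable.
    """
    cells = {}
    for row, y_name in enumerate(names):
        for col, x_name in enumerate(names):
            cells[(row, col)] = list(zip(table[x_name], table[y_name]))
    return cells


cells = splom_cells(players, variables)

# On the diagonal a variable is plotted against itself, so x == y for every dot.
diagonal_is_line = all(
    x == y for i in range(len(variables)) for x, y in cells[(i, i)]
)
print(len(cells), diagonal_is_line)
```

Three variables give a 3-by-3 grid of nine cells, and every diagonal cell sits exactly on the line y = x.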

So again, each dot is a baseball player and we can start looking for trends and patterns.

Here we might say well,

there's one outlier in this plot

otherwise it looks like it might have some sort of trend here.

Here we may not see much of a relationship,

but scatterplots let us get a quick overview. The problem is

I could have rearranged any row and any column in any way I want.

So, I could have swapped these two columns or

these two rows and I would get a different order,

a different layout, and how I'm going to go through

these orders and layouts is really important and can take time.

That's where we might want to use these Scagnostic measures to think about

how we can order what we call our scatterplot matrix.

We can even think about adding interaction,

and adding interaction allows viewers to visualize other combinations of variables.

So, if I have a scatter plot in two dimensions,

I can always extrude a third dimension

and rotate my points and show what this looks like.

Niklas Elmqvist has a nice paper called "Rolling the Dice",

and you can take a look at some of the nice interactions he added

in with these scatterplot matrices and

allowing extrusions and things to let people visualize this in 3-dimensional space.

Again, we can add color to the points,

we can add some shape or sizes,

all sorts of things to add more information into these plots.