[MUSIC]

Now many of you have perhaps come across the phrase big data.

Now while this is not a course in big data per se, what we're going to consider

now really reflects some of the main challenges of the big data era.

So big data, perhaps in simplistic terms,

is a huge volume of data, hence that word big.

And our goal is to try and simplify, to make sense of it, and

draw insights from it.

So one of the key challenges of big data is what we call data reduction or

data summarization.

Reducing the complexity of our data set, or

we might say reducing the dimensionality of our data set, such that it's easier for

us to digest and draw some insights from it.

Now we'll just consider some fairly small data sets for this purpose.

But nonetheless, it underlies the main goal of trying to reduce the complexity

and see the big picture of our data set.

Now data reduction, we could really consider doing in one of two ways.

One is visually, which we'll consider here.

Namely data visualization, where we can produce simple plots

of particular variables and hopefully draw some initial insights from those.

And also we would like to come up with some simple descriptive or

summary statistics whereby we reduce a very complex data set numerically.

So reduce a large number of values into just perhaps one or two figures.

So data visualization, remember in the previous session we mentioned

the different levels of measurement.

We had our categorical and our measurable variables.

And we said at the time that we have statistical techniques to apply to

different types of variables.

Well, similarly with data visualization or data presentation if you prefer,

depending on the types of variables you have, there'll be different

types of graphical display which are appropriate in each situation.

So, for example, if we had some nominal data, for example,

if we considered the list of nations or took a sample of countries in the world.

So different countries, the name is simply a label,

an identifying label, and hence represents a nominal variable.

So let's imagine we had a sample of about 150 or so countries.

And then we may wish to recognize which continent each country belongs to.

So how might we wish to reduce maybe 150 countries

into a more easily digestible format?

Well perhaps a simple tabular form might be appropriate.

We could have a table with the frequency counts representing how frequently

nations in different continents appeared in our sample data set.

But if, let's say, we had 20 appearing as a frequency count.

Is 20 a big value or a small value?

Well it's going to depend on what sort of a context we're dealing with.

20 out of 22, say, is perhaps a large value.

20 out of 1,000, a much smaller value.

So rather than consider simple frequency counts, we may wish to transform

these into percentages which would reflect the relative frequency

with which let's say the different continents occurred in our data set.
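
As a rough sketch of that step from frequency counts to percentages, the snippet below tabulates a categorical variable. The continent labels and their counts are made up purely for illustration, and the helper name frequency_table is hypothetical.

```python
from collections import Counter

# Hypothetical sample: the continent of each sampled country (made-up data).
continents = (
    ["Africa"] * 8 + ["Asia"] * 6 + ["Europe"] * 5
    + ["Americas"] * 4 + ["Oceania"] * 2
)

def frequency_table(values):
    """Return {category: (count, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

table = frequency_table(continents)

# Print the categories from most to least frequent.
for cat, (n, pct) in sorted(table.items(), key=lambda kv: -kv[1][0]):
    print(f"{cat:<10}{n:>4}{pct:>8.1f}%")
```

The percentage column answers the "is 20 big or small?" question directly: 20 out of 22 is roughly 91%, whereas 20 out of 1,000 is 2%.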

Now, perhaps a challenge for you is to go to any sort of mainstream media and

look for examples of tables in various news articles, and

see what kind of variables are being summarized within those tables.

But sometimes we, as human beings, can relate more to pretty pictures and

diagrams than potentially just lots of figures appearing in tabular form.

So if, for example, we had a categorical variable,

such as the continent to which a particular country belongs,

then an appropriate graphical display might be a simple bar chart.

Now, of course, in practice,

we would use computer software to generate diagrams for us.

We'd never really draw these things manually with pencil and paper.

But do be conscious that when computer software produces a diagram,

take great care that the diagram which is generated is not actually

distorting what you are trying to communicate to the wider audience.

So take great care about any axes which are used and

any scales used on those axes.

And make sure that your diagram is not distorting what you're trying to convey.
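
One way to see how an axis scale can distort a bar chart is to compare how tall two bars are drawn when the axis starts at zero versus somewhere higher. This is only an illustrative sketch: the two counts, the axis starting points, and the function name apparent_ratio are all assumptions, not anything from the course data.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b
    when the vertical axis starts at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)

# Hypothetical counts for two categories: 52 and 50.
honest = apparent_ratio(52, 50)         # axis starts at 0
distorted = apparent_ratio(52, 50, 48)  # axis starts at 48

print(f"axis from 0:  first bar is {honest:.2f} times the second")
print(f"axis from 48: first bar is {distorted:.2f} times the second")
```

With the axis starting at zero the bars look almost identical (a ratio of 1.04), but starting the axis at 48 makes the first bar appear twice as tall as the second, even though the data has not changed.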

Equally, if you're looking at a diagram produced by someone else,

particularly by perhaps a politician or an advertising company,

they may be deliberately trying to distort what the data genuinely is showing

to try to get across their particular perspective and point of view.

Now perhaps a very common type of diagram used when

analyzing a single measurable variable might be that of a histogram.

So, for example, we might consider the GDP per capita

across a set of countries within a sample data set.

And a histogram very clearly brings to life the data, and

you get a very strong sense of the distribution of GDP per capita

across the various countries in our data set.
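
Behind a histogram sits a simple counting exercise: split the range of the variable into intervals and count how many observations land in each. The sketch below does this with equal-width intervals and prints a rough text version. The GDP per capita figures are invented for illustration, and the choice of four bins is arbitrary.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("all values are identical, so there is nothing to bin")
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts

# Hypothetical GDP per capita figures in US dollars (made-up data).
gdp_per_capita = [600, 900, 1200, 1500, 2100, 2500, 3200,
                  4000, 5200, 7500, 11000, 18000, 42000, 60600]

edges, counts = histogram_counts(gdp_per_capita, 4)
for k, c in enumerate(counts):
    print(f"{edges[k]:>8.0f} to {edges[k + 1]:>8.0f} | {'#' * c}")
```

In this made-up sample most of the countries pile up in the lowest interval, with a long tail of a few wealthy ones, which is the kind of shape a histogram makes visible at a glance.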

Now that word distribution, that perhaps rings a bell with some work we did in week

two of the course when we constructed some simple probability distributions.

For example, for the score for a fair die, say.

Or perhaps the chances of getting heads and tails when tossing a fair coin.

But, of course, those probability distributions were derived theoretically.

For example, that die was fair, hence the six equally likely outcomes.

And we attached a probability of one over six to each of those.

In contrast here, we're looking not at a

theoretical probability distribution, but rather a sample distribution.

That is, the distribution of the variable being considered within our sample data set.
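
To make the contrast between a theoretical and a sample distribution concrete, here is a small sketch using the fair die. The theoretical probabilities are written down directly, while the sample distribution comes from simulated rolls. The number of rolls (600) and the random seed are arbitrary assumptions.

```python
import random
from collections import Counter
from fractions import Fraction

# Theoretical distribution of a fair die: six equally likely outcomes.
theoretical = {face: Fraction(1, 6) for face in range(1, 7)}

# Sample distribution: relative frequencies observed in simulated rolls.
rng = random.Random(0)  # fixed seed, chosen arbitrarily, for repeatability
rolls = [rng.randint(1, 6) for _ in range(600)]
counts = Counter(rolls)
sample = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face in range(1, 7):
    print(f"{face}: theoretical {float(theoretical[face]):.3f}, "
          f"sample {sample[face]:.3f}")
```

The sample relative frequencies will typically sit close to one over six without matching it exactly, which is the sense in which a sample distribution describes the data we happen to have rather than the underlying theory.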

So looking at this histogram of GDP per capita,

you can see that it varies a great deal across the countries in our data set.

With the vast majority of countries being quite poor on a GDP per capita basis,

and one or two countries performing very well on a GDP per capita basis.

Of course, we need to look at these things more numerically, and

just because a country has a high GDP per capita

does not necessarily mean everyone within that country is wealthy.

Per capita is simply an example of an average which we're going to be looking at

in the next section whereby we're looking at the total GDP in a country and

imagine, hypothetically, it was split equally among the population.

Equally, in countries with very low levels of GDP per capita,

if there was a lot of corruption, say, those at the very elite of society,

those at the top, are perhaps doing very well in life,

with the vast majority of the population struggling a great deal.
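
A tiny numerical sketch of why a per capita figure can hide a lot: below, two made-up countries of five people each have exactly the same total income, and so the same per capita figure, yet the split between individuals is completely different. The incomes and the function name per_capita are hypothetical.

```python
def per_capita(total, population):
    """Total divided equally among the population."""
    return total / population

# Hypothetical individual incomes in two made-up five-person countries.
equal_country = [20_000, 20_000, 20_000, 20_000, 20_000]
unequal_country = [2_000, 2_000, 2_000, 2_000, 92_000]

a = per_capita(sum(equal_country), len(equal_country))
b = per_capita(sum(unequal_country), len(unequal_country))

# How many people in the unequal country earn less than the per capita figure?
below = sum(1 for income in unequal_country if income < b)

print(f"equal country:   {a:.0f} per capita")
print(f"unequal country: {b:.0f} per capita, {below} of 5 people below it")
```

Both countries come out at 20,000 per capita, even though in the second one four of the five people earn a tenth of that.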

So GDP per capita is perhaps one fairly simplistic metric

to gain some sense of how wealthy a country is.

But, of course, it is not revealing the entire picture.

But nonetheless, it's achieving a simple goal of data reduction.

In a simple histogram we can convey quite a lot of information about how wealth is

spread in different countries around the world.

But so far, things like the bar chart and the histogram are for

displaying a single variable each time.

But what if we wanted to go a bit beyond this and

start to consider perhaps the relationships between two variables?

Well remember, the builder or decorator with the tool kit, different tools for

different jobs.

Similarly, we have different types of graphical display

depending on the types of variables as well as the numbers of variables present.

So continuing with countries for a moment, let's consider GDP per capita.

So, perhaps an imperfect metric, but

a metric nonetheless of how wealthy countries are.

And perhaps let's also consider corruption.

Specifically here the control of corruption.

So in this case, if we consider a control of corruption variable,

low scores would indicate that there's not much control of corruption and hence,

there's a large amount of corruption in a particular country.

So control of corruption, GDP per capita, so

these would be two measurable variables.

So if we wanted to display these,

then a scatter diagram would be the appropriate form to use.

So if we consider plotting on the x axis, so the horizontal axis, the control

of corruption variable and on the y axis, the GDP per capita variable,

one can instantly see whether or

not there seems to be a relationship between those two variables.

Now this scatter plot tends to demonstrate a positive relationship,

not perfect, but nor is it a perfect world.

And it's very rare in the social sciences, in particular,

to come across perfect relationships.

But there does seem to be a tendency that countries which tend to control corruption

more, and hence have lower levels of corruption,

tend to enjoy higher levels of GDP per capita.

Now, we could get into some semantics about whether or

not this is a linear or non-linear relationship.

But to keep the argument fairly simplistic, let's assume that we could

approximately fit a line through those points, albeit not a perfect fit.
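
The lecture does not say how such a line would be fitted. One common choice, assumed here purely for illustration, is the least-squares line, which the sketch below computes from scratch. The control of corruption scores and GDP per capita figures are invented, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: control of corruption score (x) and
# GDP per capita in thousands of US dollars (y).
control = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
gdp = [1.0, 2.5, 2.0, 6.0, 9.0, 15.0, 30.0, 45.0]

slope, intercept = fit_line(control, gdp)
print(f"fitted line: gdp = {intercept:.2f} + {slope:.2f} * control")
```

A positive slope here matches the upward drift of the points, although for data like these a straight line is clearly only an approximation.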

So this would indicate that there is some degree of correlation

between these two variables.

Countries with low levels of corruption, i.e., where corruption is very well

controlled, tend to enjoy high levels of GDP per capita.
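
A standard way to put a number on "some degree of correlation" is the Pearson correlation coefficient, which the lecture does not name and which is assumed here as the measure. It ranges from -1 to +1, with values near +1 indicating a strong positive linear relationship. The data below are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: control of corruption score and
# GDP per capita in thousands of US dollars.
control = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
gdp = [1.0, 2.5, 2.0, 6.0, 9.0, 15.0, 30.0, 45.0]

r = pearson_r(control, gdp)
print(f"correlation between control of corruption and GDP per capita: {r:.2f}")
```

A value between 0 and 1 says the two variables tend to rise together without lying exactly on a line. It says nothing about which one, if either, is driving the other.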

But, of course, correlation does not necessarily imply causality.

Of course, many politicians and

advertisers would perhaps be wise to remember that fact.

Correlation does not equal causality.

Indeed, particularly with the social sciences,

it can be a great challenge to actually try and infer causality.

Let's consider corruption versus GDP per capita.

You could give some sort of more qualitative arguments for

why one would influence the other.

Perhaps if a country suffers from high levels of corruption,

maybe that is stifling the ability of the economy to grow.

And hence the economy has a low level of GDP and

hence a low level of GDP per capita.

So that argument would suggest that the level of corruption is determining

the level of economic prosperity.

Of course, you could argue it's the other way round.

If an economy is performing very well and everyone is very wealthy,

then there is no need to have corrupt officials, say.

So a correlation can be easily perhaps identified from a scatter plot,

but trying to infer causality is a much more challenging proposition.

But nonetheless, a nice extension from displaying a single variable,

perhaps through a bar chart if it's categorical or

a histogram if it's measurable, to an example where

we can show two measurable variables through a scatter plot.

So again, perhaps another takeaway challenge for you.

Have a look through the mainstream media and try and

find some articles which are showing various scatter diagrams.

And think about what sorts of relationships between the variables are

being conveyed in the report itself.

[MUSIC]