[MUSIC]

So let us proceed to the first of the three major approaches within this course.

The first is factorizing data.

Let me first play a couple of important quotes that will make the point I'm going to be trying to make in the next few slides.

One, this is a quotation from Mike Jensen's 2001 paper.

It is logically impossible to maximize in more than one dimension unless those dimensions are monotonic transformations of one another.

You tell a manager to maximize current profits, market share, and future growth in profits, all of that together, and basically the manager will have no way to make a reasoned decision.

What they are going to do is use heuristics, use rules of thumb, basically shoot in the dark.

So the result will be confusion that handicaps the firm in its competition for survival.

Here's another quotation along similar lines.

Firms are swamped with measures; it is commonplace for firms to have 50 to 60 top-level measures, and managers are supposed to measure, keep track of, and optimize all of them.

Many firms have struggled, unsuccessfully, to drive measures of shareholder value because of this huge confusion that we see around us.

And this is from Fred Meyer's 2002 paper.

Which brings me then to the why of factorizing data.

So based on the quotations that we just saw, it becomes clear that, at least in some instances, businesses have way more data than they probably need, way more variables, metrics, and measures than is optimal.

Take a look at this: 50 digital metrics, another 50 digital metrics, and so on.

If there were some way by which I could reduce the size, or the dimensionality, of this data without degrading the information contained, such a method would be very valuable indeed.

Now, if there exists a coherent group of variables, that is one of the situations in which this comes through.

If there is a coherent group of variables that are interrelated to one another, then ideally I'd like to identify them, extract their information content, and project them onto a single variable; that would help a lot.

So let me, in some sense, show you how this might go.

The single variable would ideally represent and replace the entire variable group.

And that is where we enter the factorization of data.

Next, the how of factorizing data, at an intuitive level; I'm not going to go algorithmic here.

Intuitively, what do we mean by factorizing data?

The key word is factor.

A factor is quite literally the converse of a multiple.

Now here's an example, okay?

Using primes and composite numbers.

3 times 4 is 12, right?

So 12 is a multiple of 3, and 3 is a factor of 12.

So basically one and the other are just converses.

Now this is just numbers.

So let's actually take this to systems of numbers; let's get to matrices.

Recall matrix multiplication, say I have two matrices, A and B.

A has the dimension i by j, right?

So let's say, 10 rows and 5 columns, 10 by 5.

And B has the dimensions j by k.

So let's say, 5 rows and 20 columns.

What happens when I multiply the two?

What I basically get is the 5 dropping off, and I would get 10 by 20.

So AB would be i by k.

Conversely, this product AB can be factorized.

It can be split into two factors, into two matrices, A and B.
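That shape arithmetic is easy to verify. The course's hands-on work is in R, but here is a minimal sketch in Python with NumPy, using the 10 by 5 and 5 by 20 shapes from the example:

```python
import numpy as np

# A is i x j (10 x 5), B is j x k (5 x 20), the shapes from the example
A = np.random.rand(10, 5)
B = np.random.rand(5, 20)

# The inner dimension j = 5 drops off: (10 x 5) @ (5 x 20) -> (10 x 20)
AB = A @ B
print(AB.shape)
```

Factorization runs this in reverse: given the 10 by 20 product, recover two smaller matrices whose product approximates it.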

Where am I going with this?

Why am I talking about it?

Because we are very interested in factorizing matrices.

Because data sets are ultimately matrices.

Here is an example.

Let me show you the psychographic profile of an MBA class at ISB.

Now there is this entire theory of personality that talks about what I'm going to discuss next.

There are these five factors that summarize the personality model, the Big Five.

So you have openness, conscientiousness, extraversion, agreeableness, and neuroticism.

So you have these five factors.

I would like you, in some sense, to stop and actually look at the notes given, which display these five factors in a little more detail.

How do I factorize an MBA cohort's personality?

Basically, there was a web survey that they had taken, and you can see the kind of questions they were asked.

Each of those questions relates, in some sense, to one of those factors.

208 respondents took this survey.

There were 45 questions, so my data matrix size is 208 by 45.

How many factors underlie these 45 variables?

Well, it's the Big Five, so there are 5.

I'd like you to go to reading one, and

look at the details of what we are about to do next.

Okay, so what we actually have is a 208 by 45 matrix, which is our input matrix, and the number of factors, we already know from theory, is 5.

So what I'm going to do is factorize the 208 by 45 matrix into a 208 by 5 matrix called factor scores, and a 5 by 45 matrix called factor loadings.

So what I'm actually doing is taking this large 208 by 45 object and projecting it onto two smaller subspaces.

And a third output, which comes out as a byproduct of what we are doing, is the uniqueness and communality scores.
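To make those shapes concrete: the lecture's analysis is run in R, but a sketch in Python with scikit-learn, on a synthetic stand-in for the 208 by 45 survey matrix, produces the same three outputs (the data here are random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Synthetic stand-in for the 208-respondent x 45-question survey matrix
X = rng.normal(size=(208, 45))

fa = FactorAnalysis(n_components=5, random_state=0)
scores = fa.fit_transform(X)     # factor scores: 208 x 5
loadings = fa.components_        # factor loadings: 5 x 45
uniqueness = fa.noise_variance_  # one uniqueness score per variable: 45

print(scores.shape, loadings.shape, uniqueness.shape)
```

In the actual exercise these come from the R routine in the handout; the point here is only the shapes: 208 by 45 splits into (208 by 5) times (5 by 45), plus one uniqueness score per variable.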

Now, here is what I'm going to ask you to do; we will come to how to actually run this procedure in a short while, so hang on.

Visualizing your survey data: when the survey data were, in some sense, factorized, what did they look like?

Well, this is what they looked like.

Each of those nodes is one of the variables.

And you can clearly see coherent groups of variables emerge.

You can basically see the connections between the nodes, the edges.

The tighter the edge, the higher the correlation.

The green edges are positive correlations, the red edges negative correlations; either way, the correlation is strong.

The absence of an edge means a near-zero correlation, so you can clearly see five distinct groups of variables emerging.

That is exactly what we get from factorization of data.
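That graph view can be reconstructed from nothing more than a correlation matrix and a cutoff. A toy sketch in Python (the two-group data and the 0.5 cutoff are made-up illustrations, not the lecture's survey):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two coherent groups of three variables each, driven by
# two hidden drivers g and h plus noise
g = rng.normal(size=(100, 1))
h = rng.normal(size=(100, 1))
X = np.hstack([g + 0.3 * rng.normal(size=(100, 3)),
               h + 0.3 * rng.normal(size=(100, 3))])

R = np.corrcoef(X, rowvar=False)  # 6 x 6 correlation matrix

# Keep an edge where |correlation| exceeds a cutoff; the sign of R[i, j]
# is what the lecture colors green (positive) or red (negative)
edges = [(i, j) for i in range(6) for j in range(i + 1, 6)
         if abs(R[i, j]) > 0.5]
print(edges)
```

On this toy data, the surviving edges should be exactly the within-group pairs, which is the lecture's five-cluster picture with five replaced by two.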

Working with data factorization: at this point, I'm going to stop and ask you to try to run factor analysis in RStudio, on your local machines, directly using the code I'm giving you.

Please follow the instructions for how to do this.

Go to handout reading one and the reading will basically display the results

that will emerge from the analysis.

So basically what I'm going to ask you to do is copy and

paste these three, four lines that I give you.

It might take some time if you don't already have the libraries installed.

The system will automatically download and install those libraries, so you should have a good net connection.

If you are watching on Coursera, my guess is you already have one, right?

Also note that you will need to do this on a desktop; the app doesn't work on mobiles.

So once that is done, it will open a desktop app for you, into which you read the data, and basically these results will show.

So we're talking about factor loadings, factor scores, uniqueness scores, and so on.

Some interpretation of these results is also given in the reading, and I would request you to go through that carefully.

Implications that flow from the interpretation then would become obvious,

in some sense.

Okay, which takes me on to summarizing factorization.

So think back about what we just did, right?

What does factorization need?

What it actually needs is a data matrix; that's basically all it needs: rows as units and columns as variables.

And the second input it needs is the optimal number of factors for the solution, which has to be provided beforehand.

You can ask the system to search for the optimal number of factors, and it will do that, okay?

But for now, let's keep things easy: suppose you already know what the optimal number of factors is.
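One common search heuristic, when the number of factors is not known from theory, is the Kaiser rule: keep as many factors as there are correlation-matrix eigenvalues above 1. A toy sketch in Python (the planted two-factor data are illustrative, and this rule is one heuristic among several, not the only criterion):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data: 2 hidden factors driving 8 observed variables, plus noise
f = rng.normal(size=(300, 2))
W = rng.normal(size=(2, 8))
X = f @ W + 0.5 * rng.normal(size=(300, 8))

# Kaiser rule of thumb: count eigenvalues of the correlation matrix
# that exceed 1
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
n_factors = int((eigvals > 1).sum())
print(n_factors)
```

On this toy data the rule should recover the two planted factors; in practice one would also look at a scree plot and at whether the factors are interpretable.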

What it will also give you is a visualization of the factor solution space, the correlational structure of the variables.

If variables one, five, and ten are correlated and part of the same group, I will know that; the latent structure of the data emerges, basically.

So that's what we want to see: the arrangement of the units in factor space.

What does it mean?

It basically means that any wide data set where you have a large number of

variables can be factorized into key variable groups.

It also means that any unit of analysis can be positioned and

grouped in factor space.

It applies to most kinds of datasets and units of analysis.

Remember, this is correlational only, and we cannot make causal inferences from this.

However, even at the correlational level, a lot of insight and a lot of structure emerges.

It is very powerful in its application.

So let me quickly recap factorization; so what is it?

It is a general procedure to efficiently reduce data dimensionality.

It does other things also, but we are looking at this particular angle of what it does. And how does it do this?

It projects a large-dimensional variable space onto a small-dimensional factor space.

That's basically what it's doing. And then, what does it yield?

It yields the position coordinates of every respondent in factor space; the factor scores would be that. And it yields the composition of each factor; those are the factor loadings.

It requires that the factors be interpreted, and this can get tricky.

What does this factor mean?

I have variables, one, five, and ten loading onto factor one.

What does factor one mean?

You have to come up with that interpretation based on what those variables mean.

What is the latent driver that integrates or unites these three variables?

Okay, it works very well with structured metric data, as we saw.

Will it work with unstructured, non-metric data as well?

It turns out it does, and we will see this later today.

Now, where might this be applied, or hold, for business in general?

A lot of places, basically, a lot of places.

Applying factorization to your context: how many disparate variables are there in a typical ERP system, in a supply chain management system, in a CRM system?

Just so many of them; imagine the number of variables out there.

Would it not be great to know if there is a coherent structure that underlies this disparate set of metrics and measures?

Think of a classic Compustat or a Prowess, or a Capital IQ database, right?

Think about customer profiles and touch points, and purchase histories.

Just imagine the number of variables involved.

If I could dimension reduce them while retaining most of the information,

just how valuable is that?

Or press and social media mentions; I mean, there's just a bunch of things that are possible.

And all of these are disparate variables; if there is some structure underlying them, the method will find it. That's basically what's going on.

How about a group exercise to find out?

I'm going to give out some questions, and you are free to meet up in the forum, discuss with other folks, and come up with some solutions.

[MUSIC]