[MUSIC]

So let us proceed to the first of the three major approaches within this course,

the first being factorizing data.

Let me first show a couple of important quotes that will make the point that I'm

going to be trying to make in the next few slides.

First, this is a quotation from Mike Jensen's 2001 paper.

It is logically impossible to maximize in more than one dimension unless those

dimensions are actually monotonic transformations of one another.

You tell a manager to maximize current profits, market share,

future growth in profits, and all of that together.

Basically, the manager will have no way to make a reasoned decision.

What they are going to do is use heuristics,

use rules of thumb, basically shoot in the dark.

So the result will be confusion that handicaps the firm in its competition for

survival.

Here's another quotation along similar lines.

Firms are swamped with measures. It is commonplace for

firms to have 50 to 60 top-level measures, and managers are supposed to measure and

keep track of all of them and optimize all of them.

Many firms have struggled, unsuccessfully, to drive measures of

shareholder value because of this huge confusion that we see around us.

And this comes from Fred Meyer's 2002 paper.

Which brings me, then, to the why of factorizing data.

So based on the quotations that we just saw,

it becomes clear that, at least in some instances,

businesses have way more data than they probably need, way more variables,

metrics, and measures than is optimal.

Take a look at this:

50 digital metrics, 50 digital metrics, and so on.

If there were some way by which I could reduce the size,

or the dimensionality, of these metrics, of this data, without degrading

the information contained, such a method would be very valuable indeed.

Now, one of the ways in which this comes through is if there exists

a coherent group of variables.

If there is a coherent group of variables that are interrelated with one another,

then ideally I'd like to identify them, extract their information

content, and project it onto a single variable. That would help a lot.

So let me, in some sense, show you how this might go.

The single variable would ideally represent and

replace the entire variable group.

And that is where we will be entering the factorization of data.

The how of factorizing data, at an intuitive level;

I'm not going to go algorithmic here.

Intuitively, what do we mean by factorizing data, right?

The word is "factor".

A factor is quite literally the inverse of a multiple.

Now here's an example, okay,

using primes and composite numbers.

3 times 4 is 12, right?

So 12 is a multiple of 3, and 3 is a factor of 12.

So basically, "multiple" and "factor" are just converses of each other.

Now this is just numbers.

So let's actually take this to systems of numbers; let's get to matrices.

Recall matrix multiplication. Say I have two matrices, A and B.

A has the dimensions i by j, right?

So let's say 10 rows and 5 columns, 10 by 5.

And B has the dimensions j by k.

So let's say 5 rows and 20 columns.

What happens when I multiply the two?

What I basically get is the 5 dropping off, and I would get 10 by 20.

So AB would be i by k.

Conversely, this product AB can be factorized.

It can be split into two factors, into two matrices, A and B.
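To make the "5 drops off" point concrete, here is a minimal Python sketch that multiplies a 10 by 5 matrix by a 5 by 20 matrix and checks the shape of the product. It is an illustration only, not the course's RStudio code. The helper names `matmul` and `shape` and the fill values are assumptions made up for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)  # number of rows of B, must match the columns of A
    if any(len(row) != inner for row in a):
        raise ValueError("columns of A must equal rows of B")
    cols = len(b[0])
    return [[sum(row[t] * b[t][c] for t in range(inner)) for c in range(cols)]
            for row in a]


def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]))


# A is 10 by 5 and B is 5 by 20; the entries are arbitrary illustrative values.
A = [[r + c for c in range(5)] for r in range(10)]
B = [[r * c + 1 for c in range(20)] for r in range(5)]

AB = matmul(A, B)
print(shape(A), shape(B), shape(AB))  # → (10, 5) (5, 20) (10, 20)
```

The shared inner dimension (5) disappears in the product. Read backwards, the 10 by 20 matrix AB is something that can be split back into a 10 by 5 factor and a 5 by 20 factor.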

Where am I going with this?

Why am I talking about it?

Because we are very interested in factorizing matrices,

because data sets are ultimately matrices.

Here is an example.

Let me show you the psychographic profile of an MBA class at ISB.

Now there is this entire theory of personality

that talks about what I'm going to talk about next.

There are these five factors that summarize the personality model.

So you've openness, conscientiousness, you've extraversion,

agreeableness, and you've neuroticism.

So you have these five factors.

I would like you, in some sense, to stop and actually look at the notes given,

which display these five factors in a little more detail.

How do I factorize an MBA cohort's personality?

Basically, this was the web survey that they had taken, and

you can see the kind of questions they were asked.

Each of those questions, in some sense, relates to one of those factors.

There were 208 respondents who took this survey.

There were 45 questions, and so my data matrix size is 208 by 45.

How many factors underlie these 45 variables?

Well, it's the Big Five, so there are 5.

I'd like you to go to reading one and

look at the details of what we are about to do next.

Okay, so what we actually have is a 208 by 45 matrix, which is our input matrix,

and the number of factors, which we already know from theory, is 5.

So what I'm going to do is factorize the 208 by 45

matrix into a 208 by 5 matrix called factor scores

and a 5 by 45 matrix called factor loadings.

So what I'm actually doing is taking this large 208 by 45 object and

projecting it into two different subspaces.

All right, and there is a third output, which will come out, in some sense,

as a by-product of what we are doing: the uniqueness and communality scores.
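The bookkeeping behind this split can be sketched in a few lines of Python. This is only a shape-and-size illustration, not a factor analysis: the score and loading values below are random placeholders rather than estimates from the survey, and the variable names are assumptions made up for this example. The dimensions (208, 45, 5) are the ones from the lecture.

```python
import random

random.seed(0)  # fixed seed so the placeholder values are reproducible

n_respondents, n_questions, n_factors = 208, 45, 5

# Placeholder factor scores: one row per respondent, one column per factor.
scores = [[random.gauss(0, 1) for _ in range(n_factors)]
          for _ in range(n_respondents)]

# Placeholder factor loadings: one row per factor, one column per question.
loadings = [[random.uniform(-1, 1) for _ in range(n_questions)]
            for _ in range(n_factors)]

# Scores times loadings gives back an object of the original size: the part
# of each answer that the five factors account for.
common = [[sum(s[f] * loadings[f][q] for f in range(n_factors))
           for q in range(n_questions)]
          for s in scores]

full_cells = n_respondents * n_questions
factored_cells = n_respondents * n_factors + n_factors * n_questions

print((len(common), len(common[0])))  # → (208, 45)
print(full_cells, factored_cells)     # → 9360 1265
```

The two factor matrices together hold 1,265 numbers against 9,360 in the raw data matrix. Whatever the factors do not account for in a given variable is what the uniqueness scores report.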

We will come to how to actually run this procedure

in a short while, so hang on.

Visualizing your survey data: basically,

when the survey data were, in some sense, factorized, what did they look like?

Well, this is what they looked like.

Each of those nodes is one of those variables.

And you can clearly see coherent groups of variables emerge.

You can basically see the connections between the nodes, the edges.

The thicker the edge, the higher the correlation.

The green edges are positive correlations and

the red edges are negative correlations; either way, the correlation is strong.

The absence of an edge would be a zero correlation, so

you can clearly see five distinct groups of variables emerging.

That is exactly what we get from factorization of data.
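The logic of such a picture can be sketched in Python: compute the correlation for every pair of variables, draw an edge only when the correlation is strong, and colour it by sign. This is a minimal illustration under stated assumptions. The four question columns are invented toy data, the 0.5 cutoff is an arbitrary choice, and the actual plot in the reading may be built differently.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Invented answers of six respondents to four survey questions.
answers = {
    "q1": [1, 2, 3, 4, 5, 6],
    "q2": [2, 4, 6, 8, 10, 12],  # rises with q1
    "q3": [6, 5, 4, 3, 2, 1],    # falls as q1 rises
    "q4": [3, 1, 2, 2, 1, 3],    # uncorrelated with the others
}

names = list(answers)
edges = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(answers[a], answers[b])
        if abs(r) >= 0.5:  # illustrative cutoff for a "strong" correlation
            edges[(a, b)] = "green" if r > 0 else "red"

print(edges)
# → {('q1', 'q2'): 'green', ('q1', 'q3'): 'red', ('q2', 'q3'): 'red'}
```

Here q1, q2, and q3 form one connected group, while q4 has no edges at all, which is the kind of grouping the visualization makes visible at a glance.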

Working with data factorization: at this point, I'm going to stop

and basically ask you to try to run factor analysis in RStudio,

on your local machines, directly, by using the code I'm giving.

Please follow the instructions for how to run it.

Go to handout reading one; the reading will basically display the results

that will emerge from the analysis.

So basically what I'm going to ask you to do is copy and

paste these three or four lines that I give you.

It might take some time if you don't already have the libraries installed.

The system will automatically download and install those libraries, so

you should have a good net connection.

If you are watching on Coursera, my guess is you already have one, right?

You will need to do this on a desktop;

the app doesn't work on mobiles.

So once the libraries are installed, it will open a desktop app for

you, into which you read the data, and basically these results will show.

So we're talking about factor loadings, factor scores,

uniqueness scores, and so on.

Some interpretation of these results is also given in the reading, and

I would request you to go through that carefully.

Implications that flow from the interpretation would then become obvious,

in some sense.

Okay, which takes me on to summarizing factorization.

So think back about what we just did, right?

What does factorization need?

What it actually needs is a data matrix; that is basically all it needs,

with rows as units and columns as variables.

And the second input that it needs is the optimal number of factors

for the solution, which has to be provided beforehand.

You can ask the system to search for

the optimal number of factors, and that it will do, okay?

But for now, let's take the simpler case.

Suppose you already know what the optimal number of factors is.

What it will also give you is a visualization of the factor solution space,

the correlational structure of the variables.

If variables one, five, and ten are correlated, part of the same group,

I will know that; the correlational structure of the data is emerging, basically.

So that's what we want to see:

the arrangement of the units in factor space.

What does it mean?

It basically means that any wide data set, where you have a large number of

variables, can be factorized into key variable groups.

It also means that any unit of analysis can be positioned and

grouped in factor space.

It applies to most kinds of datasets and units of analysis.

Remember, this is correlational only, and we cannot make causal inferences from this.

However, even at the correlational level,

a lot of insight, a lot of structure emerges.

It is very powerful in its application.

So let me quickly recap factorization. What is it?

It is a general procedure to efficiently reduce data dimensionality.

It does other things also, but

we are looking at this particular angle of what it does. And how does it do this?

It projects this large-dimensional variable space

onto a small-dimensional factor space.

That's basically what it's doing. And then what does it yield?

It yields the position coordinates of every respondent in factor space;

the factor scores would be that.

And it yields the composition of each factor; the factor loadings would be that.

It requires that the factors be interpreted, and this can get tricky.

What does this factor mean?

I have variables one, five, and ten loading onto factor one.

What does factor one mean?

You have to come up with that interpretation based on what those

variables mean.

What is that latent driver that integrates or unites these three variables?
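The first step of that interpretation, finding out which variables load onto which factor, is mechanical and can be sketched in Python. Everything here is hypothetical: the loading values, the variable names, the two-factor setup, and the 0.5 cutoff are invented for illustration. The table is also laid out with one row per variable, which is the transpose of the factors-by-variables layout described in the lecture.

```python
# Hypothetical loadings: one row per variable, one column per factor.
loadings = {
    "var1":  [0.82, 0.05],
    "var2":  [0.10, 0.77],
    "var5":  [0.74, -0.12],
    "var7":  [-0.08, 0.69],
    "var10": [-0.71, 0.15],
}


def variables_on_factor(table, factor, cutoff=0.5):
    """List the variables whose absolute loading on `factor` meets the cutoff."""
    return [name for name, row in table.items() if abs(row[factor]) >= cutoff]


groups = {f: variables_on_factor(loadings, f) for f in range(2)}
print(groups)
# → {0: ['var1', 'var5', 'var10'], 1: ['var2', 'var7']}
```

The sign matters when naming the factor: in this toy table var10 loads negatively on the first factor, so whatever that factor represents, var10 moves against it. The naming itself still has to come from what the variables mean.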

Okay, it works very well with structured metric data, and we saw that.

Will it work with unstructured non-metric data as well?

It turns out it does, and we will see this later today.

Now where might this be applied in business in general?

A lot of places, basically, a lot of places.

Applying factorization to your context: how many disparate variables are there

on a typical ERP system, in a supply chain management system, in a CRM system?

Just so many of them; imagine the number of variables out there.

Would it not be great to know if there is a coherent structure that

underlies this disparate set of metrics and measures?

Think of a classic Compustat or a Prowess, or a Capital IQ database, right?

Think about customer profiles and touch points, and purchase histories.

Just imagine the number of variables involved.

If I could dimension-reduce them while retaining most of the information,

just how valuable is that?

Think of press and social media mentions;

I mean, there's just a bunch of things that are possible.

And all of these are disparate variables. If there is some structure

underlying them, the method will find it; that's basically what's going on.

How about a group exercise to find out?

I'm going to give out some questions, and you are free to meet up in the forum,

discuss with other folks, and come up with some solutions.

Â [MUSIC]
