The critical feature of this historical data is

that the classification of the observations is known and

it is used to learn how to classify future observations.

Because this piece of information is available,

the process is known as supervised learning.

For instance, this table shows ten answers to the job satisfaction survey and

it also indicates whether or not the employee quit.

The two employees who quit had low ratings for the salary questions, five,

six, and seven, and mixed ratings for the supervisor questions, one through four.

A prediction model built on this data will fall in the category of

supervised learning, because the outcome that

the model is trying to predict is known in historical data.
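The setup described above can be sketched in a few lines of code. The ratings and the choice of scikit-learn's LogisticRegression are illustrative assumptions, not the actual survey table or model from the video.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical survey data: rows are employees, columns are ratings for
# questions 1-7 (1 = very dissatisfied, 5 = very satisfied).
X = np.array([
    [4, 3, 5, 4, 4, 5, 4],
    [2, 4, 3, 3, 1, 2, 1],   # low salary ratings (questions 5-7)
    [5, 5, 4, 5, 4, 4, 5],
    [3, 2, 4, 2, 1, 1, 2],   # low salary ratings (questions 5-7)
    [4, 4, 4, 4, 5, 4, 4],
    [5, 4, 5, 5, 4, 5, 4],
])
# Known outcomes: 1 = employee quit, 0 = stayed.  Having these labels in
# the historical data is what makes the problem supervised.
y = np.array([0, 1, 0, 1, 0, 0])

model = LogisticRegression().fit(X, y)

# Predict the outcome for a new employee with low salary ratings.
new_employee = np.array([[3, 3, 4, 3, 1, 2, 1]])
print(model.predict(new_employee))
```

The model learns from the labeled observations and then assigns a label (quit or stay) to future, unlabeled ones.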

In unsupervised learning, the observations in the historical data are not labeled.

That is, we don't know if an observation belongs to one group or another.

This means that we don't know how many different groups there are in

the population from which the dataset originated.

Discovering the number of groups is therefore

one of the main outcomes of the analysis.
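One common way to discover the number of groups is to cluster the unlabeled data for several candidate group counts and compare a quality measure such as the average silhouette score. The synthetic data and the use of scikit-learn's KMeans below are a sketch of the idea, not the analysis described next.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical unlabeled data drawn from three well-separated groups;
# the analysis itself does not know the true number of groups.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.3, size=(50, 2)),
])

# Try several candidate group counts and keep the one with the best
# average silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the discovered number of groups
```

Here the number of groups is an output of the analysis rather than an input, which is the hallmark of unsupervised learning.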

For example, in a previous video,

we described how the market intelligence firm, Information Resources Incorporated,

conducted a cluster analysis of survey data to establish that the market of

natural and organic products consisted of seven distinct segments,

a number that was not known prior to the completion of the analysis.

Cluster analysis can also be applied to labeled historical data

with the purpose of finding new labels.

For example, in one study,

cluster analysis was used to categorize mutual funds based on

their financial characteristics instead of their investment objectives.

The historical data for

the study consisted of 904 different funds that fund managers had

classified into seven categories according to their investment objectives.

That is, the fund managers assigned a label to each fund and

decided there were seven possible labels.

However, a cluster analysis on financial variables related to the funds

concluded that there were only three distinct fund categories.

The reduction in the number of categories has significant benefits to

investors seeking to diversify their portfolios.

The study determined that the consolidated categories

were more informative about performance and

risk than the original seven categories created by the fund managers.

In terms of data to use, the analyst initially considered 28 financial

variables that were related to risk and return.

However, after applying principal component analysis, they found that 16 out

of the 28 variables were able to explain 98% of the variation in the dataset.

Therefore, they used only those 16 variables for clustering, which, as we already mentioned,

resulted in three fund categories.

This example shows that dimensionality reduction and

data reduction complement each other.

As a matter of fact, it is a common practice to apply dimensionality

reduction techniques such as PCA before clustering.
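As a minimal sketch of that practice, with synthetic data standing in for the study's 28 financial variables, one can ask PCA to keep just enough components to explain 98% of the variance and then cluster in the reduced space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical stand-in for the study's data: 200 funds measured on 28
# correlated variables generated from a few underlying factors.
factors = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 28))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 28))

# Keep enough principal components to explain 98% of the variance,
# mirroring the threshold used in the study.
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], "components retained")

# Cluster the funds in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))  # number of funds in each cluster
```

Running PCA first removes redundant, highly correlated variables, so the clustering works on a smaller, more informative set of dimensions.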