Let's start with the objectives and settings of unsupervised learning.

We said in week one that in unsupervised learning,

there is no teacher.

The objective is rather to make some sense out of data X that is given to you.

But what exactly does it mean to make sense out of data?

For example, if you just store the data,

does it mean that you made sense out of it?

The answer is no, because simply storing your data gives you neither insights,

nor actionable signals about your data.

Instead, your data should be compressed to form some sort of a meaningful representation.

This means in particular that,

if you think in terms of storing your data,

your learned representation should be more

compact in terms of the number of bytes you would use to store it.

Imagine for example that you want to store

the historical data on 50 years of daily returns,

for 5,000 US stocks,

it's quite a bit of data to store.

But now assume that all this data is

actually white noise with some mean and some covariance matrix.

In this case, there is no point in storing all the data,

it can all be stored in the mean and covariance.

You can always replicate the rest in one line of Python code.
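As a sketch of this point, here is what that replication might look like with numpy. The small sizes are illustrative assumptions, standing in for the lecture's 50 years of daily returns on 5,000 stocks.

```python
import numpy as np

# Illustrative sketch of the lecture's point: if the returns were pure white
# noise, storing the mean vector and covariance matrix would be a sufficient
# summary of the data. Small sizes are used here instead of 50 years x 5,000 stocks.
rng = np.random.default_rng(0)
n_stocks, n_days = 5, 4

mean = np.zeros(n_stocks)     # the compressed summary: the mean ...
cov = np.eye(n_stocks)        # ... and the covariance matrix

# The "one line of Python" that replicates a sample of the data on demand:
returns = rng.multivariate_normal(mean, cov, size=n_days)

print(returns.shape)          # (4, 5)
```

Everything beyond the mean and covariance is regenerated rather than stored, which is exactly the sense in which the summary is a compressed representation.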

On the other hand, assume that the mean of data changes every day,

so that it now carries some information about the changing state of the market,

while the covariance stays constant in time.

In this case, we would want to store the historical records for this changing mean,

a market mode if you wish,

alongside the fixed covariance matrix,

and discard the rest of the data as completely redundant, and so on.

As you see in each of these scenarios,

we compress the original data into something more compact and more meaningful.

Now, let's see a bit more specifically how this can be done in unsupervised learning.

First, let's start with a refinement of

the goal of learning that will be more suitable in the case of unsupervised learning.

We said that the general goal of learning is to generalize from data.

In the specific case of unsupervised learning,

this means that we want to find useful representations of data.

Now, there are multiple ways to learn

such data representations depending on the type of data and the task.

We can distinguish between four different ways

to build such representation in unsupervised learning.

The first one is called dimension reduction.

The general objective here is to reduce

an N-dimensional data vector to a vector in a space of lower dimension K,

while keeping most of the information in the data.

This is done either for data visualization,

or as an intermediate pre-processing step in order to

select a small set of features that are important for supervised learning.
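To make this concrete, here is a minimal sketch of dimension reduction on synthetic data, using scikit-learn's PCA (which the course covers later); the data and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative sketch on synthetic data (sizes are assumptions, not the
# course's data): reduce N = 10 dimensional vectors to K = 2 dimensions
# while keeping most of the variance in the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 200 samples in N = 10 dimensions
X[:, 0] *= 5.0                   # give one direction most of the variance

pca = PCA(n_components=2)        # target dimension K = 2
Z = pca.fit_transform(X)         # compact representation for plots or features

print(Z.shape)                              # (200, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```

The two-dimensional representation Z is what one would plot for visualization, or feed into a downstream supervised model as a small feature set.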

The second class of unsupervised learning is clustering.

In this case, the objective is to bucket all data into groups,

or clusters, such that

all points in one cluster are more similar

to each other than to points that belong to different clusters.

As we will discuss in more detail later,

regime change detection,

a very important topic in finance, can also be viewed as

a special case of clustering when you have only one cluster.
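A minimal clustering sketch, using K-means (also covered later in the course) on two synthetic, well separated groups of points; the data here is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy sketch on synthetic data: two well separated groups of points,
# each of which should end up in its own cluster.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, size=(50, 2))   # points around (0, 0)
group_b = rng.normal(loc=8.0, size=(50, 2))   # points around (8, 8)
X = np.vstack([group_a, group_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_                            # a cluster label for each point

# Points within a group are similar, so each group maps to a single cluster.
print(sorted(set(labels[:50])), sorted(set(labels[50:])))
```

The cluster labels are the learned representation here: each point is summarized by which bucket it falls into.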

The next type of unsupervised learning algorithm deals with density estimation.

These are probabilistic methods that aim at estimating

the probability density corresponding to the observed data.

Among these methods, kernel density estimation can be

mentioned as one of the most popular approaches.
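As a quick sketch of kernel density estimation, here scipy's `gaussian_kde` fits a density to synthetic samples; the data is an assumption for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Sketch with synthetic samples: fit a kernel density estimate and
# query the estimated probability density at two points.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

kde = gaussian_kde(samples)   # Gaussian kernels centered on the samples

# The estimated density near the mean should exceed the density in the tail.
print(kde(0.0)[0], kde(4.0)[0])
```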

And finally, the last type of unsupervised learning we'll be dealing with,

is prediction of sequences.

In this setting, you have data in the form of a sequence,

or a time series,

and you want to predict the future values for this sequence.

Now, this sounds very similar to time series predictions,

so why do we put this topic among the unsupervised learning algorithms?

Well, this is because we can't know the future,

so there can be no teacher to provide training pairs.

It's only if we make the additional assumption that the future will be similar to the past

that we can rely on past data in order to make predictions about the future.

But still conceptually, it's an unsupervised task.

In this first introductory course,

we will be mostly talking about the first two unsupervised learning tasks, namely,

dimension reduction and clustering.

The other two topics will be considered in detail in follow-up courses.

For your preview, I listed here some approaches that we will be discussing.

For dimension reduction tasks,

we will cover such topics as

principal component analysis and its non-linear extensions,

latent variable models, and various versions of autoencoders,

which are a special type of neural networks.

For clustering, we will be talking about K-means clustering,

probabilistic clustering, and hierarchical clustering.

For density estimation, we will cover such topics

as kernel density estimation and Gaussian mixtures.

And finally for sequence modeling,

we will be talking about Hidden Markov Models or HMM,

Linear Dynamic Systems, or LDS,

as well as about neural network models such as Recurrent Neural Networks.

Now, let's talk about general principles of unsupervised learning methods.

To this end, let's first start with supervised learning.

The process of training a supervised learning algorithm is shown in this diagram.

The training data is made of raw features X and labels Y.

Raw features X are first transformed into

features F of X that actually go into the algorithm.

This is called the feature extraction step.

Then, features F of X are passed to

a supervised learning algorithm that has adjustable parameters theta.

The training involves comparing the model outputs Y-hat with the true labels Y,

and model parameters are tuned by an optimization algorithm

to minimize the discrepancy between Y-hat and Y.

Now, let's see how this diagram changes in the setting of unsupervised learning.

The main idea here is that when we do not have Ys,

we use Xs instead.

Here you see a diagram of training for

a particular unsupervised learning algorithm called the autoencoder.

What it does first is

build a data representation in a feature extraction model.

In the case of an autoencoder,

we will call this step an encoder.

If the dimensionality of the encoded

representation is smaller than the dimensionality of the original data,

this step produces dimension reduction.

The second block, which was

our machine learning algorithm for supervised learning, now becomes a decoder.

The decoder takes the encoded representation,

and tries to reproduce the original data from this representation.

Let's call this model-based representation of the original data X-hat.

Now, the loss metric for this case would be obtained by

comparing our original Xs with the values of X-hat.

So, the more similar X-hat is to X,

the better the dimension reduction it produces.

You could wonder what can prevent an algorithm from simply memorizing all data,

so that the loss metric would be zero in this case.

And the answer to this question is that,

it's the model architecture that prevents it.

If the dimension of

the encoded representation is less than the dimension of the original data,

simply memorizing the data would not be possible.
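A minimal autoencoder sketch of this idea, on synthetic data; the use of scikit-learn's MLPRegressor and all sizes here are illustrative assumptions, not the course's own code.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Minimal autoencoder sketch on synthetic data. The 2-unit hidden layer
# plays the role of the encoder's bottleneck; the output layer acts as the
# decoder that tries to reproduce the input X itself.
rng = np.random.default_rng(0)
Z_true = rng.normal(size=(300, 2))          # hidden low-dimensional factors
X = Z_true @ rng.normal(size=(2, 6))        # observed 6-dimensional data

ae = MLPRegressor(hidden_layer_sizes=(2,),  # bottleneck K=2 < 6 blocks memorization
                  activation="identity",    # a linear autoencoder, for simplicity
                  solver="lbfgs", max_iter=2000, random_state=0)
ae.fit(X, X)                                # the training target is the input itself

X_hat = ae.predict(X)                       # model-based reconstruction X-hat
mse = np.mean((X - X_hat) ** 2)             # loss: discrepancy between X and X-hat
print(X_hat.shape, mse)
```

Because the bottleneck has only two units while the data has six dimensions, the network cannot simply copy X through; it must find a compact representation that still allows a good reconstruction.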

Other unsupervised learning algorithms can also be

conceptualized along the same kind of diagram.

For example, a similar diagram can be drawn for clustering as shown here.

The outputs of clustering are cluster labels,

showing the clusters to which each point in the data set belongs.

These outputs are compared with the data.

So, as you can see, the output of any unsupervised learning algorithm is always

some sort of transformation of the

initial, typically high-dimensional, data by some function.

This function is fitted to match some characteristics of the data,

while filtering out some other irrelevant characteristics.

In the next video,

we will start our discussion of

dimension reduction methods with the most popular method of this sort, namely,

principal component analysis, or PCA.