So let's start with something different now. So far we've only looked at single columns, or at vectors respectively. Now we want to learn something about the interaction of individual columns in our data. This is where covariance matrices and correlation come into play. Covariance and correlation tell us how two columns interact with each other. So for example, consider a table containing information about different buckets of water, with two columns called volume and weight. Most likely, if the volume increases, the weight increases as well. Therefore, the information content of one column decreases once you know the other one. If you have more than two columns in your data set and you want to get an overview of all correlations between all of them, the concept of covariance matrices will help you grasp all that information in a single table. Let's consider a single column of our data set, and say you want to calculate a measure of how dependent this column is on a second one. Let column 1 contain a list of integers, and column 2 as well. Note that column 2 just contains the double of the corresponding value of column 1. How can we find a single value telling us how similar these lists are? We need a measure of dependency between those two columns. This measure is called covariance, or, in its normalized form, correlation. If the two columns are perfectly correlated, the correlation is +1. In case the two columns show an inverse dependency, the correlation is -1. A practical example of such behavior is a car's mileage per gallon of fuel and its weight: the bigger the weight of the car, the lower the mileage. In case the two columns show no interaction at all, the correlation is 0. Covariance can be calculated very easily: cov(X, Y) = (1/N) · Σ (xᵢ − x̄)(yᵢ − ȳ). Note that X and Y are vectors, basically resembling the contents of a column in the table. As usual, we have a normalization term dependent on the size of the table, or of the vectors respectively.
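The covariance formula can be sketched in a few lines of plain Python. This is not the Spark code from the lecture; the two lists simply stand in for the two columns, assuming column 2 holds the double of each value in column 1 as described above:

```python
# Two lists simulating the two columns from the example:
# column 2 contains the double of each value in column 1.
x = list(range(100))      # column 1
y = [2 * v for v in x]    # column 2

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# cov(X, Y) = (1/N) * sum((x_i - mean_x) * (y_i - mean_y))
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
print(cov_xy)
```

Note that the covariance itself is not bounded to the range -1 to +1; that bounded, comparable measure is what the normalization to correlation (shown below) provides.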
One term subtracts the mean of vector x from each value of the vector x, whereas the other one does the same for the vector y. Finally, both values get multiplied, and the new list containing these products is summed up and normalized. So let's again use Apache Spark and the Data Science Experience to illustrate this. This is the first time we are using two lists to simulate two different columns of a data set. We need the mean value of both RDDs, so let's calculate it. Everything works as expected. Since we now want to access each entry of rddX and rddY at the same time, we have to zip them together. We obtain an RDD with tuple values instead of scalars, where each tuple borrows one scalar value from each of the source RDDs. Note that this somehow resembles the behavior of a table in a relational database. Now we are able to define operations on scalars coming from x and y at the same time. In order to access the tuple in the lambda function, we have to use x and y instead of x only. So let's subtract the mean from each value of x and from each value of y, multiply the results, and create the sum over the contents of the whole RDD. Again, we have to normalize using the total number of elements in the RDDs. Let's store this value in a variable for later usage. This is already quite nice, but we want to have values between -1 and 1. This is what correlation is defined as: it basically takes the covariance and divides it by the product of the standard deviations of the two columns. Since we need the standard deviations, let's take the code from the previous lecture and modify it. First we calculate the standard deviation of rddX. Let's add n as the total number of entries in either of the two RDDs. For rddY we can repeat the same step. We print the value. Finally, we can just calculate the correlation by dividing the covariance by the product of the standard deviations of each column. As expected, we obtained the value of +1.
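The steps above can be mirrored in plain Python, with two lists standing in for rddX and rddY and the built-in zip playing the role of the RDD zip function. The population form of the standard deviation is assumed, matching the 1/N normalization used throughout the lecture:

```python
import math

x = list(range(100))      # simulated rddX
y = [2 * v for v in x]    # simulated rddY
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Zip the two "RDDs" together so we can access both values at once,
# subtract the means, multiply, sum up, and normalize by n.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Standard deviation of each column (population form, 1/N).
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

# Correlation: covariance divided by the product of both standard deviations.
corr = cov_xy / (sd_x * sd_y)
print(corr)  # 1.0: the two columns are perfectly correlated
```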
This makes sense, since both RDDs contain the same data and are therefore perfectly correlated with each other. Now let's reverse the second list. We can see that we obtain a value of -1 for the correlation, which makes perfect sense, since the two lists have an inverse, anti-proportional relationship with each other. Let's see what happens if we use two random lists, which basically means that there is no correlation at all. As expected, the value for the correlation is very close to 0, basically indicating that there is no dependency between the two columns. These measures are highly useful because they tell us which columns are interesting and which ones we can basically ignore because they just follow the behavior of another one. But what if we want to get an overview of the correlations between all columns at once? For that, let's create a correlation matrix. Consider we have four columns in our data set. Column 2 contains two times the value of column 1, so the correlation is 1. Column 3 contains the inverse of column 1, so the correlation is -1. Column 4 contains random numbers with respect to the numbers of column 1, so the correlation is 0. Let's start to construct the matrix by putting the column names on the rows and columns of the matrix. The correlation between a column and itself is always 1, so let's put those values in first. Note that this creates a symmetric matrix, mirrored at the diagonal containing 1s. So between columns 1 and 2 the correlation is 1, so let's put that value into the matrix. And as the matrix is symmetric, let's put the value at the mirrored position as well. Between columns 1 and 3 the correlation is -1, so let's put it into the matrix as well. And of course, don't forget the mirrored value. What about columns 1 and 4? We've already seen before that the correlation is 0, since the values of column 4 are random with respect to column 1. Again, don't forget to update the mirrored entry. So what about columns 2 and 3?
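The three cases just described (+1, -1, and roughly 0) can be checked with a small plain-Python helper; the lists again simulate the RDD columns, and a fixed random seed is assumed purely to make the example reproducible:

```python
import math
import random

def correlation(x, y):
    """Population correlation between two equally sized lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

x = list(range(100))
r_inverse = correlation(x, list(reversed(x)))         # -1: inverse relationship
random.seed(0)                                        # fixed seed for reproducibility
r_random = correlation(x, [random.random() for _ in x])  # close to 0: no dependency
print(r_inverse, r_random)
```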
The correlation is -1, because column 3 also basically contains the inverse of column 2. So what about columns 2 and 4? Of course, the correlation is 0 as well. The same holds for columns 3 and 4, since an inverse list is also not correlated with a random list. This leads us to the following correlation matrix, which completes our task in the mathematical sense. But it contains a lot of redundancy, so let's first remove the mirrored values and then also remove the redundant reflexive relations on the diagonal. So now we have an overview of all relationships between all columns regarding their correlations. For example, we know that column 4 is highly informative because it doesn't correlate at all with any of the other columns. This is either an indication that it contains a lot of information, or that it is completely unrelated to the rest of the data. Here, experience and domain knowledge of the system under observation come into play. So again, let's see how to implement this as a parallel program in Apache Spark, so that you could actually run it on petabytes of data. Let's start with column 1. Again, this is just a list of numbers ranging from 0 to 99. Column 2 ranges from 100 to 199. Column 3 basically is a reversed list, ranging from 99 to 0. Remember, this gives us a negative correlation. The reversed function returns a lazy, non-serializable object; therefore, we have to materialize it using the list function in order to have it sent over the network. Finally, column 4 contains random values between 0 and 99. We achieve this by randomly sampling from the range 0 to 99, 100 times. Of course, we have to import the sample function first. So let's check if the code compiles. Okay, it looks nice. We could now actually write our own code on top of the RDD API in order to create a correlation matrix. But instead we will use MLlib, the distributed machine learning library of Apache Spark, which also uses the RDD API underneath.
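The four columns just described can be built like this; these are the plain Python lists that would later be parallelized into RDDs, including the materialization of reversed with list mentioned above:

```python
from random import sample

# The four columns from the lecture, as plain Python lists.
column1 = list(range(100))            # 0 .. 99
column2 = list(range(100, 200))       # 100 .. 199, perfectly correlated with column1
# reversed() returns a lazy, non-serializable object, so we materialize it
# with list() before Spark would have to ship it over the network.
column3 = list(reversed(range(100)))  # 99 .. 0, negatively correlated
column4 = sample(range(100), 100)     # 100 random draws from 0..99, roughly uncorrelated
print(column1[0], column2[0], column3[0])
```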
This means we have to import the library first. Let's check if the code compiles. Okay, fine, now we have four individual RDDs. But the library we want to use expects a more relational, table-like structure. We can use the zip function from the RDD API to merge two RDDs: it returns a new RDD where each element of the source RDDs is represented as a member of a tuple. Here you can see the result. This is an RDD containing tuples instead of scalar values. In fact, it contains the scalar values of the column 1 RDD as the first member of each tuple and the scalar values of the column 2 RDD as the second member. Let's add a third column. Now we get a somewhat nested tuple structure, which is not exactly what we want. But no worries, we'll take care of this later. Let's add column 4. Again, we increase the nesting, so let's get rid of it by flattening the nested tuples down to a single one. We can use the map function for this, in order to apply a function to every element of the RDD. This function takes as input a nested tuple corresponding to the nested tuple structure of the current RDD. The output of the function is a flat tuple containing the values from each of the RDDs. Now we are almost there. But we don't want to have tuples in our RDD; we want to have a list with four elements per RDD element. Therefore, we map again and apply a transform function to achieve this. So this is what we actually need. Doesn't it look a bit like a relational table? As always in data science, data preparation is more than 80% of the work. The actual application of the algorithm is just a single line, so we are done. This matrix looks exactly like the one from the lecture. As you have seen, covariance expresses the dependency between two columns in a data set. Correlation further normalizes it by the standard deviations of each column in order to create a comparable measure between -1 and 1.
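The zip-and-flatten pipeline can be sketched with plain Python lists, where the built-in zip mimics rdd1.zip(rdd2) and a list comprehension plays the role of the map calls; a deterministic stand-in is assumed for the random fourth column so the result is reproducible:

```python
column1 = list(range(100))
column2 = list(range(100, 200))
column3 = list(reversed(range(100)))
column4 = list(range(100))  # deterministic stand-in for the random column

# Zipping step by step, mirroring rdd1.zip(rdd2).zip(rdd3).zip(rdd4):
z12   = list(zip(column1, column2))  # [(c1, c2), ...]
z123  = list(zip(z12, column3))      # [((c1, c2), c3), ...] nested tuples
z1234 = list(zip(z123, column4))     # [(((c1, c2), c3), c4), ...] more nesting

# Flatten each nested tuple into one list of four values per row,
# as the map(lambda ...) calls do on the RDD.
table = [[c1, c2, c3, c4] for (((c1, c2), c3), c4) in z1234]
print(table[0])
```

In Spark itself, the resulting RDD of rows can then be passed to pyspark.mllib.stat.Statistics.corr, which returns the full 4×4 correlation matrix in one call.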
The correlation matrix is a very powerful tool to display the dependencies between all columns in one single view. So far we have only been able to see individual columns as vectors, or as columns in a table of a data set. In the next video you will learn that you can also see data as points in a multidimensional vector space. This will change your life, so stay tuned.