0:00

In this lesson we're going to talk about Pearson correlation,

which is oftentimes referred to as Pearson's r,

the Pearson product-moment correlation coefficient, or the bivariate correlation.

And it's a way to determine the correlation between bivariate data,

which means data that has two variables.

But what is correlation?

Well, correlation is a linear relationship, or lack thereof, between two variables.

And Pearson's r is a measure of the strength of that linear correlation.

So we have a nice little graph here to show you some different values for

Pearson's correlation.

Pearson's r can be between -1 and 1, inclusive.

So a negative value implies that there's a negative correlation between the two

variables.

If there is a positive value for Pearson's r,

then there's a positive correlation.

0:56

Then a Pearson's r value of 0 means that there

is no linear correlation between the two variables.

So here we have a perfect negative correlation of -1.

Here we have a perfect positive correlation of positive 1.

And here we have a negative correlation that's not perfect.

Here we have a positive correlation that's not perfect.

And then here, it's very clear that this set of data points

has no correlation whatsoever.

1:20

Okay, now that we have an understanding of what correlation is,

let's use some real data.

We're going to scroll down here and we're going to resolve some dependencies.

But then we're going to go ahead and connect to our MongoDB Atlas cluster.

And we're going to go ahead and use the movies data set.

And specifically, we're going to be building a pipeline here that's going to

be looking for movie ratings and the number of votes behind those ratings.

And we're going to try to determine if there is a correlation between

the number of votes and the actual rating that a movie has.

So in this pipeline we're going to use a match stage to make sure that we're

getting documents that have non-zero values for both ratings and votes.

And then we're going to go ahead and use the project stage to remove _id and

keep the two values that we care about.

And we are going to go ahead and rename them to rating and votes.
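
A pipeline along these lines might look like the sketch below. The transcript does not show the actual field names, so `imdb.rating` and `imdb.votes` are assumptions; using `$gt: 0` is also just one way to express "non-zero", chosen here because it additionally drops documents where the field is missing or not numeric.

```python
# Assumed field paths; adjust them to match the data set's actual schema.
RATING_FIELD = "imdb.rating"
VOTES_FIELD = "imdb.votes"

pipeline = [
    # Keep only documents where both fields hold a number greater than zero.
    {"$match": {RATING_FIELD: {"$gt": 0}, VOTES_FIELD: {"$gt": 0}}},
    # Drop _id and expose the two values under shorter names.
    {
        "$project": {
            "_id": 0,
            "rating": "$" + RATING_FIELD,
            "votes": "$" + VOTES_FIELD,
        }
    },
]

print(pipeline)
```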

Once we have our pipeline, we can go ahead and pass it to the aggregate command and

then turn the result into a list.

And then from that list we can go ahead and build a DataFrame,

using the from_dict function.

And now that our data is in a pandas DataFrame, we can go ahead and

take a peek at it.

And, as you can see, we now have our data in our DataFrame.

And from here we can go ahead and

use Seaborn's jointplot function to visualize the entirety of our results.

It's also going to go ahead and fit a regression line to our results as well.

And there we go. And it looks like we do have some

correlation.

You can see we have a Pearson's r value of 0.15.

And we can see,

just by looking at the data and without looking at the line of best fit,

that as a movie's rating increases, so does the number of votes.

So there seems to be a positive correlation, even though it's a weak

one, between the rating of a movie and

the number of votes that it received.

3:09

But let's go ahead and calculate Pearson's r by hand.

And this is the formula for

doing a single-pass calculation of Pearson correlation by hand.

There's also a multi-pass form.

But we're not going to cover that in this lesson,

because the single-pass form can actually be done in aggregation.
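
The formula shown on screen is not captured in the transcript. The standard single-pass form of Pearson's r, which agrees with the numerator and denominator steps described in this lesson, is given below, where n is the number of (x, y) pairs and every sum runs over all pairs:

```latex
r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}
         {\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,
          \sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}
```

It is called single-pass because all five sums can be accumulated in one scan of the data, without first computing the means.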

We're first going to go ahead and do this calculation in Python.

And then, after we have seen how it's done in Python,

we're then going to go ahead and see how it can be done in aggregation.

So there's a bit of groundwork that needs to be done before we can go ahead and

calculate Pearson's r.

Basically, we're going to go ahead and go through here and find each of these terms.

For every value of x we're going to go ahead and subtract the mean from it.

We're going to do the same thing for y.

And then, for these pairs of values, we're going to go ahead and

multiply them together.

And then we're going to go ahead and use those differences again from above,

and we'll calculate their squares.

And then, once we have all these different values,

we can combine them to evaluate this formula.

The first thing I'm going to do is go ahead and

make a copy of our original data frame.

I'm going to call it exm.

4:14

So the first thing we're going to do is calculate the mean of x and the mean of y.

So it's as simple as taking the sum and dividing by the total number.

We're going to store this in m_x and m_y.

And there you can see our mean for x is 6.3, so our average rating for

a movie is 6.3.

And then we have the mean for y, which would be our average number of votes,

which is about 11,700.

We can now go ahead and calculate little x and little y,

as well as xy, and x squared, and y squared.

So here we're going to go ahead and map over all the values of x,

subtracting the mean.

We'll do the same thing for y.

We're then going to zip up our ratings and votes together and map over them.

And then multiply every pair together.

We're going to call that xy.

5:04

We're then going to square every value for x and

y by mapping over all of those values.

And then we have x, y, xy, x squared, and y squared.

And then we're going to go ahead and

assign all these values into our data frame.

Now let's go ahead and take a look and see what that looks like.

5:23

And as you can see, we now have a nice little data frame where we have

our original ratings, our original votes.

And then, for every one, we have an x value, a y value, an xy, an x squared,

and a y squared.
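
The per-row terms can be sketched in plain Python as follows. This is an illustration rather than the lesson's notebook code: the sample values are made up, the rows are plain dictionaries instead of a data frame, and it assumes that x and y hold the deviations from the mean while xy, x2, and y2 are built from the raw ratings and votes. The transcript does not say which values are squared, so that last part is a guess.

```python
ratings = [6.0, 7.0, 8.0]      # made-up sample ratings
votes = [100.0, 200.0, 600.0]  # made-up sample vote counts

n = len(ratings)
m_x = sum(ratings) / n  # mean rating
m_y = sum(votes) / n    # mean number of votes

rows = [
    {
        "rating": r,
        "votes": v,
        "x": r - m_x,  # "little x": rating minus the mean rating
        "y": v - m_y,  # "little y": votes minus the mean vote count
        "xy": r * v,   # product of the raw pair (assumption)
        "x2": r * r,   # square of the raw rating (assumption)
        "y2": v * v,   # square of the raw vote count (assumption)
    }
    for r, v in zip(ratings, votes)
]

for row in rows:
    print(row)
```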

5:36

Now that we have our data frame, we can go ahead and dive into the equation itself.

First we're going to just focus on the numerator.

We're going to call this top.

We're going to begin by multiplying the number of elements,

which we've got up here, by the sum of all of those xy products right here.

So we're just going to multiply those two together.

And now we have the product of those two stored in this variable.

Next we're going to go ahead and sum up all the x values and the y values, so

all of the ratings and all the votes.

Multiply those two together and store the result in that variable.

And then finally we're just going to take the difference between those two and

we're going to call that top.

And that's a very large number.

6:14

Now, let's go ahead and focus on the bottom part of our equation.

And for the moment we're going to ignore the square roots and

we're also going to divide it into a left part and a right part.

So here, we're focusing on the left part.

And first we're going to multiply the number of elements by the sum of

the squares of x, and we're going to call that product_sum_x2_elements.

And then we'll go ahead and subtract the square of the sum of x, or ratings,

from that.

And that will be our bottom left.

We can now go ahead and focus on the right-hand part of our denominator.

And this is very similar to the left-hand side,

but now we're concerned with y instead of x.

So we're going to do the same thing,

we're going to multiply the number of elements by the sum of the squares of y.

And we're then going to take that and

subtract the square of the sum of y from that.

And we're going to take a shortcut here: we're just going to take the square root

of the bottom left times the bottom right.

And that'll be our denominator.

And then, finding Pearson's r is as simple as dividing the top by the bottom.
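
Put together, the numerator and denominator steps amount to something like the following plain-Python sketch. The function name and the sample data are illustrative, not taken from the lesson's notebook, and the sums here use the raw x and y values.

```python
from math import sqrt


def pearson_single_pass(xs, ys):
    """Pearson's r from sums that can all be accumulated in one pass."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Numerator: n * sum(xy) - sum(x) * sum(y)
    top = n * sum_xy - sum_x * sum_y
    # Denominator parts: n * sum(x^2) - (sum x)^2, and the same for y
    bottom_left = n * sum_x2 - sum_x ** 2
    bottom_right = n * sum_y2 - sum_y ** 2
    return top / sqrt(bottom_left * bottom_right)


r = pearson_single_pass([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))  # 0.7746
```

Note that this sketch raises `ZeroDivisionError` if either variable is constant, because the corresponding denominator term is then zero.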

7:16

And we get 0.1464.

Let's go ahead and compare this with the pearsonr function from SciPy.

And as you can see, we get the same number, 0.146,

which with some rounding matches the 0.15

that we got with Seaborn.

Both methods work: doing it in Seaborn and

doing it by hand.

But they're both being done in Python, which is slower than it needs to be.

Not only is it slow, but

we're also transmitting a lot of data from MongoDB and sending it here to the client.

All that data could just be processed directly on our MongoDB cluster,

reducing the need for transferring data and doing this analysis in Python.

To remedy this, we're going to use MongoDB's Aggregation Framework.

Let's see how.

First things first, we're going to go ahead and create aliases for our two values, x and

y, just so we're speaking in the same terms as before.

And then, just like before, we're going to go ahead and

figure out the number of elements we have.

We're going to sum up the x's, sum up the y's, sum up the squares of x and y.

And sum up the products of x and y.

We're going to go ahead and insert these into a group stage and

then assign it to a variable called all_sums.
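
A group stage of that shape could be written as the Python dictionary below. The field paths behind the `x` and `y` aliases and the output names (`count`, `sum_x`, and so on) are assumptions, since the transcript does not show them. Grouping on `_id: None` folds every incoming document into a single result document.

```python
# Aliases for the two values; the field paths are assumed, not from the lesson.
x = "$imdb.rating"
y = "$imdb.votes"

all_sums = {
    "$group": {
        "_id": None,                                   # one group for all documents
        "count": {"$sum": 1},                          # number of elements, n
        "sum_x": {"$sum": x},                          # sum of x
        "sum_y": {"$sum": y},                          # sum of y
        "sum_x2": {"$sum": {"$multiply": [x, x]}},     # sum of x squared
        "sum_y2": {"$sum": {"$multiply": [y, y]}},     # sum of y squared
        "sum_xy": {"$sum": {"$multiply": [x, y]}},     # sum of x times y
    }
}

print(all_sums)
```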

Next we're going to go ahead and assemble the top part of the equation.

Aside from using aggregation syntax, it's identical to what we did above.

8:39

And similarly, for the denominator, assembling the left and

the right side is exactly the same as what we did above, but now just in aggregation.

And like before, assembling our bottom is as simple as multiplying the left and

right together and taking the square root.

We're then going to go ahead and project out the correlation,

calling it m, just by dividing the top by the bottom.
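
In aggregation syntax, those pieces might be assembled as in the sketch below. It assumes a preceding group stage has produced fields named `count`, `sum_x`, `sum_y`, `sum_x2`, `sum_y2`, and `sum_xy`; those names are hypothetical. The operators used (`$subtract`, `$multiply`, `$sqrt`, `$divide`) are standard aggregation expression operators.

```python
# Numerator: n * sum(xy) - sum(x) * sum(y)
top = {
    "$subtract": [
        {"$multiply": ["$count", "$sum_xy"]},
        {"$multiply": ["$sum_x", "$sum_y"]},
    ]
}

# Left denominator term: n * sum(x^2) - (sum x)^2
bottom_left = {
    "$subtract": [
        {"$multiply": ["$count", "$sum_x2"]},
        {"$multiply": ["$sum_x", "$sum_x"]},
    ]
}

# Right denominator term: n * sum(y^2) - (sum y)^2
bottom_right = {
    "$subtract": [
        {"$multiply": ["$count", "$sum_y2"]},
        {"$multiply": ["$sum_y", "$sum_y"]},
    ]
}

# Denominator: square root of the product of the two terms
bottom = {"$sqrt": {"$multiply": [bottom_left, bottom_right]}}

# Final stage: expose the correlation as m
correlation = {"$project": {"_id": 0, "m": {"$divide": [top, bottom]}}}

print(correlation)
```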

We can now go ahead and assemble all of our stages together.

We are going to go ahead and do a match, like before.

We're going to go ahead and get all of our sums and

finally calculate our correlation.

9:12

Now that we've assembled our pipeline,

we can go ahead and execute it by using the aggregate command.
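
Executing such a pipeline and reading back the single result could look like the sketch below. The helper name is hypothetical, and `FakeCollection` is only a stand-in so the example runs without a cluster; with pymongo, a real collection's `aggregate` method would be called the same way and the stages would run on the server.

```python
def server_side_correlation(collection, pipeline, field="m"):
    """Run the pipeline and return the one computed value, or None if empty."""
    result = next(iter(collection.aggregate(pipeline)), None)
    return None if result is None else result[field]


class FakeCollection:
    """Stand-in for a pymongo collection so this sketch runs offline."""

    def __init__(self, results):
        self._results = results

    def aggregate(self, pipeline):
        # A real collection would execute the stages on the MongoDB server.
        return iter(self._results)


r = server_side_correlation(FakeCollection([{"m": 0.1464}]), pipeline=[])
print(r)  # 0.1464
```

Only one small document travels back to the client, regardless of how many movies the pipeline scanned.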

And we're going to go ahead and

compare it against the other values that we've calculated.

And, great, we got the same result from all three methods.

The major difference here is that we didn't need to marshal any data into

a data frame and we were able to have the entire calculation executed server-side

with MongoDB.

And that's how we calculate Pearson correlation in MongoDB.