0:04

So we've looked at the histogram as a way of visually exploring our data.

Â If you wanted to summarize the data numerically,

Â let's talk about first the measures of central tendency

Â that we can use to to explain the data.

Â So we can we often look at the mean or the average of the data.

Â Another metric that you might use is the median or what is the middle value.

Â That is if you rank order your data from highest to lowest,

Â one observation falls in the middle of that data such that 50 percent

Â of observations are greater than that 50 percent of observations

Â are going to be less than or equal to that.

Â So we'll look at both of those commands both mathematically and within Excel.

Â There are also measures of dispersions that we can use to summarize the data.

Â The most common probably being looking at the standard deviation

Â or the variance that's observed in the data

Â that's giving us a measure of dispersion from the mean.

Â So from the average value, you might also look at the range

Â or the interquartile range to get a sense for how much variation is there in your data.

Â So these are the formulas that we can both the mathematical expression

Â for calculating the average as well as how we can derive

Â that using Excel the command equals the average

Â and then you would enter the cell range for where your data resides.

Â What Excel is going to do?

Â It's going to add up all of those values in that range,

Â divide by the number of observations to come up with the average.

Â The median if you wanted to calculate

Â that equals median and then the highlighted data range.

Â One thing to note and we'll take a look at this an example to customer value

Â in the second is that the median is not going to be sensitive to the extreme values.

Â So if you've got a lot of variation in the value of your customer some are really high,

Â but most customers are really low.

Â When you calculate the average that's going to be highly sensitive

Â to those extreme observations.

Â The median is not going to be sensitive because it doesn't care what the exact values are.

Â It just cares about finding the point where 50 percent

Â are above 50 percent are below that value.

Â Doesn't matter how far above or how far below

Â all that it's concerned with is the rank order.

Â All right.

Â So here's an example that's been where we've done some work looking at the customer base

Â of new customers signing up for telecom service

Â and then we calculate their lifetime value,

Â techniques that we'll look at when we examine customer analytics

Â with regards to studying, retention, and driving customer value,

Â really the story that we get depends on the metrics that we're using.

Â If we look at the average customer lifetime value based on this histogram turns out

Â that it's around $1200.

Â But if we look at the median customer lifetime value

Â that is you know the point at which half of my customers are going to be worth

Â less than this amount, half my customers are going to be more worth more than this amount.

Â Turns out that it's only $800.

Â That difference between the median

Â and the mean indicative of having a skewed relationship.

Â In this case, the data is going to be right skewed

Â if you look at our distribution as we go from left to right from low value to high value,

Â we see fewer and fewer observations.

Â The data seems to fall off.

Â So that's indicative of that right skewed relationship.

Â So which customers would we want to target?

Â Well, if if there was a way for us to identify the customers

Â who have that high lifetime value it'd be great for us to focus our efforts

Â on these customers but we need some way of identifying those individuals.

Â Maybe it's on the basis of demographic variables,

Â maybe it happens to be the region that they're in, maybe it's the method

Â of acquisition so if we have additional variables

Â that tell us something about the individuals

Â who have that high lifetime value then we can make an informed decision

Â of do we want to focus on those individuals where they are very valuable

Â but there aren't that many of them?

Â Or do we want to focus on customers who may not be as valuable to us,

Â but maybe it's easier for us to acquire them and there are a lot more of those customers.

Â So that's what we're going to look at when we talk about the customer analytics.

Â How hard is it to acquire these different customer groups?

Â What's the difference in value going to be

Â and what are the levers that we can pull to try to attract them?

Â All right.

Â Looking at dispersion, common measure is going to be the variance.

Â So I've given you here the mathematical derivation for the variance

Â as well as the Excel notation equals var.

Â And then highlighting the data range and you'll see

Â that in the derivation for the variance,

Â what we're really focused on is the key here is this piece.

Â It's the observation minus the average.

Â Now the reason that it's squared is we really don't care how whether it's a positive

Â or negative deviation from the average or from the mean.

Â So what we're concerned about is the magnitude of that deviation.

Â So the bigger the more dispersion there is around the mean

Â the higher the variance is going to be.

Â Variance can also be reported in terms of the standard deviation.

Â So stdev which is really just the square root of the variance.

Â Other formulas that you might use to cut to understand

Â how much dispersion is in your data we can calculate the range.

Â So what's the largest value minus the smallest value.

Â So here we can use the max and min commands to calculate just the size of that range.

Â You might also look at what's referred to as the IQR where the interquartile range.

Â So this is the range that contains the middle 50 percent of the data.

Â There's not a built in command to say give me the IQR

Â so we can do is use the percentile command.

Â And so what we're doing in this equation is say let's extract from the data

Â what value is that the 75th percentile and we'll subtract from that

Â the value that's at the 25th percentile that difference giving us the interquartile range.

Â So these are methods for summarizing data both in terms of the average,

Â the median, the central tendency and in terms of dispersion.

Â One thing that we always have to contend with are outliers in our data.

Â Those extreme observations that can heavily influence our data.

Â In some cases those outliers might be caused by coding errors.

Â You know sometimes in data you'll see that the code 9 9 9 9

Â is used to refer to data that's missing.

Â Well, we actually have to remove that before we conduct our analysis

Â because otherwise it's going to skew things.

Â So when we detect those outliers it may be a reason to dig in further

Â and understand what's going on.

Â It is possible that there are just mistakes

Â but any assumptions that you're making about how you deal with those outliers those

Â are conclusions that should be noted in any report that you're producing.

Â Another aspect of quantitative data might be time series or a temporal dimension.

Â So these are time series plots that were produced using television tuning data.

Â So data collected from set top boxes for two television programs

Â in the- I believe early 2000s when we're collecting the data America's Next Top Model

Â and Pussycat Dolls Presents Girlicious,

Â the dips in the audience level or the dips in the ratings

Â caused by commercial breaks so the audience staying tuned to the program,

Â commercial break comes on, people tune away, come back when the program resumes.

Â Well that's a characteristic of this type of data.

Â Audience levels decline during commercial breaks.

Â So we were in any type of analysis,

Â that's something that we are going to need to control for.

Â Another popular time series data is stock performance data.

Â So we can see how do stocks or do indices tend to move together

Â or are they tending to move apart.

Â Well getting to looking at time series data

Â and building forecasting models in a little bit.

Â But just to foreshadow the direction we're going in,

Â one component that we're going to look for is the trend in the data.

Â So for new product that's launched as time goes on sales may ramp up.

Â Something else that we might also look at is a cyclical pattern.

Â Is it related to seasonality?

Â Is there a pattern in here related to the general state of the economy?

Â These are factors that we need to do that we need to know about

Â if they're influencing in this case sales

Â because if we're going to be making forecasts going forward those are variables

Â that have to be included in the predictions that we're making.

Â All right.

Â We've looked already at how do we examine relationships

Â that may exist among categorical variables.

Â Well we want to do the same thing when we look at relationships

Â that exist among quantitative variables.

Â A scatterplot is the common visual way of doing that.

Â So in this particular case, comparing the daily returns

Â for the S&P 500 with the Verizon stock.

Â So on the x axis it's the daily return for the S&P,

Â on the y axis the daily return for a Verizon.

Â And you'll see from the trend line that's been layered on here

Â that there is generally a positive relationship.

Â On days when the S&P went up, Verizon also tended to go up.

Â So we can look at the direction of the relationship.

Â In this case, we fit a linear curve a linear trend to that may not always be linear.

Â That's one thing that we're going to keep in mind.

Â We're also going to look at how much dispersion there is.

Â There seems to be a lot of dispersion think of around this line.

Â So from this point on all the way down to this point, a lot of dispersion

Â that we're seeing in that scatterplot means

Â that if I wanted to know what the Verizon price was going to be on a given day,

Â well I can't just use the S&P because it seems like that's not explaining everything.

Â When we get into talking about linear regression

Â that's one of the pieces that that's going to be key for us.

Â It's trying to reduce the amount of uncertainty that we have.

Â The more information we have the better fit we should see.

Â If we're looking at trying to quantify the relationship between two variables,

Â a common technique that we're going to use is to look at the correlation.

Â What the correlation does is assess the linear relationship and that's going to be key

Â because not all relationships are going to be linear, as I'll show you in a second,

Â sometimes that correlation is going to be indicative of the nature of the relationship.

Â Other times, it might not be that informative.

Â So we really have to conduct exploratory analysis to understand the patterns

Â that exist between quantitative variables.

Â Within Excel, equals correl, that's going to be the command

Â that we use to calculate correlation, highlight the x range in the data,

Â then we highlight the y range in the data.

Â Those are going to be separated by a comma

Â and all that we're going to get as output is a number between negative 1 and 1.

Â So negative 1 indicating a perfect negative linear relationship,

Â positive one indicating a perfect positive linear relationship.

Â Mathematically and this is something that I hope you never have to calculate

Â by hand but this is the mathematical derivation behind the correlation coefficient

Â and the piece that I want to just draw your attention to just to give you some intuition

Â for what this number really means is going to be this set of terms.

Â So what we're doing is for each observation we're calculating

Â the deviation from the average x value

Â and we're calculating the deviation from the average y value

Â and multiplying them together adding that piece up.

Â Well, what does the positive correlation mean?

Â Well a positive correlation means that when the x value is above average as we see

Â which will be given by this first term the y value also tends to be above average.

Â Both of them are positive.

Â When the value is below the average, it's going to be negative.

Â When the y value is below its average value that's going to be negative.

Â The product of the two giving us a positive relationship.

Â If we have a negative relationship when X is above its average value,

Â Y is below its average value and vice versa.

Â So that's that's where the correlation concept is coming from.

Â So one you know we can look at this in terms

Â of the financial performance so I pulled information.

Â This is data from 2010 on the stock performance of some telecom companies.

Â And if we look at Verizon, Comcast, and AT&T positively correlated with each other,

Â also positively correlated with the S&P 500.

Â So probably no surprise that these finance these stocks

Â and the S&P 500 generally move in the same direction.

Â Right.

Â So this is intended to give you a sense for the extent of dispersion

Â that you might see depending on the level of correlation,

Â the middle plot that has an r value of 0.9, you know, very high correlation there,

Â very tight cluster looks like pretty much a straight line whereas

Â when we have a lower correlation coefficient the 0.4 on the right or the

Â or the -0.7, you see there's more dispersion there,

Â so there still is a linear relationship.

Â If I were to superimpose a trend line, we would capture a negative relationship.

Â But the fact that there is a lower correlation coefficient time

Â is that there is going to be more dispersion there.

Â So these are all examples when the data actually suggests a linear relationship.

Â But as I said earlier it's not always going to be a linear relationship

Â and that's where we need to be careful.

Â So one example of how the correlation can be misleading

Â is that we have outliers in our data.

Â We've got a massive points in this case near the origin, near the lower left

Â and then we've got a single outlier.

Â Well if we try to draw best fitting line because that outlier

Â is going to influence the trajectory of that line,

Â we're actually going to be told that you know

Â what there is a strong linear relationship here,

Â but in reality, it's that one outlier that's doing it.

Â If we were to discard that and just look at that massive points,

Â we would not get nearly as strong a linear relationship.

Â The other piece that we have to be careful about

Â is when we're dealing with non-linearities.

Â So in this case we have if we look at the lower right we have a correlation of zero.

Â Well that doesn't mean that there's no relationship between x and y,

Â what it means is that there is not a linear relationship between x and y.

Â You know this relationship is a negative quadratic relationship.

Â So, y is related to x squared.

Â Well, it's a very strong quadratic relationship but it's not a linear relationship.

Â That's why we're getting the correlation of zero.

Â So doing this visual inspection using scatter plots,

Â using histograms a very important part of the data analysis that we need to do.

Â If we were to just look at the correlation coefficient,

Â we see that there's no linear relationship, we might erroneously conclude

Â that there's no relationship at all.

Â All right.

Â So to recap the techniques that we've looked at we've talked about techniques

Â for examining individual quantitative variables.

Â We've also looked at techniques for examining the relationships

Â among the quantitative variables and correlation and scatterplots good for looking

Â at the relationship among these two quantitative variables.

Â We'll come back to examining these relationships more formally

Â when we start getting into prediction and we start using linear regression

Â as one of our go to tools in that forecasting.

Â