0:11

Welcome to the introduction to statistical forecasting module.

During this section of the course, we will explore fundamental statistical methods

which are useful in using data to develop forward expectations, or forecasts.

Course participants are assumed to have had some previous exposure to statistics,

though we will provide reference materials for the concepts presented in this module.

We will explain the statistical concepts employed in the following analyses,

but encourage participants to further their study independently.

We will spend a bit longer introducing concepts in this module than we have in

the others.

Given the statistical nature of these concepts, it's important that participants

spend time to understand the statistical methods employed in this module,

as they are powerful, but nuanced.

In order to use data to produce a statistical forecast,

we need to understand regression analysis.

Regression analysis is one of the most commonly used statistical methods to

produce data-driven forecasts.

Simple linear regression analysis uses one variable,

the independent variable, to explain another variable, the dependent variable.

For example, you might use a person's height to explain and

predict a person's weight.

A thorough explanation of linear regression is beyond the scope of

this module, but we encourage course participants to study this powerful

statistical method, or to take one of the many related courses on Coursera.

There are a few concepts that you should understand, at least at a high level,

before we proceed to their application in our Excel problem sets.

We encourage you to spend time studying these concepts independently

if you are unfamiliar.

Standard deviation is a measurement of the average dispersion of values in

a data set around their average value.

That is, how spread out the data are from their average value.

This is related to variance, but

it is more frequently used to describe the average dispersion of a data set.

It follows that higher variance and standard deviation should

imply lower confidence in the outputs of a statistical forecast using the data.

2:27

Variance, as previously mentioned, also measures how far

on average a set of data values are spread out from their average, or mean.

Higher variance in your data should result in you being less confident in

the accuracy of your prediction, because your data are so

widely spread out around their average value.

Again, variance has the same implication as standard deviation;

it is simply the square of the standard deviation, which amplifies dispersion from the mean value.
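A quick sketch of these two measures using Python's standard library; the data values below are invented purely for illustration:

```python
import statistics

# Hypothetical data set (values invented for illustration)
data = [10, 12, 9, 11, 8, 20, 10]

mean = statistics.mean(data)
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

# Variance is simply the square of the standard deviation
assert abs(variance - std_dev ** 2) < 1e-9

print(f"mean={mean:.2f}, variance={variance:.2f}, std dev={std_dev:.2f}")
```

The single outlying value of 20 pulls the variance and standard deviation up noticeably, which is exactly the kind of dispersion these measures capture.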

Covariance is a measure of how two variables change together.

2:59

Covariance is not normalized, meaning that there's no meaningful way to compare

covariances across different variables.

We need another measurement to compare the way two variables change together

in order to draw meaningful conclusions.

3:13

Correlation provides this normalized measure of covariance, that is,

of how two variables change together.

Normalization of covariance results in a measure which we can use to meaningfully

compare how two variables move together.

3:28

This is a normalized measure, and so it results in a value between -1 and

1, and gives an objective indication of the relationship between two variables.

It also tells us the direction of that relationship, positive or negative.

Values close to zero indicate that the relationship is not very strong.

R-squared, or

the coefficient of determination, is a number that indicates the proportion

of the variance in one variable that is predictable from the other variable.

A higher R-squared value indicates a better fit of our statistical

measurement of the relationship between the variables to the data itself.

The definition above is important to note:

it has a similar interpretation to correlation,

though, since it is squared, the direction of the relationship cannot be determined.

4:23

Let's have a high level overview now of linear regression.

Using linear regression we can quantify the relationship

between changes in the independent, or input, variable and

changes in the dependent, or outcome, variable.

For example, let's look at the relationship between the variables Y,

X, m, and B as shown on the slide below.

We see here Y = mX + B. This relationship could be read as:

Y is equal to m multiplied by X, plus B.

You may recognize this as the classic slope-intercept form of the equation for

a straight line, as we are all taught in algebra.

Simple linear regression analysis of a dataset

may result in a similar quantified relationship for that dataset.

For example, our regression analysis may find that m = 3 and B = 100.

Using this quantified relationship,

we can input values of X to predict values of Y which we don't have in our dataset.

For example, if we input a value of 100 for X,

we can use this quantified relationship to produce a value of 400 for Y.
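The arithmetic from this example, written out as a tiny Python function:

```python
def predict(x, m=3, b=100):
    """Predict Y from X using the fitted line Y = mX + B."""
    return m * x + b

# With m = 3 and B = 100, an input of X = 100 gives Y = 3 * 100 + 100 = 400
print(predict(100))  # → 400
```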

We will explore regression analysis in a simplified example.

First though, we must develop a thesis, or hypothesis, for

our forecasting relationship.

A few additional statistical forecasting concepts are important to

understand, including the Y-intercept,

which is the point where the graph of a function, or in this case,

the graph of the relationship between our two variables, intersects the Y-axis.

This is the portion of the dependent variable which cannot be explained by

our regression analysis, and which remains constant regardless of the measured

relationship between the two variables.

It also represents the value of our dependent variable

when the independent variable is equal to zero.

6:45

We show two examples of two data sets with different standard

deviation, variance, correlation, and R-squared values.

Notice that higher standard deviation and variance result in

a much more dispersed, or spread out, data set around the mean value,

while lower standard deviation and

variance result in a more tightly dispersed data set around the mean.

Let's discuss our specific example.

It will be simple, exploring the use of regression analysis to predict

visits to a website based on the number of social media mentions of the site.

Our thesis here is that it's reasonable to think that as mentions

of a website on social media increase,

the number of people who visit the site will also increase, and

result in increased web traffic, or page hits, to the site.

Starting with a thesis like this is fundamental to regression analysis.

7:45

This statistical method can be used to determine if

there is a correlative relationship between two variables.

In our case, the relationship between the number of social media mentions,

the independent variable, and visits to our website,

the dependent variable.

We have measures of the strength of this relationship which we will discuss later.

If these measures indicate a strong relationship, we can conclude

that social media mentions and visits to our website are related.

A strong relationship detected via linear regression analysis does not,

however, imply a causal relationship between the two variables.

Said differently, it does not imply that traffic

increased to our website because of social media mentions,

but rather that web traffic to our website

tends to increase with increases in social media mentions.

This is an important distinction,

though causal analysis is beyond the scope of this module.

In order to complete the exercises in this week's problem set,

you'll need to enable the Analysis ToolPak in Excel.

We've included a link to instructions on how to do this in the reference materials.

Please take a moment to ensure that you have the ToolPak enabled

before continuing.

You'll know that you've successfully enabled the ToolPak if you see it on

the Data tab of the ribbon, as in our image below.
