0:00

This lecture is about forecasting, which is a very specific kind of prediction problem, and it's typically applied to things like time series data. So, for example, this is the stock information for Google on the NASDAQ, under the symbol GOOG. You can see that over time there's a price for this stock, and it goes up and down. This introduces some very specific kinds of dependence structure and some additional challenges that must be taken into account when performing prediction. First of all, the data are dependent over time, and that alone makes prediction a little bit more challenging than it is when you have independent examples.

There are also some specific pattern types that you should pay attention to. Trends, such as long-term increases or decreases, and seasonal patterns are very common in this kind of data, for example seasonal patterns over weeks, months, or years. There are also cycles, which are patterns that rise and fall over a period that isn't fixed and is often longer than a year.

Here, the subsampling into training and test sets can be a little bit more complicated, because you can't just randomly assign samples to training and test. You have to take into account the fact that specific times are being sampled and that points are dependent in time.

1:10

Similar issues arise in predictions with spatial data. For example, there's dependency between nearby observations, and there may be location-specific effects that have to be modeled when doing prediction.

1:23

Typically, the goal here is to predict one or more observations into the future. All standard prediction algorithms can be used, but you have to be a little bit cautious about how you use them.

1:33

So, one thing to be aware of is spurious correlations. Time series can often be correlated for reasons that do not make them good for predicting one from the other. For example, you can go to Google Correlate to correlate the frequency of different words over time. Here you can see a correlation between the Google stock price, shown in blue, and "solitaire network," shown in red. Those don't necessarily have anything to do with each other at all, but they have a very high correlation. You might think you could predict one from the other, even though in the future they might diverge substantially, because they aren't necessarily related to each other at all.
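The spurious correlation point can be illustrated with a minimal R sketch. The two series below are synthetic (invented for this example, not the Google Correlate data from the slide): they share nothing except an upward trend, yet their raw levels are almost perfectly correlated, while their period-to-period changes are not.

```r
# Two synthetic series that share nothing except an upward trend.
# All numbers here are made up for illustration.
set.seed(1)
n <- 100
time_index <- seq_len(n)
series_a <- 1 * time_index + rnorm(n)  # trend plus its own noise
series_b <- 2 * time_index + rnorm(n)  # different trend, separate noise

# Correlation of the raw levels is driven almost entirely by the shared trend.
cor_levels <- cor(series_a, series_b)

# Differencing removes the linear trend; what is left is unrelated noise.
cor_changes <- cor(diff(series_a), diff(series_b))

print(c(levels = cor_levels, changes = cor_changes))
```

Comparing the correlation of the changes against the correlation of the levels is one simple check, under these assumptions, for whether an apparent relationship is only a shared trend.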

2:12

This is also very common in geographic analysis. This is a cartoon from xkcd that shows that heat maps, particularly population-based heat maps, have very similar shapes because of where many people live. So, for example, the users of a particular site, the subscribers to a particular magazine, or the consumers of a particular type of website may all appear in very similar places, because the highest population density in the United States is over here on the Eastern Seaboard. And so you see very similar heat maps of a large number of individuals in all of those different places.

You should also beware of extrapolation. This is kind of a funny example that shows what happens if you extrapolate a time series out without being careful about what could happen. This shows, on a long time scale, the winning times of a large number of races that occurred at the Olympics. The blue times are men and the red times are women. The authors of this paper extrapolated out into the future and said that 2156 would be when women would run faster than men in the sprint. While we don't know when that may or may not occur, one thing that was pointed out is that this kind of extrapolation is very dangerous. Eventually, at some time in the future, both men and women would be predicted to run negative times for the 100 meters. So you have to be very careful about how far out you extrapolate from your data.
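The danger of extrapolating a straight line can be sketched in a few lines of R. The winning times below are invented (they are not real Olympic results and not the data from the paper); they simply improve by a fixed amount each year, so a linear fit must eventually cross zero.

```r
# Made-up winning times (seconds) that improve by 0.01 s per year.
# These are NOT real Olympic results; they only mimic a downward linear trend.
year <- seq(1900, 2000, by = 4)
time_sec <- 12 - 0.01 * (year - 1900)

fit <- lm(time_sec ~ year)

# Just beyond the observed range the line still looks plausible...
near <- predict(fit, newdata = data.frame(year = 2020))
# ...but far enough out it predicts an impossible negative time.
far <- predict(fit, newdata = data.frame(year = 3200))

print(c(year_2020 = unname(near), year_3200 = unname(far)))
```

Nothing in the fitted model signals that the far prediction is nonsense; that judgment has to come from knowing what values are physically possible.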

3:44

So, I'm going to show a quick example of some forecasting using the quantmod package and some Google data. If I load the quantmod package, I can load in a bunch of data for the Google stock symbol, and I can get it from the Google Finance data source. If I look at this Google variable, I get the open, high, low, close, and volume information for the Google stock from January 1st, 2008 to December 31st, 2013.

4:19

I can summarize this monthly and store it as a time series. I can use the to.monthly function to convert it to a monthly series, and I can take just the opening price information. Then I can create a time series object using the ts function in R. If I plot that, I can see the monthly opening prices for Google over a period of six years.
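The loading and monthly conversion steps might look roughly like the sketch below. It assumes the quantmod package is installed and a network connection is available. The helper name load_monthly_open is hypothetical, and src = "yahoo" is an assumption made for this sketch; the lecture pulls from Google Finance, which may not be available as a source in current versions of quantmod.

```r
# Sketch only: needs the quantmod package and a network connection.
# src = "yahoo" is an assumption; the lecture pulls from Google Finance.
load_monthly_open <- function(symbol = "GOOG",
                              from = "2008-01-01",
                              to = "2013-12-31") {
  daily <- quantmod::getSymbols(symbol, src = "yahoo", from = from, to = to,
                                auto.assign = FALSE)
  monthly <- quantmod::to.monthly(daily)      # one open/high/low/close row per month
  open_price <- quantmod::Op(monthly)         # keep only the opening price column
  ts(as.numeric(open_price), frequency = 12)  # monthly time series object
}

# Example call (not run here because it downloads data):
# ts1 <- load_monthly_open()
# plot(ts1)
```

Because the time series is built with frequency = 12 and no explicit start, its time index begins at 1 and advances by one unit per year of data.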

4:45

An example time series decomposition would decompose this time series into a trend, meaning any consistent increase or decrease over time, a seasonal pattern over time, and cyclic patterns, where the data rise and fall over non-fixed periods.

4:59

One way that we can do this is with the decompose function in R. If I decompose this in an additive way, I can see that there's a trend component, which appears to be an upward trend in the Google stock price. There also appears to be a seasonal pattern, as well as more of a random, cyclical pattern in the data set. So this is decomposing the series into a set of different types of patterns in the data.
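As a rough illustration of the additive decomposition, here is a sketch that applies decompose to a synthetic monthly series. The data are invented (a trend, a yearly seasonal swing, and noise) and stand in for the Google opening prices used in the lecture.

```r
# Synthetic monthly series: upward trend + yearly seasonal swing + noise.
# The numbers are invented; the lecture applies decompose() to GOOG prices.
set.seed(42)
n_months <- 72
month_index <- seq_len(n_months)
monthly_series <- ts(100 + 2 * month_index +
                       10 * sin(2 * pi * month_index / 12) +
                       rnorm(n_months, sd = 3),
                     frequency = 12)

parts <- decompose(monthly_series, type = "additive")

# parts$trend, parts$seasonal and parts$random add back up to the series
# (the trend and random parts are NA for the first and last six months).
print(round(parts$figure, 2))  # the 12 estimated seasonal effects

# plot(parts) would draw the observed, trend, seasonal and random panels.
```

In an additive decomposition the three components sum to the observed series, which is why the random part is simply whatever remains after the trend and seasonal estimates are subtracted.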

For training and test sets, I have to build sets that have consecutive time points. Here I am building a training set that starts at time point 1 and ends at time point 5, and then a test set that is the next consecutive set of points after that. That way, I can always build a model on a training set and apply it to a test set, where both have consecutive time points that show the same sort of trends that I've observed in my data.
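A minimal sketch of this kind of split uses the window function from base R. The series here is a hypothetical placeholder with 84 monthly values, and the choice to start the test set one month after time point 5, so the two sets do not share a point, is an assumption of this sketch.

```r
# Hypothetical monthly series with 84 points, indexed from time 1 (frequency 12).
monthly_series <- ts(seq_len(84), frequency = 12)

# Training set: time point 1 through time point 5 (the first 49 months).
train <- window(monthly_series, start = 1, end = 5)

# Test set: every later month, starting with the month right after time 5.
test <- window(monthly_series, start = c(5, 2))

print(c(train = length(train), test = length(test)))
```

Unlike a random split, every training observation comes before every test observation, so the model is always evaluated on data from its own future.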

There are a couple of different ways of doing forecasting. One is a simple moving average, which averages the values within a window around a particular time point. The prediction is then the average of the previous time points up to a particular time. You can also do exponential smoothing, in which we weight nearby time points more heavily than time points that are farther away. There's a large number of different classes of smoothing models that you can choose from.
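To make the two ideas concrete, here is a small sketch on toy values. The trailing three-point window and alpha = 0.5 are arbitrary choices for illustration, and exp_smooth is a hypothetical helper written for this sketch rather than a function from any package.

```r
x <- c(1, 2, 3, 4, 5, 6)  # toy values, for illustration only

# Trailing moving average of the last 3 points (sides = 1 looks backward only).
moving_avg <- stats::filter(x, rep(1 / 3, 3), sides = 1)

# Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t - 1].
# Unrolled, recent points get weight alpha, alpha * (1 - alpha), ... going back.
exp_smooth <- function(values, alpha) {
  smoothed <- numeric(length(values))
  smoothed[1] <- values[1]  # start from the first observation
  for (i in seq_along(values)[-1]) {
    smoothed[i] <- alpha * values[i] + (1 - alpha) * smoothed[i - 1]
  }
  smoothed
}

smoothed <- exp_smooth(x, alpha = 0.5)
print(rbind(moving_avg = as.numeric(moving_avg), smoothed = smoothed))
```

In both cases the value at the last time point would serve as the forecast for the next one. The moving average treats the points in its window equally, while exponential smoothing lets the weights decay geometrically with distance.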

6:30

For exponential smoothing, you can fit a model where you have different choices for the different types of trends that you might want to fit. Then, when you forecast, you get a prediction that comes out of your forecasting model, and you also get prediction bounds for the possible values that you could get from that prediction. You can get the accuracy of your forecast on your test set using the accuracy function, and it will give you root mean squared error and other metrics that are more appropriate for forecasting.
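The fit, forecast, and evaluate steps can be sketched end to end with base R alone. HoltWinters from the stats package is swapped in here as a stand-in and is not necessarily the function used in the lecture. The data are synthetic, the smoothing weights are arbitrary, and the root mean squared error is computed by hand rather than with an accuracy function.

```r
# Synthetic monthly data (invented numbers): trend + seasonality + noise.
set.seed(7)
month_index <- seq_len(84)
monthly_series <- ts(100 + 2 * month_index +
                       10 * sin(2 * pi * month_index / 12) +
                       rnorm(84, sd = 3),
                     frequency = 12)
train <- window(monthly_series, start = 1, end = 5)
test <- window(monthly_series, start = c(5, 2))

# Holt-Winters exponential smoothing with additive trend and seasonality.
# The smoothing weights are arbitrary illustrative choices; leaving them
# unset would let HoltWinters() estimate them from the training data.
fit <- HoltWinters(train, alpha = 0.3, beta = 0.1, gamma = 0.1)

# Point forecasts plus 95% prediction bounds for each test month.
fc <- predict(fit, n.ahead = length(test),
              prediction.interval = TRUE, level = 0.95)

# Root mean squared error of the point forecasts on the held-out test set.
rmse <- sqrt(mean((as.numeric(test) - as.numeric(fc[, "fit"]))^2))
print(rmse)
```

The forecast object has one row per test month with columns fit, upr, and lwr, so the prediction bounds can be compared against the held-out values as well as the point forecasts.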

7:09

I've obviously gone through this very fast, so if you want more information, there's an entire field dedicated to forecasting and time series prediction. I would highly recommend Rob Hyndman's Forecasting: Principles and Practice. It's a free book that's online, it's really, really good, and it has a lot of information about how to get started in forecasting.

So the cautions are: be wary of spurious correlations, be very careful about how far you predict out into the future with extrapolation, and be wary of dependencies over time, like seasonal effects.

If you would like more information on financial prediction and financial forecasting, the quantmod and Quandl packages are also very useful in that area.
