Learn how probability, math, and statistics can be used to help baseball, football and basketball teams improve, player and lineup selection as well as in game strategy.

From the course by University of Houston System

Math behind Moneyball

24 ratings

From the lesson

Module 10

You will learn how Kelly Growth can optimize your sports betting, how regression to the mean explains the SI cover jinx and how to optimize a daily fantasy sports lineup. We close with a discussion of golf analytics.

- Professor Wayne Winston, Visiting Professor

Bauer College of Business

In this video we'll talk about a very important statistical concept called regression towards the mean.

And then we'll apply it to sports betting. Okay, so, Sir Francis Galton, I believe, was the first person to come up with this.

Sort of like a biologist who looked at genes. And so what he observed was the following. Tall parents have tall kids, but their kids are closer to average than the parents are.

Okay, and so the way you can explain this, we've talked briefly about correlation. Correlation, which you can find with the CORREL function, is a measure of linear association between two variables. So it's between minus 1 and plus 1, and it's unit free.

So a correlation near plus 1 means that basically your two variables, let's call them x and y, tend to go up and down together. With a correlation near minus 1, when x goes up, y tends to go down, and when x is above average, y tends to be below average. A correlation near 0 means a weak linear association. But what you can show is this: if you run a regression with one independent variable, so you predict y as a function of x, let's say, you can ask how many standard deviations above average you predict y to be.

And the answer is, it's not blowing in the wind, it's the correlation times how many standard deviations x is above average.

So if the correlation between a parent's height and a kid's height is 0.5, and x is the height of the father and y is the height of the son, you take the number of standard deviations above average the father's height is, cut that in half, and that's how many standard deviations above average you would expect the son's height to be. And since a correlation is between minus 1 and plus 1, y will be closer to its mean, measured in standard deviations, than x is; in other words, in terms of z-score. And we'll see where this has an application in sports in a couple of minutes. But let's take a simple example. Okay, so let x be the father's height and let y be, let's say, the daughter's height.

And so let's suppose the mean of x is 69 inches. Mean of y is 65 inches. Let's suppose the standard deviation of x is 3 inches, to keep it simple. Standard deviation of y is 2 inches.

Okay, well, if the father is two standard deviations above average and the correlation's 0.4, we predict the daughter to be 2 times 0.4, or 0.8, standard deviations above average.

So you would take the average height, which is 65 plus 0.8 times the standard deviation of the daughter's, which is 2, and you'd predict 66.6 inches. But the point is, the father was two standard deviations above average, you'll predict the daughter to be 0.8 standard deviations above average. So what does this have to do with sports? Well, a couple of things.
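The father and daughter arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule "predicted z-score of y equals the correlation times the z-score of x"; the function name `predict_y` and its parameter names are made up for this sketch and are not from the lecture.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x using regression towards the mean."""
    z_x = (x - mean_x) / sd_x      # standard deviations x is above its mean
    z_y = r * z_x                  # predicted z-score of y shrinks by the factor r
    return mean_y + z_y * sd_y     # convert back to the units of y

# Numbers from the lecture: a father 2 SDs above the 69-inch mean, r = 0.4.
father_height = 69 + 2 * 3         # 75 inches
daughter_pred = predict_y(father_height, 69, 3, 65, 2, 0.4)
print(round(daughter_pred, 1))     # 66.6
```

A father exactly at the mean (69 inches) gives a predicted daughter exactly at her mean (65 inches), since the z-score being scaled is zero.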

Okay, so in Sports Illustrated, someone's on the cover and then they say, well, it's a jinx. They don't do as well after they're on the cover. Well, of course not. They were on the cover because they did something really good, unless they're members of the FIFA selection committee, I guess. But usually you're on the cover of Sports Illustrated when you do something that's really good.

The Madden video game jinx is the same thing. When someone's on the cover of Madden, it's because they had a great year. And so there's nowhere to go but down, because you had such a great year, and maybe you'll get hurt.

Okay, so where this has a real application is in sports gambling, if y, let's say, is predicted wins next season for an NFL team.

Okay, I don't have the Vegas stuff immediately available or don't have the time to really do the example. I mean, maybe we'll make it a homework or a test question. But basically you'll see that y, the predicted wins next season, tends to be closer to average wins, which is eight for an NFL team, than wins last season. Now why is that? If a team goes 12 and 4, what does it mean? Their players did well and their key players probably didn't get hurt. You expect them to do worse, or they'll maybe lose people to free agency. If a team was 4 and 12, well, probably they had a lot of injuries or they have young players they're developing. You expect them to win more games. So we can sort of look at this with the NFL and you can see why the phenomenon's called regression towards the mean. Your prediction is closer to the mean than your independent variable. Now there's basically a stronger tendency to regress towards the mean in the NFL than the NBA, and I think that's for a couple of reasons. The NFL, they vary the schedule. If you play poorly, you get an easier schedule. If you play well, you get a harder schedule. The NBA doesn't do that.

And the other thing is, if you've got LeBron James, you're going to have a good team no matter who else is on that team, or Steph Curry. And so basically one player can make a much bigger difference in basketball. And in football I don't think any one player, except possibly the quarterback, could make a huge difference. But any one great NBA player will make a big difference in the performance of an NBA team. Okay, so let's try and examine this. We have data: we have 2012 wins for every NFL team and 2011 wins for every NFL team. So we want to predict y from x. Now there's a couple of ways you can do this, and we'll get the correlation. But you can do what's called the trend curve. You can graph this stuff. Now you should put what you want on the y-axis in the right-hand column, so I flipped this around. And if I do Insert > Scatter Plot, I can get a nice scatter plot of this data. X-axis is 2011.

And you can do right-click, Add Trendline. You can do straight line, show the equation, and show the R squared.

And you can see that's a line of positive slope, and the correlation is the square root of the R squared, where you take the same sign as the slope of the line. So if I take the square root of that 0.0773, the correlation is 0.28 between the number of wins an NFL team gets one year and the number of wins they get the next year. So I can actually get all this stuff with Excel functions, which I don't think we've talked about.

I have them right here. So if I want the correlation between the 2012 wins and the 2011 wins, it doesn't matter which you put first for correlation.
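As a quick check of that step, here is a minimal Python sketch that recovers the correlation from the trendline's R squared and the sign of its slope. The value 0.0773 is the one quoted above, and 0.262 is the slope value reported a little later in the lecture; the variable names are just for illustration.

```python
import math

r_squared = 0.0773   # R squared shown on the trendline
slope = 0.262        # trendline slope; only its sign matters here

# The correlation has magnitude sqrt(R^2) and the same sign as the slope.
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 3))   # 0.278
```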

I get 0.278. Now the slope of that regression line, we see it's 0.262. And there's a slope function, where you have to put the y column first. Actually it would be easier if I name this stuff. So we'll name this stuff with Formulas > Create from Selection, names in top row. So if I take slope, there's a slope function. And if I hit F3, the 2012 plays the role of y. Hit F3, the 2011 plays the role of x.

And while you're at it, you can get an r squared. There's an RSQ function. Doesn't matter what goes first, but we'll put 2012 first. F3, 2011.

And you get 0.08 there, rounding off. Now these things we don't really need here. This would be the standard error of the regression, which we talked about from the Analysis ToolPak. If you want the standard error, which tells you how accurate this tends to be, you'd put 2012 first because it's what you're trying to predict, and 2011 second.
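For readers following along outside Excel, here is a rough Python sketch of what the CORREL, SLOPE, and RSQ functions compute. The helper names `correl`, `slope`, and `rsq` only mirror the Excel names for readability, and the win totals below are hypothetical numbers made up for this sketch, not the lecture's 2011 and 2012 data.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def correl(a, b):
    """Pearson correlation, like CORREL; argument order does not matter."""
    ma, mb = mean(a), mean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / sqrt(saa * sbb)

def slope(ys, xs):
    """Least-squares slope, like SLOPE; the y values go first."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def rsq(ys, xs):
    """R squared, like RSQ; for one predictor it is the squared correlation."""
    return correl(ys, xs) ** 2

# Hypothetical win totals for six teams (NOT the lecture's data).
wins_2011 = [13, 10, 8, 8, 5, 4]
wins_2012 = [11, 7, 10, 6, 8, 6]

print(correl(wins_2012, wins_2011))   # same value with the arguments swapped
print(slope(wins_2012, wins_2011))    # changes if the arguments are swapped
print(rsq(wins_2012, wins_2011))
```

Swapping the arguments leaves `correl` and `rsq` unchanged but changes `slope`, which is why the y column has to go first there.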

Okay. So let's take an NFL team that was about two standard deviations above average. I mean, that's really good. Let's suppose an NFL team won 14 games last year.

Now we know the average both years is 8, so we would predict them to only win about 10 games if they won 14 last year, which is a pretty small number. But let's do that in terms of regression towards the mean. So 14 wins is how many standard deviations above average?

Well, you take 14 minus the mean of 8. You divide by the standard deviation of 2011. So that's 1.83 standard deviations above average. So you predict 2012 to be the correlation, 0.278, times that, which is 0.509 standard deviations above average.

And what that would simply be is this: eight wins is average, so you would take 8 plus 0.509 times the standard deviation for 2012. And I got 9.57 instead of 9.54. I mean, that's virtually a wash there. I mean, I've rounded off a little bit. But that gives you a good feel for how regression towards the mean works. You'll almost always expect a sports team to play closer to average next season than they played last season. I think that's enough said about that. Well, just one reference. Again, that book Thinking, Fast and Slow by Kahneman.
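The 14-win example can be replayed in Python using only the figures quoted in the lecture. This is a sketch, not the spreadsheet itself: the standard deviation of 2012 wins is never stated, so the z-score route stops at the predicted z-score, and the win prediction instead uses the regression slope together with the fact that a least-squares line passes through the two means (both 8 here).

```python
mean_wins = 8        # average wins in both seasons, per the lecture
r = 0.278            # correlation between 2011 and 2012 wins
slope = 0.262        # slope of the regression of 2012 wins on 2011 wins
z_2011 = 1.83        # 14 wins in 2011, in standard deviations above average

# z-score route: the predicted z-score shrinks by the factor r.
z_2012 = r * z_2011
print(round(z_2012, 3))   # 0.509

# Slope route: the line passes through (8, 8), so move along it from the mean.
predicted_wins = mean_wins + slope * (14 - mean_wins)
print(round(predicted_wins, 2))   # 9.57
```

The slope route lands on 9.57, one of the two values quoted above; the small gap to 9.54 is attributed in the lecture to rounding.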