Learn how probability, math, and statistics can be used to help baseball, football, and basketball teams improve player and lineup selection as well as in-game strategy.

From the course by University of Houston System

Math behind Moneyball

24 ratings

From the lesson

Module 1

You will learn how to predict a team’s won-loss record from the number of runs, points, or goals scored by a team and its opponents. Then we will introduce you to multiple regression and show how multiple regression is used to evaluate baseball hitters. Excel data tables and the VLOOKUP, MATCH, and INDEX functions will be discussed.

- Professor Wayne Winston, Visiting Professor

Bauer College of Business

Okay, in this next video, we'll introduce you to the incredible ideas behind multiple regression. So, we're in the file regression data start.

Okay. And so, if y is runs scored in a season, you'll see several videos from now that, if I know how many singles, doubles, triples, home runs, and walks plus hit-by-pitches a team has, I can predict how many runs they score, and then develop a way to figure out how good a hitter is based on this regression, as you'll see. So, another example of y might be the QB rating in the NFL.

Incompletion percentage. And, if you know those four things, you can determine very accurately what a player's quarterback rating is. Well, then in basketball, you might want to predict how many games an NBA team wins in a season.

Well, it turns out there are four factors. Dean Oliver, who used to head ESPN analytics, came up with this great idea. And the four factors are based on how well you shoot as opposed to your opponent, how often you turn it over, how often you go to the foul line, and how well you rebound.

Okay. And so those would be three examples of the multiple regressions that we will see and, of course, there are many, many more that you could come up with in sports. Okay. So, you have a dependent variable y that you want to predict from independent variables x1 through xn. We're going to do multiple linear regression, and we'll see why the model's called that. You assume your best guess is that y equals some constant, beta 0, plus another constant times the first independent variable, and so on. You just multiply each independent variable by a constant and add them up.
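Written out as a formula, the model just described looks like the following. The hat on y, marking it as the best guess, and the subscripted betas are notation assumed here, not taken from the lecture.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
```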

And, why do we call this multiple linear regression? It's linear because we're multiplying each independent variable by a constant. Now, there are times when you want non-linear regression. If we have time, we'll do an example. For instance, to predict player performance based on age, you don't want it to be linear, because that would say that, as a player gets older, he either always gets better or always gets worse. And usually, there's an age up to which a player improves, and then they start getting worse, probably around age 27 to 30, depending on the sport.
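To see why a straight line cannot capture a peak age, here is a minimal sketch that assumes one common choice of curve, a squared age term. The lecture does not say which non-linear form it would use, and the coefficients below are made up purely so the peak lands at age 28.

```python
def quadratic_performance(age, b0=-292.0, b1=28.0, b2=-0.5):
    """Hypothetical performance curve: b0 + b1*age + b2*age^2.

    The coefficients are illustrative only, not estimated from any data.
    A negative b2 makes the curve rise, peak, and then fall.
    """
    return b0 + b1 * age + b2 * age ** 2


def peak_age(b1, b2):
    """Age at which the quadratic curve reaches its maximum (needs b2 < 0)."""
    return -b1 / (2 * b2)


print(peak_age(28.0, -0.5))
for age in (24, 28, 32):
    print(age, quadratic_performance(age))
```

With a purely linear model, the sign of the age coefficient would force performance to move in one direction forever, which is the problem the lecture points out.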

Okay. Now, it's linear because you're multiplying each independent variable by a constant. It's multiple because there's more than one independent variable. And it's regression because we say we're regressing y on the xs. Okay. So, we're going to give you an example of this and show you how to run regressions in Excel. So, this comes from my book Marketing Analytics.

If you're in the marketing business or have any job involving analytics, this is a bit more advanced book than the Excel book that I've recommended. But I mean, it's gotten really good reviews on Amazon, maybe the best reviews of any book I've written. Buyers say it's pretty good. It's from John Wiley, and I think it was 2014. Okay, and there's a chapter ten there. Let me just check the chapter, but I think it's ten.

Yeah, chapter ten discusses multiple linear regression in great detail. Okay, and DABM, the Data Analysis and Business Modeling book, discusses it in a few chapters. But chapter ten goes into much more advanced detail than Data Analysis and Business Modeling. And there are plenty of good books on regression. And, if you're going to be a sports analytics black belt, so to speak, you've got to really know everything about regression. Sometimes it's called econometrics, if you've taken it in an econ department. And they're focused mainly on analyzing financial or economic data.

Okay. And that's actually what we have here. It's called cross-sectional data. It's from a certain point in time, I think 2007. It comes from the great Economist Pocket World in Figures. And, if you've got to read one source of information for your business knowledge, I recommend The Economist. It gives you a balanced view of the world. The New York Times is really liberal, the Wall Street Journal is really conservative. I read both and I learn from both. But basically, The Economist helps me think on my own about the important issues of the decade. I'm sure the cover next week will be FIFA corruption. I have not got it in the mail yet.

But, okay, here's some economic data. For countries in Europe, we want to predict the sales per capita of computer products [INAUDIBLE]. So for instance, Austria sold $112 per person of computer products in the year we're looking at.

And, what are some variables that could help you predict this? Gross National Product per person: the bigger the gross national product, the bigger the country's economy. The unemployment rate: the bigger the unemployment rate, the less you'd probably expect them to spend on computers. And the percentage of GNP spent on education: the more that's spent on education, the more you'd think the country would spend on computers.

Now, why would you want such a model? Well, if you were working for Dell, HP, Lenovo, or another computer manufacturing company, you could set sales quotas for countries: what you should sell in the country based on the economics of that country. So, the idea is, you'd like to predict the yellow from the red, so we're going to run a regression and run through some concepts. So, how do you add the tool to run a regression? It's under the Data tab, Data Analysis, but how did I get that? Okay, I go File, Options,

Add-ins. And you check the Analysis ToolPak. There's two choices that you'll see. Check the first one. And that should bring it in. So, if I would go File > Options.

Add-ins. Go, do not collect $200. And check the first box, Analysis ToolPak. Click OK. Then, you should see it right here. Okay, now to run a regression, we just go to Data Analysis. There's a lot of cool stuff here, but I don't have time to talk about anything but regression. Now, the y range is going to be, I hit the little button to shrink the dialog box, the sales per capita. The x range is what you are using to make the prediction, so I click that little button again.

Okay, now, we have labels in the first row. You need labels or you don't know what your columns refer to. And for the new worksheet, I'm doing this in June, so I'll say June.

Regression1. Okay, now there's a bunch of stuff down here. The only thing I care about is residuals. Residuals are the actual value minus the prediction from the regression equation. And, it's important to look at the outliers and figure out what to do with data points that don't fit the pattern or prediction of most of the other data points. We'll see Finland's a huge outlier. So, we're going to put this in that new worksheet.

So, June1Regression. Let me check that we're right here. Okay, we're good. So, a couple of things that we want to talk about on the output. First of all, here's the equation. So, here's your predicted sales per capita. The best Excel could do, by multiplying each independent variable by a constant, comes from the coefficients here. So, it'd be minus, I'm just going to round off here quickly, because this won't be our best equation. Plus 0.002 times GNP, plus 4.22 times unemployment, plus 21.4 times education percentage. Okay. Now, the first question you should ask: when you run a regression, you can throw stuff in that has no importance. If you want to predict how many games the Yankees win in a season, if you throw in,

how many toy stores there are in New York City, you're probably not going to get that to be an important variable. We look at p-values for the independent variables. So, every independent variable has a p-value.

Between 0 and 1, and a low p-value, let's say less than 0.1 (some mathematicians say 0.05), means the independent variable helps us predict the dependent variable after adjusting for the others.

Okay, so we look here, we see unemployment rate. I mean, the 0.11 here is fairly close. I'm going to leave that in, because it's pretty close to 0.1. But, this means there's a 40% chance unemployment rate doesn't help us. Now, how good is the fit of our regression equation? There's two measures here, R Square and Standard Error. So what do those do? So, the R Square is 53%: our model explains 53% of the variation in, let's just call it sales per capita for computers. 47% is not explained. How accurate is the model? This is the really important number. 58.5 is the Standard Error. And this comes from the normal random variable, which will be discussed later in the class. For a normal random variable, and that governs things like height and weight, roughly 68% of your data is within one standard deviation of the mean, and 95% is within two standard deviations. So, I double that 58.5. Okay, I get that 95% of the forecasts from plugging into this equation should be accurate within 117, which is 2 x 58.5. Okay. So, any prediction off by more than two standard errors is called an outlier.
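The two-standard-error rule can be sketched in a few lines. The 58.5 is the standard error quoted in the lecture. The function name and the sample residuals of 120 and -50 are hypothetical and used only for illustration.

```python
STANDARD_ERROR = 58.5  # standard error of the first regression, from the lecture


def is_outlier(residual, standard_error=STANDARD_ERROR, multiple=2.0):
    """Flag a residual (actual minus predicted) that is off by more than
    `multiple` standard errors in either direction."""
    return abs(residual) > multiple * standard_error


threshold = 2 * STANDARD_ERROR
print(threshold)         # 117.0
print(is_outlier(120))   # hypothetical residual, beyond the threshold
print(is_outlier(-50))   # hypothetical residual, within the threshold
```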

Okay, so if we're off by 117 or more, we have an outlier. So, come down to the residuals that I checked. Residuals are the errors; they're actual minus predicted. And, how does Excel pick the coefficients in a regression?

You square the errors, actual minus predicted, so positives and negatives don't cancel out, and Excel picks the coefficients that make the sum of those squared errors as small as possible. It's like we did when we tried to find the best Pythagorean exponent. And, the sum of the errors will actually always be zero in regression, which is a beautiful property. Because what that means is, a positive residual means we sold more than the regression equation predicted, and a negative one means we sold less. The fact that they sum to zero (this is a decimal point, 12 zeros, and a 5) means that our equation splits the points in half.

But let's look for an outlier, and we've got one right here. Finland, the sixth country, is Finland. Nothing wrong with Finland, I'm sure it's a great country. Matter of fact, the book The Smartest Kids in the World had several chapters on the Finnish system of education. And I think the general consensus from that book was that Finland really has the best K through 12 educational system in the world, mainly because they pay K through 12 teachers a lot of money to make it a valued profession. We could learn from that. Okay. So, this 192, remember the standard error was 58, you double that, you get 116. So, the Finland prediction's off by more than three standard errors.
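The zero-sum property of the residuals can be checked with a small least-squares sketch in plain Python. This is not what Excel runs internally. It solves the normal equations by Gaussian elimination, the function names are hypothetical, and the six observations are made up for illustration.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bn] minimizing the sum of squared errors."""
    # Add a leading 1 to every row so the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def residuals(coefs, rows, y):
    """Actual minus predicted for every observation."""
    return [yi - (coefs[0] + sum(c * x for c, x in zip(coefs[1:], r)))
            for r, yi in zip(rows, y)]


# Made-up observations: two independent variables per row.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 4.0)]
y = [5.0, 4.0, 11.0, 9.0, 16.0, 13.0]

coefs = fit_least_squares(rows, y)
res = residuals(coefs, rows, y)
print(sum(res))  # zero up to floating-point rounding
```

The residuals sum to zero only because the model includes an intercept. A tiny nonzero value, like the one the lecture reads off the spreadsheet, is just rounding.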

So, that's a huge outlier. Even two standard errors is an outlier. Now, you have to decide what to do with an outlier. I mean, can you explain it by adding other variables? I know nothing about Finland, to be honest. This is such a different prediction that I feel like I need to throw it out of the equation and run it again. There might be something in the way that, maybe it's the Finnish school system that we talked about, or maybe the way the Finnish government handles things.

Maybe they subsidize the purchase of computers. I just don't know anything about this. So, I'm going to throw Finland out. Recall that unemployment rate had a high p-value, so I could throw that out too. But basically, we'll leave unemployment rate in for the moment, because we're not sure what would go on

with unemployment rate if we threw Finland out. So, you notice here, in the Finland out worksheet, there's no Finland. I deleted row six and I just ran the regression again. Again, I'm not going to spend the time on that, but use this column as your y values, and these columns will be your x values. If you run the regression again, you get the result in this worksheet, Regression2.

All right, notice there's 20 observations now. The R Square leaped to 74%, and the standard error is now 30, because that Finland observation had such an impact on our predictions. Before, the standard error was about 59. Okay, now, let's look at the p-values. Okay, GNP has a p-value of 0.004. That means that, after looking at unemployment rate and education spending, there's basically only four chances in 1,000 that GNP doesn't affect computer spending. There's an 84% chance unemployment rate doesn't affect computer spending, so we should throw that out. Now, what about the percentage spent on education? What's the p-value for that? 4%, so that's useful.

And, let's see if we have any big outliers. If you double that standard error to 60, I think we've got one outlier there. Yes. Data point 18, which I think is Switzerland. Now, that's barely above two standard errors, so I'm not going to throw that out.

We've done a lot in this video, but what we're going to do in the next video is toss out unemployment rate. You should throw out independent variables with high p-values, because they're just not useful. And so, we're going to throw out unemployment rate, and then we'll see an example of throwing out an insignificant independent variable in actually our first video using regression in sports. So, we'll run the regression again, and we'll get all good p-values, all low p-values. Then, we'll use that equation to predict computer sales, and that would be a decent equation. So, we'll see you in the next video where we continue our discussion of regression.
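The p-value screening described in this video can be sketched as a simple filter. The three p-values are the ones read off the second regression in the lecture, and the 0.1 cutoff is the one suggested earlier. The variable labels and the function name are hypothetical.

```python
P_VALUE_CUTOFF = 0.1  # cutoff suggested in the lecture; some prefer 0.05

# P-values quoted in the lecture for the regression without Finland.
p_values = {
    "GNP per capita": 0.004,
    "Unemployment rate": 0.84,
    "Education % of GNP": 0.04,
}


def split_by_p_value(p_values, cutoff=P_VALUE_CUTOFF):
    """Return (keep, drop): variables below the cutoff and the rest."""
    keep = [name for name, p in p_values.items() if p < cutoff]
    drop = [name for name, p in p_values.items() if p >= cutoff]
    return keep, drop


keep, drop = split_by_p_value(p_values)
print("keep:", keep)
print("drop:", drop)
```

After dropping a variable, the regression has to be run again, because the remaining coefficients and p-values change once a variable is removed.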
