Hi. In this lecture, we're gonna talk about a very simple class of models that
helps us make sense of data. These are known as categorical models. In a
categorical model, you basically bin reality into different categories, and you
hope these categories help you make better sense of the data.
That is, you hope they explain some of the variation in the data. I wanna start
out by just describing what a categorical model is like, and then we'll talk
about how these models can help us make sense of data.

So let me give an example. A long time ago, over a decade ago, I was at a
conference, and Amazon, which is a company that, you know, sells all sorts of
stuff over the web, was just going public. There was a discussion about whether
Amazon was a good investment or not. One person, who was a Wall Street investor,
said, you know, I think it's a horrible investment.
If you think of Amazon, all it is is a delivery company. They've just got a big
warehouse: you order stuff, they deliver it. The margins in that industry are
really small. There's already UPS and FedEx and DHL and all those sorts of
places. I just don't think there's any money in it.

Now, another person said, you know what, I'm gonna put Amazon in a very
different box. I'm gonna put it over here, in a box that says information,
because I think it's part of the new information economy. They're gonna gather
all this information about what consumers want. It's all going to be centrally
held. They're gonna be worth a ton of money.

Now it turns out that if you put Amazon in this information box, you probably
would have invested in it, and you'd have made a lot of money. If you put Amazon
in this delivery box, you wouldn't have invested in it, and you wouldn't have
made a lot of money. So which box you use, how you categorize things, affects
how you think about things and, again, what sort of decisions you make.

So
this leads to a phrase that one of my friends, who's a psychologist, once said:
lump to live. What my friend meant is that we create these lumps, these boxes,
these categories, in order to make sense of the world. So I look out there in
the street and I see a vehicle. I don't say, oh, it looks like a 1997 Ford F-150
pickup truck, right? Instead I just say truck, or I just say car. Or if I look
at a piece of furniture, I just say it's a dresser. I don't say it's an 1874
Chippendale dresser. I don't completely break it down. I just put things in
categories. And these categories, they're shortcuts, right? They help us make
sense of the world. And let's think about why we model again, right? One of the
reasons we model is to help us decide, strategize, and design.

Right. So one of the reasons we lump is that it helps us make quicker decisions:
we just put things in categories and say this is something I like, this is
something I don't like, this is something that's risky, this is something that's
not risky.

Let me give some fun examples. So the first one: let's suppose you're a kid and
you've gotta decide what am I gonna eat and what am I not gonna eat. Well, one
sort of categorization you might use is the green categorization. So you might
say, anything that's green: broccoli's green, a grasshopper's green, asparagus
is green. All these things are green, right?
Everything else: bananas, those are yellow. Candy bars, brown. Oranges, they're
orange. Pears, pears can be green, but we'll assume they're yellow. And
strawberries, they're red. These other things aren't green. And so your rule
could be: I'm gonna eat anything that's not green, and I won't eat anything
that's green. That rule will keep you safe from things like grasshoppers and
asparagus. Right, so that's a rule you might follow. Now, it's not an optimal
rule, because you might run into a green pear, and that green pear might be
something you'd really like, but if you've been avoiding green things, you may
decide, well, I'm not gonna risk it. So that's an example of a simple rule.

Let's now
show how you can use a rule like that to make sense of data. So now let's
suppose I've got a bunch of data here. These are different food items, and what
you've got in this column right here are calories. So this is how many calories
there are in each of these food items. What I wanna do is make sense of why some
things have a lot of calories and some things don't. And so I've got this list
of items.

Well, the first thing that I need to make sense of is how much variation there
is in this data. To understand how much variation there is, first I've got to
find out what the average value is. And then variation tells me how far things
are, on average, from that value. So if I add all this up, I've got 100 plus
250, that's 350, then 440, 550, 900, right? So we've got 900 divided by five,
which means the mean here is 180. So on average, everything in this group has
about 180 calories. But some things are higher, right? This is 350. And some
things are lower. This is 90. I want some understanding of how much variation is
in that data. So one way to
do that is we just subtract the mean from everything. So if we take 100 minus
180, that's gonna be minus 80. 250 minus 180, that's gonna be 70. 90 minus 180,
that's minus 90, right? 110 minus 180 is minus 70, and 350 minus 180 is 170.
Well, if I add all these things up, I'm gonna get minus 80, plus 70, minus 90,
minus 70, plus 170, and it's gonna be zero, because that's just what the mean
is: the differences above it and below it cancel out. So what I need is for all
these differences to be positive. One thing I can do is just take the absolute
value of all these things, right? And then I could add up the absolute values.
And I'd get 80 plus 70 is 150, plus 90 is 240, plus 70 is 310, plus 170 is 480.
So we could say the total difference from the mean is 480.

But
what we do in statistics is we tend to do something different. We actually tend
to take the difference and square it. And the reason we square it is really
twofold. One is that, again, it makes everything positive, which is what the
absolute value did before. And the other is that it amplifies larger deviations,
and those huge deviations are what we'd really like to prevent. So if I look at
the pear, I would have 100 minus 180, which is minus 80, and squared, that's
6,400. So that's the difference from the pear to the mean, squared. And if I do
it for the cake, I'm gonna get 250 minus 180, which is 70. And if I square that,
I'm gonna get 4,900. Now I could do this for all of them, right? So for the
pear, I get 6,400; for the cake, I get 4,900; for the apple, 8,100; for the
banana, 4,900; for the pie, 28,900. So the pie is, again, a long way from the
mean, and squaring it has a huge effect. So squaring amplifies larger mistakes.
Now if I add up all these numbers, I'm gonna get 53,200. That's what we call the
total variation.

So
I plotted that data, and this tells me sort of how much variation is in that
data. What I'd like to do is create categories that reduce that variation, that
somehow explain why some things are high and some things are low. So what's the
obvious categorization? The obvious categorization here is that pears and apples
and bananas are fruit, and cakes and pies, right, are desserts. So let's create
a fruit category and a dessert category. So in the fruit category, I've got one
thing that's 90, one thing that's 100, and one thing that's 110. And in the
dessert category, I've got one thing that's 250 and one thing that's 350.
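
Before looking at each category separately, here is a minimal Python sketch of the arithmetic done so far on all five items together: the mean, the differences from the mean, the sum of absolute differences, and the total variation (sum of squared differences). The calorie figures are the ones read out in the lecture. The variable and function names are only illustrative and are not part of the lecture.

```python
# Calorie figures as read out in the lecture.
calories = {"pear": 100, "cake": 250, "apple": 90, "banana": 110, "pie": 350}


def mean(values):
    """Arithmetic mean of a collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def total_variation(values):
    """Sum of squared differences from the mean."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


avg = mean(calories.values())
deviations = {item: c - avg for item, c in calories.items()}
abs_total = sum(abs(d) for d in deviations.values())
squared_total = total_variation(calories.values())

print(avg)                       # 180.0
print(sum(deviations.values()))  # 0.0, signed differences cancel out
print(abs_total)                 # 480.0
print(squared_total)             # 53200.0
```

The pie alone contributes 28,900 of the 53,200, which is the amplifying effect of squaring that the lecture describes.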

So let's look at them in more detail. For the fruit, I've got 90, 100, 110, so
the mean is gonna be 100 here, right? What's the total variation? Well, 90 minus
100 is minus ten, so if I square that, I get 100. 100 minus 100, right, is zero,
so if I square that, I get zero. And 110 minus 100 is ten, so if I square that,
I get 100. So the total variation here is just gonna be 100 plus 100, or 200. So
now I've got a mean of 100 and a total variation of 200. And now, if I go to the
desserts, the mean is gonna be 300, right? And what's the total variation? Well,
for the cake, it's 250 minus 300, which is minus 50, and squared, that's 2,500.
And for the pie, it's 350 minus 300, which is 50, and squared, that's also
2,500. Add those up, I get 5,000.

Alright, so let's clean this up a
little bit. By creating two categories, a fruit category and a dessert category,
I now have a mean in the fruit category of 100 and a variation of 200, and a
mean in the dessert category of 300 and a variation of 5,000. Now, what did I
start out with, right? When I had all the stuff together, I had a mean of 180
and a variation of 53,200. Now look at how much my variation has gone down. It
went from 53,200 to 5,200. So here's the idea: these categories substantially
reduce the amount of variation I have left over. So think of the variation as
what's unexplained. Initially, I can just say things have on average 180
calories, and we've got 53,200 units of variation that's unexplained. Now I say,
wait a minute, I'm gonna create a categorical model that says there are fruits
and desserts, and fruits have fewer calories than desserts. And you can say,
well, look, that appears to be the case. Fruits have a mean of 100. Desserts
have a mean of 300. The variation in the fruits is only 200, and the variation
in the desserts is 5,000. So I've reduced variation a ton.

What we want is a
formal measure of how much we've reduced variation. That's actually fairly
simple, right? We had a total variation of 53,200. The fruit variation is 200
and the dessert variation is 5,000, so that gives me 5,200. So I started with
53,200 and I've got 5,200 left. So what we wanna ask is, how much did I explain?
That's the question. How much of that variation did I explain? Well, I started
out with 53,200, and now I only have 5,200 left over, so the amount I explained
is just 53,200 minus 5,200, which is 48,000, out of 53,200. So the fraction of
the variation I explained was 48,000 divided by 53,200, which is a huge amount.
And I can write this more simply as just one minus the fraction that's left
over: one minus 5,200 over 53,200. That's just a simpler way to do it. And when
I do that, I get that the amount of variation I explained was about 90 percent:
90.2%. So that's how much of that variation I explained. This is equal to,
again, that 48,000 divided by 53,200.

Now, formally, this is called the R squared. This is the percentage of variation
that I explained just by that simple categorization. So, if the R squared is
near one, that means I explained almost
all the variation, so the model explains a lot, right? If the R squared is near
zero, that means I didn't really explain any of the variation, and the model
doesn't explain very much at all. Now, the better the model, the larger the R
squared it'll have. But there could be so much variation in the data that even a
great model only has an R squared of five or ten percent. There also could be
situations where the thing you're trying to explain is pretty understandable,
and a good model has to have an R squared of 90%. So there are no fixed rules
for what a good R squared is. It depends on what the data looks like. But within
a class of models, or a class of data, you can sort of figure out that this is a
good model and this is a bad model, based on
experience.

Let's push this a little bit further. We had, you know, fruits and desserts,
right? Those are our two categories. But if I had a whole kitchen's worth of
food, it may be the case that I'd wanna have more categories. So I might create
a vegetable category and a grains category. And then I could put everything in
one of these four boxes. One of the differences between experts and nonexperts
is that experts tend to have more boxes. They also tend to put the right things
in the right boxes, so they tend to have useful boxes. So if you want to be good
at predicting things or understanding how the world works, you have to have a
lot of categories, and those categories have to be the right categories. They've
got to explain a lot of the variation. And we can measure how much of the
variation your model explains by using that R squared.

One last point. Even if you explain a lot of
variation, it doesn't mean you've got a good model. Let's go back to the schools
case. So suppose I'm trying to figure out what makes a good school, what really
leads to good school performance. So I try all sorts of different boxes. I look
at schools that spend a lot of money versus schools that don't spend a lot of
money. Schools that have small class sizes and big class sizes. Schools that are
big and schools that are small, right? And nothing really seems to explain too
much of the variation. And then I create a box that I call the equestrian box,
and I put all the schools in here that have equestrian teams. And I find, oh my
goodness, every school with an equestrian team is great. Well, the thing is,
that doesn't mean that the equestrian team made the school good, right? So
statisticians make a distinction between correlation, which is, is there a
statistical relationship between having an equestrian team and being a good
school, and causation, did the equestrian team cause the school to be good? So
remember, when we
draw these boxes, when you think about putting things in a box: this box right
here has a bunch of good outcomes, and this box here has mostly bad outcomes.
That doesn't necessarily mean that the thing that defines this box, if it's the
equestrian box, is the reason that the schools were good. It could be that
there's some other reason. So why would you have an equestrian team? Well, you'd
only have an equestrian team if you had a lot of money. And you'd probably also
only have an equestrian team if you had a lot of parental involvement, things
like that, like a lot of support from the community. And you'd only have an
equestrian team if you had a lot of open space. So it could be that having an
equestrian team is a proxy for things like money, parental involvement, open
space, right, those sorts of things that actually do make a school good. So even
if your boxes work, that's no guarantee that they're actually the cause of why
it works.

Okay, so what
do we have? The simplest model you can possibly have is a categorical model. You
can say Amazon is a delivery company, or Amazon's an information company. You
can say things are either fruits or desserts, right? And creating these boxes
can help you sort of explain the variation in data. What we saw then is that a
simple way to measure how good a categorization is, is the percentage of the
variation you explained, and that's what we called R squared, right? For R
squared, you just take all the variation that was there and then ask how much
was left over. Then we subtract those two, and that tells you how much you
explained. So you ask what percentage of all that variation do you... Let me
start over.

Okay, what have
we learned? What we've learned is this. We've learned that the simplest kind of
model you can have is just a category-based model, right, where you just sort of
lump the world into different categories, and you place your data in different
boxes depending on what kind of data it is. So that could be information
companies versus delivery companies. That could be fruits and desserts, right?
And in doing that, what you can do is reduce the amount of variation you see in
the data.
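
As a rough illustration of that reduction, the following Python sketch applies the fruit and dessert boxes to the five food items from the lecture and computes the variation left over inside the boxes and the fraction explained. The helper names (`total_variation`, `within_box_variation`, `r_squared`) are made up for this sketch, and it assumes the data has some variation to begin with, so the division is defined.

```python
# Calorie figures and categories as given in the lecture.
calories = {"pear": 100, "cake": 250, "apple": 90, "banana": 110, "pie": 350}
category = {
    "pear": "fruit",
    "apple": "fruit",
    "banana": "fruit",
    "cake": "dessert",
    "pie": "dessert",
}


def total_variation(values):
    """Sum of squared differences from the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def within_box_variation(values_by_item, box_by_item):
    """Variation left over after measuring each item against its own box mean."""
    boxes = {}
    for item, value in values_by_item.items():
        boxes.setdefault(box_by_item[item], []).append(value)
    return sum(total_variation(box) for box in boxes.values())


def r_squared(values_by_item, box_by_item):
    """Fraction of the total variation explained by the boxes."""
    total = total_variation(list(values_by_item.values()))  # assumed nonzero
    left_over = within_box_variation(values_by_item, box_by_item)
    return 1 - left_over / total


left_over = within_box_variation(calories, category)
r2 = r_squared(calories, category)

print(left_over)     # 5200.0 (200 for fruit plus 5000 for dessert)
print(round(r2, 3))  # 0.902
```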

So there's a total variation, which is sort of just how much unexplained
variation there was out there in the world. By putting things in boxes, you
organize the data in such a way that you reduce the variation. The fraction by
which you reduce the variation is what we call the R squared. That's the percent
of variation explained. And the more variation explained, the better your
categorization is. Of course, if you create more boxes, you can explain more of
the variation. Where we're going next is linear models, which in effect can
create a different box for each value of x, our independent variable. Okay,
thanks.
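
The remark about more boxes can be checked on the lecture's five food items. This is a small Python sketch with made-up helper names, comparing one box, the fruit and dessert boxes, and one box per item. It suggests why a high R squared by itself does not make a categorization useful: giving every item its own box "explains" everything.

```python
def total_variation(values):
    """Sum of squared differences from the mean of `values`."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def r_squared(boxes):
    """Fraction of total variation explained by splitting the data into `boxes`."""
    everything = [v for box in boxes for v in box]
    left_over = sum(total_variation(box) for box in boxes)
    return 1 - left_over / total_variation(everything)


# Calories from the lecture: pear 100, cake 250, apple 90, banana 110, pie 350.
one_box = [[100, 250, 90, 110, 350]]
two_boxes = [[100, 90, 110], [250, 350]]       # fruit, dessert
five_boxes = [[100], [250], [90], [110], [350]]  # one box per item

print(r_squared(one_box))              # 0.0
print(round(r_squared(two_boxes), 3))  # 0.902
print(r_squared(five_boxes))           # 1.0
```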
