0:50

we can try to predict how popular different movies are going to be. So, this is just a summary of the data that's up on the course website. We've got a sample of movies that were produced during the years 2001 through 2005, with a lot of information available about those movies.

Now, if we look at the information that's available, there are features such as: What genre is it? Which studio produced it? What's the movie rating? Is it an adaptation of a graphic novel or a novel? Is it based on some other media? Some of these are yes/no answers. Other variables might have multiple options available, not just the two options. These are all categorical outcomes. That's the common thread here.

1:37

We've also got a block of financial measures. So, how much revenue was brought in? What was the production budget? What was the marketing budget? These are all quantitative variables. So, the nature of the variable is going to inform the type of analysis that we can apply to it.

All right. So, here are some of the ways that we might start looking at the categorical data, and we'll go through some of these. I'll demonstrate them, and I would encourage you to spend some time working with the Excel file to make sure that you're comfortable not only generating these different reports, but also understanding the trade-offs associated with them. Frequency tables are going to report numerical values to us, as are contingency tables, or cross-tabs. We might look at pie charts, bar charts, and column charts as ways of visualizing some of this output.

All right, so say we wanted to put together a frequency table. I'll jump over to the Excel file in a second, so that we can see what we're working with. This just gives you a snapshot of the number of movies produced by each studio in a particular year; that's in the middle column. So, of these 199 movies, we can see how they're distributed across the different studios. Notice that they're ranked in descending order. So, Universal had the most movies out in this year, followed by 20th Century Fox, Warner Brothers, and so forth. As we go further down this list, the Other category, lumping together all of those smaller studios that produced fewer than five films in that year, makes up a total of 44 of the 199 films.

And then the column on the side puts that on a percentage basis. So it's saying Universal produced 9.05% of the films, and 20th Century Fox produced 8.54%. The percentages descend until we get to that final row, where we've lumped the others together. That percentage column adds up to 100%.
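For anyone who wants to see the same idea outside of Excel, here is a minimal Python sketch of a frequency table with an Other row. The studio names and counts are hypothetical placeholders, not the course data, and the `MIN_FILMS` threshold simply mirrors the "fewer than five films" rule described above.

```python
from collections import Counter

# Hypothetical sample: one studio name per film (not the course data).
films = (["Studio A"] * 6 + ["Studio B"] * 5 + ["Studio C"] * 3
         + ["Studio D"] * 2 + ["Studio E"] * 4)

MIN_FILMS = 5  # studios with fewer films than this are lumped into "Other"

counts = Counter(films)
total = sum(counts.values())

# Keep the larger studios, sorted by descending count; lump the rest.
table = [(s, n) for s, n in counts.most_common() if n >= MIN_FILMS]
other = sum(n for n in counts.values() if n < MIN_FILMS)
if other:
    table.append(("Other", other))

# Add the percentage column (count divided by the total number of films).
freq_table = [(s, n, 100 * n / total) for s, n in table]

for studio, n, pct in freq_table:
    print(f"{studio:<10}{n:>4}{pct:>8.2f}%")
```

Because every film lands in exactly one row, including the Other row, the percentage column sums to 100%.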

Now, the way that we've reported it here, we're reporting for each studio what percentage of films it produced. You might also want to produce a cumulative column. So Universal produced 9% of the films, and 20th Century Fox produced 8.5%. So we might want to say, all right, the two studios that produced the most films: combined, how much did they produce? These would be running sums. So let's start off with the 9.05% for Universal. For 20th Century Fox, the next largest studio, we add in the 8.54%. For Warner Brothers, we add in another 8.54%. As we go further down that column, adding in the films produced by the smaller studios, we get closer and closer to 100%, and that's what this cumulative distribution would show us.

So, on the x-axis, or the horizontal axis, 1 corresponds to the studio that produced the most films, and 37 corresponds to the studio producing the fewest films. And as we include more and more studios, as I move from left to right on this graph, they account for a larger share of the films that have been produced.
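As a rough sketch of that running sum in Python: the film counts below are assumptions, back-calculated so that they reproduce the percentages quoted for the three largest studios out of 199 films. They are not read from the Excel file.

```python
from itertools import accumulate

TOTAL_FILMS = 199  # total number of films in the frequency table

# Assumed counts, chosen to match the quoted percentages
# (18/199 is about 9.05%, 17/199 is about 8.54%).
top_counts = [("Universal", 18), ("20th Century Fox", 17), ("Warner Brothers", 17)]

percents = [100 * n / TOTAL_FILMS for _, n in top_counts]
cumulative = list(accumulate(percents))  # running sum of the percentage column

for (studio, _), pct, cum in zip(top_counts, percents, cumulative):
    print(f"{studio:<18}{pct:>7.2f}%{cum:>8.2f}%")
```

Under these assumed counts, the top two studios together account for roughly 17.6% of the films, and the top three for roughly 26.1%. Continuing down the full list would bring the last cumulative value to 100%.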

Â 4:56

Another way of looking at this data: numbers tend to be a bit sterile, so we might want to put that into a bar chart. And so we can see how many films were produced by each studio. So that frequency table can be reformatted and put into a bar chart. And this is just focusing on a subset of those studios, just those that had at least five films. It may be easier, in terms of delivering reports, to have charts similar to this one rather than including a massive table.

Another way that we might represent the distribution of films across the studios would be with a pie chart. And I have a little bit of a love/hate relationship with pie charts, and you start to see why in this case. We've got a lot of studios that make up a very small slice. Well, think of making narrower and narrower slices of the pie. Try splitting that Other category into the individual studios. Fitting the data labels onto this chart is going to become very difficult. So in this case, we've included the name of the studio on the pie chart. We've included the percentage of the films that those studios are producing, and we can still see it on this chart. As you add more and more studios, as you have a categorical variable with more and more values, these pie charts may become less useful, because you can't visualize all of the possible options. One way around that might be to lump options together. So that's what we've done in this case, with the 22% falling into that Other category.

Â 6:39

All right. So just a couple of words of caution with these charts. We've talked about categorical variables in the sense that a movie falls under a studio. A movie falls under only one of the studios in our data set. Well, when you're making bar charts or pie charts, that's a requirement of the data: each observation can fall into only one of those categories, and all of your options have to add up to 100%.

One of the other things to be careful of: we focused just on studio. What if I wanted to look at studio by rating? So let's look at the PG-rated movies from the different studios. And say I want to draw some comparisons between the PG and the PG-13 movies. Well, a single pie chart is not necessarily the best way to go about doing that. I might have to do side-by-side pie charts or side-by-side bar charts to make those comparisons.

All right, but these are just ways of summarizing the categorical data that's available to us and visualizing it, and they're very helpful from a reporting perspective. Bar charts, pie charts, and frequency tables are ways of summarizing a single categorical variable. But what about when we want to see the relationships that exist between two categorical variables? Or, even more general than that, what if there are three or more categorical variables? Well, one of the popular tools that we can use to do that is a contingency table, or cross-tab table. So using the data that we have available to us, we can put together these tables. Say we wanted to look at the studio and the genre of the movies in there. Perhaps we want to see how many movies of a particular genre are made by different studios. Maybe we want to look at the relationship between studio and ratings.

Â 8:41

So this is one way of looking at this data. This is the raw count of the data. You'll notice, going across the rows, we have the studios. If we look down the columns, this is the movie rating. So we have G, PG, PG-13, and R, and these are the counts of how many movies of each rating were made by these studios. This is the cross-tab, so we're trying to look at the relationship that exists between studio and rating.

Â 10:07

If we're looking at that last row, think of it as the margin of the table. It's the marginal distribution for us. And so, what fraction of movies were rated G? 23 out of 351. What fraction of movies were rated R? 89 out of the 351. If we wanted to look at how many of these movies came from a particular studio with a particular rating, that's when we're going to jump into the individual cells. So Buena Vista movies rated G were 20 out of the 351. If we look at Buena Vista overall, I could add up this entire row, and that's going to tell me that that studio produced 87 movies out of the 351. So this tool is good for looking at two different variables; in this case, we're looking at just the counts. This is produced entirely in Excel using the pivot table feature, which is very convenient as far as organizing data and providing quick summaries.
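The lecture builds this table with an Excel pivot table. As a rough equivalent, here is a small Python sketch that counts (studio, rating) pairs and adds the row and column margins. The studios and films listed are hypothetical placeholders, not the 351 movies from the slide.

```python
from collections import Counter

# Hypothetical (studio, rating) pairs, one per film; not the course data.
films = [
    ("Studio A", "G"), ("Studio A", "PG"), ("Studio A", "PG"),
    ("Studio B", "PG-13"), ("Studio B", "R"), ("Studio B", "R"),
    ("Studio A", "PG-13"), ("Studio B", "PG"),
]

cells = Counter(films)                      # joint counts: one per table cell
row_totals = Counter(s for s, _ in films)   # margin: films per studio
col_totals = Counter(r for _, r in films)   # margin: films per rating
grand_total = len(films)

studios = sorted(row_totals)
ratings = ["G", "PG", "PG-13", "R"]

# Print the table: studios down the rows, ratings across the columns.
print(f"{'':<10}" + "".join(f"{r:>7}" for r in ratings) + f"{'Total':>7}")
for s in studios:
    row = "".join(f"{cells[(s, r)]:>7}" for r in ratings)
    print(f"{s:<10}{row}{row_totals[s]:>7}")
print(f"{'Total':<10}" + "".join(f"{col_totals[r]:>7}" for r in ratings)
      + f"{grand_total:>7}")
```

The last row and last column are the margins: each is the frequency table for one variable on its own, and both sum to the grand total.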

Â 11:16

This is the same data that we were looking at previously, but in this case, I've just reformatted that data. So instead of counting up the number of movies, this focuses on the fraction of the total. So you'll recall from the previous slide, there were 351 movies. Well, now we're looking at 100% of those movies. So divide each entry by 351 and we can see what the percentages are. So, movies rated G made up about 6.55% of the movies released by these four studios, whereas movies released by Universal made up just shy of 22% of the movies released by these four studios.
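To make the "divide each entry by 351" step concrete, here is a short Python sketch using only the counts quoted earlier in the lecture. The dictionary labels are just descriptive names chosen for this example.

```python
TOTAL = 351  # films across the four studios

# Counts quoted in the lecture for the raw-count version of the table.
quoted_counts = {
    "rated G (column total)": 23,
    "rated R (column total)": 89,
    "Buena Vista (row total)": 87,
    "Buena Vista, rated G (cell)": 20,
}

# Convert each count to a percentage of the grand total.
percent_of_total = {label: 100 * n / TOTAL for label, n in quoted_counts.items()}

for label, pct in percent_of_total.items():
    print(f"{label:<30}{pct:>6.2f}%")
```

The G column total works out to about 6.55%, matching the figure on the slide, and the same division applies to every cell and margin in the table.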

Â