0:13

We'll think about the parameters of the distribution and the goodness of the fit. For example, the Chi-Square test.

Welcome to session three of the third week of Modeling Risk and Realities. I'm Senthil Veeraraghavan again, a faculty member in the Operations, Information and Decisions Department at the Wharton School.

So far, we looked at data visualization in session one, and in session two we looked at a variety of discrete and continuous distributions. In this session, we're going to focus on how well a distribution fits. We will look at hypothesis testing and goodness of fit.

Â 2:51

In such cases, we recommend computer software to evaluate these tests; however, we will run the Chi-Square test for two common distributions for our data sets: the normal distribution and the uniform distribution.

What's a Chi-Square test? The Chi-Square test tests the following null hypothesis against an alternate hypothesis. The null hypothesis could be that the studied data comes from a random variable that follows a specified distribution, such as a normal distribution or a uniform distribution.

Â 3:44

In this test, you can disprove that the data came from a specific distribution, but you cannot prove that it came from that distribution. You can disprove that it came from a normal distribution, but you cannot categorically prove that it did come from a normal distribution.

Â 5:05

Suppose you have ten buckets and you're trying to fit a normal distribution that has two parameters, mean and standard deviation. The degrees of freedom here is 10 − 2 − 1, which is 7. For each Chi-Square test with some degrees of freedom, we can reject (or fail to reject) the null hypothesis with some confidence, which could be set at 99% or 95%, etc.
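As a side note, the degrees-of-freedom arithmetic, and the critical value you would look up in a table, can be sketched in a few lines of Python using SciPy (an illustration, not part of the course spreadsheet):

```python
from scipy.stats import chi2

bins = 10    # number of histogram buckets
params = 2   # fitted parameters (mean and standard deviation)
df = bins - params - 1   # degrees of freedom: 10 - 2 - 1 = 7

# Critical value at 95% confidence for a chi-square with 7 degrees of freedom
critical_95 = chi2.ppf(0.95, df)
print(df, round(critical_95, 3))   # 7 14.067
```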

Â 5:41

We will explore Chi-Square tests on our two datasets, Dataset1_histogram and Dataset2_histogram. We used Dataset1 in session one, and we generated the histogram. The histogram gave us two curves: the pdf, which is the probability density function we saw in week two, session two, given in the blue bar chart; and the cumulative distribution function, which gives us accumulated values, given in the red curve.

Just visualizing the pdf, it looks pretty flat, like a uniform distribution. Therefore, we run a Chi-Square test for a uniform distribution, and for that test we're going to use the min and max values from the data.

Â 6:42

So, what uniform distribution are we going to use? We're going to use a uniform distribution with the minimum value of 0.09 and the maximum value of 99.87 that we saw in the dataset. So there are two parameters to the uniform distribution.

Â 7:10

We have 7 degrees of freedom because there are 10 bins and 2 parameters, and 10 − 2 − 1 = 7. In the Excel video, I show you how to generate the Chi-Square test. We have the Dataset1_histogram file now, and we are going to look at how to run a Chi-Square test for our data using a theoretical distribution: a uniform distribution with a minimum of 0.09 and a maximum of 99.87. For that, we first need to generate the theoretical cdf.

Â 8:00

Recall the formula and the discussion we had in session two. We can do this by picking the value we're interested in, in this case 10, minus the minimum value (and I want to fix that cell reference), divided by the maximum value minus the minimum value. We need to fix every reference there except for the first term. So we have this.
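The same theoretical uniform CDF can be sketched as a small Python function, mirroring the spreadsheet formula, with the min and max taken from the lecture's dataset:

```python
# Uniform CDF on [xmin, xmax], mirroring the spreadsheet formula
# (x - min) / (max - min); values below/above the range clip to 0/1
xmin, xmax = 0.09, 99.87   # min and max from the dataset

def uniform_cdf(x: float) -> float:
    return min(max((x - xmin) / (xmax - xmin), 0.0), 1.0)

print(round(uniform_cdf(10), 4))   # CDF at the first bin edge of 10
print(uniform_cdf(99.87))          # the maximum value gives exactly 1.0
```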

Â 8:36

Let me calculate the theoretical CDF all the way through, except that the maximum point is not 100. We want to make sure the maximum point is not 100, but 99.87. That's our maximum value, so that gives us 1. So we have the cumulative distribution function. Let's write that in percentages so that it's easy to view. From this, we can also generate the theoretical probability of being in each bin.

Â 9:17

So the bin probability: for the first bin, it's between zero and ten, so it says exactly that. For the second bin alone, this cumulative value covers the first and the second bin, so to calculate the theoretical probability of falling within the second bin, you just take the second bin's cumulative value minus the first bin's. That gets you 10%. We can do this for all the calculations, all the way through, and we get the theoretical bin probabilities. So, given this data set, any data point from that random variable's distribution has a theoretical probability of 9.98% of falling in the lowest bin, a probability of 10.02% of falling in the second bin, and so on.

So this distribution is almost uniform. The theoretical distribution is uniform, but the bins are cut off at 0.09 and 99.87. So let's actually compare what's going to happen for the frequency of around 250 points. The 250 points are going to fall into these bins with these probabilities. So 250 multiplied by the theoretical bin probability gets you the number of points likely to be in each bin, and so on for all calculations. So theoretically speaking, you should get about 24.8 points in the first bin, about 25 points in the second bin, 25 in the next, and so on. The first and the last bins are slightly smaller because they're cut off not at 0 and 100, but at 0.09 and 99.87, so they have a slightly smaller probability.
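The expected-count arithmetic above can be sketched in Python; the bin edges 0, 10, …, 100 are an assumption based on the histogram, and this is an illustration rather than the course spreadsheet:

```python
import numpy as np

xmin, xmax = 0.09, 99.87        # min and max from the dataset
n = 250                         # number of data points
edges = np.arange(0, 101, 10)   # assumed bin edges: 0, 10, ..., 100

# Uniform CDF at each edge, clipped so the cut-off at 0.09 and 99.87 shows up
cdf = np.clip((edges - xmin) / (xmax - xmin), 0.0, 1.0)

bin_probs = np.diff(cdf)        # theoretical probability of each bin
expected = n * bin_probs        # theoretical bin frequencies

print(np.round(expected, 1))    # first and last bins come out slightly smaller
```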

And so now we have the theoretical frequency

Â 11:21

and the actual frequency in the data set. Now we can run the chi-square test. So I am going to write "chi-square test" here. And we can write a number for the chi-square test; it's a formula: CHISQ.TEST. Then you choose the actual frequency range and the theoretical frequency range.

Â 12:00

When you close it, you get about 0.0127; rounding to three decimals, it's 0.013. That's the value from the chi-square test that you're going to use, and there are 7 degrees of freedom here. So now we have run the chi-squared test, trying to fit a uniform distribution on our data. Let's go ahead and look at the table and see whether we are able to reject the null hypothesis. We generated the chi-square values, and the chi-square test gives us a value of 0.013.
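For comparison, SciPy's `chisquare` runs the same test outside Excel (note that Excel's CHISQ.TEST returns a p-value). The observed counts below are illustrative placeholders, not the course dataset, and `ddof=2` accounts for the two fitted parameters so the degrees of freedom come out to 10 − 1 − 2 = 7:

```python
import numpy as np
from scipy.stats import chisquare

# Illustrative observed counts (NOT the course data) against flat expected counts
observed = np.array([30, 20, 25, 25, 25, 25, 25, 25, 25, 25])
expected = np.full(10, 25.0)

# ddof=2 reduces the default k - 1 degrees of freedom by the 2 fitted
# parameters, giving df = 10 - 1 - 2 = 7 as in the lecture
stat, p_value = chisquare(observed, expected, ddof=2)
print(stat)               # 2.0 for these made-up counts
print(p_value > 0.05)     # True: fail to reject the null hypothesis
```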

Â 12:38

Now we can look up that value for 7 degrees of freedom in the tables that I provided you. For example, follow the web link and you'll find the following: we fail to reject the null hypothesis. That is, we fail to reject the hypothesis that the data comes from a uniform distribution, with a high degree of confidence.

Â 13:00

Remember, we cannot prove for sure that the data comes from a uniform distribution, but we have failed to reject the null hypothesis that the data comes from a uniform distribution. Remember, the chi-square test is a one-sided test.

Now let's look at data set 2. On data set 2, the figure gives us a histogram with the pdf in the blue bars; the pdf is the probability density function. And the CDF in the red curve; the CDF is the cumulative distribution function. The visualization of the pdf tells us it looks like a normal distribution.

Â 13:47

Hence, let's fit a normal distribution on this data set. We run a chi-squared test for a normal distribution using the average and the standard deviation from the data. In data set 2, we will look at the goodness of fit of a normal distribution with the sample average of 47.2 and the standard deviation of 15.78, which we calculated from the data set.

Â 14:23

Again, the degrees of freedom is 7. We run the chi-squared test as we see in the Excel video. In the Dataset2_histogram file, we have the histogram that we generated in the first session of this week. We have a histogram that looks like a bell curve. It suggests we should check for a normal distribution. So we're going to test for a normal distribution in our data and see whether the normal distribution is a good fit for our data set. And the chi-squared test is a goodness of fit test.

Â 15:03

To do that, the first step is to derive the theoretical CDF of the normal distribution; we saw the formulas in session two where we derived this. So we'll just use NORMDIST. We can pick a value x, and we're going to pick the mean of the normal distribution as 47.20 and the standard deviation as 15.78. Let's fix those references by pressing F4, and then the last option is whether to use cumulative or probability. We want the cumulative, so press one, or write TRUE, or choose cumulative.

Â 15:42

We get the value 0.001. And we take it all the way to the last cell, which gets us to 0.9999589, which is pretty close to one. But it's not exactly one, because the normal distribution has a tail going to infinity.
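The NORMDIST(…, TRUE) step has a direct Python analogue in `scipy.stats.norm.cdf`, using the mean and standard deviation from the lecture (a minimal illustration):

```python
from scipy.stats import norm

mean, sd = 47.2, 15.78   # sample average and standard deviation from the data

# Theoretical normal CDF, the equivalent of Excel's NORMDIST(x, mean, sd, TRUE)
print(norm.cdf(mean, loc=mean, scale=sd))   # 0.5 at the mean
print(norm.cdf(100, loc=mean, scale=sd))    # close to, but below, 1:
                                            # the normal's tail goes to infinity
```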

Once we have the CDF, which gives us the cumulative value, that is, the sum of all these bars up to that point, we need to calculate what fills each bucket: which bar falls into each bin. For that, we need to subtract two adjacent cumulative values. So that's what we're going to do in the next column: figure out the probability of falling in each bin, theoretically.

Â 16:36

The first bin's probability is just the probability of falling at or below the first bin, so that's M4, 0.001. The probability of falling in the second bin is the probability of being below the second bin's edge but above the first bin's, so M5 minus M4. We do this for every value up to the last point, and we get the probability values. To make better visual sense of this, I'm going to convert these to percentages, and you can see that.
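The subtract-adjacent-cumulative-values step can be sketched with `np.diff`; the bin edges 0, 10, …, 100 are an assumption for illustration, and `prepend=0` makes the first bin's probability equal to its plain cumulative value, like the M4 cell:

```python
import numpy as np
from scipy.stats import norm

mean, sd = 47.2, 15.78
edges = np.arange(0, 101, 10)   # assumed bin edges for illustration

cdf = norm.cdf(edges, loc=mean, scale=sd)

# prepend=0 makes the first entry the plain cumulative value (like cell M4);
# every later entry is the difference of two adjacent cumulative values
bin_probs = np.diff(cdf, prepend=0.0)

print(np.round(100 * bin_probs, 2))   # percentages, shaped like a bell curve
```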

Â 17:18

Pick a data point at random. Where is it going to fall? It has a 25% chance of falling in the middle, an 18% chance of falling in the mid ranges, whereas it has a 0.14% chance of falling in the lowest bin or a 0.29% chance of falling in the highest bin. So it gives a shape like a bell curve.

Â 17:47

Those are the theoretically calculated probabilities, and the theoretically calculated bin frequencies are as follows. We have 250 data points, and each of the 250 data points has some probability of falling in each bin. We just multiply 250 by that probability. We have 0.34 for the first bin, and we can take it all the way to the last bin. So, we should expect about 61 values in the middle bins and very few towards the edges. Let's compare the actual frequency with the theoretical frequency. The theoretical frequency is not in whole numbers.

Â 18:31

That sets up our chi-square test. Now we can run our chi-square test. The way to run it is the formula CHISQ.TEST: pick the actual range of frequencies, pick the theoretical range of frequencies, and you have the chi-square value, 0.8851. We have seven degrees of freedom in our chi-square test; we'll see that soon in the PowerPoint presentation. We take this value of the chi-square test, and we're going to check whether the normal distribution is a good fit.

Â 19:30

Again, the precise value doesn't matter. We look at the link for our degrees of freedom, and we find that we fail to reject the null hypothesis that the data came from a normal distribution.

Â 20:43

If this maximal difference value is low, then the fit is very good. Which means: we are comparing two columns in ascending order, and if the gap between the two columns is never very high, then this is a good fit. Typically, a maximal value of 0.03 or 0.04, or even lower, is considered a very good fit.
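That maximal-difference check (in the spirit of a Kolmogorov-Smirnov statistic) is easy to sketch; the two CDF columns below are illustrative placeholders, not the course data:

```python
import numpy as np

# Two CDF columns in ascending order: empirical vs theoretical
# (illustrative numbers, NOT taken from the course dataset)
empirical   = np.array([0.10, 0.21, 0.33, 0.52, 0.74, 0.90, 1.00])
theoretical = np.array([0.12, 0.20, 0.35, 0.50, 0.72, 0.91, 1.00])

max_gap = np.max(np.abs(empirical - theoretical))
print(max_gap < 0.03)   # True: under the 0.03-0.04 rule of thumb, a good fit
```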

Â 21:09

Modeling Using Continuous Distributions. As you can see, depending on the size and the nature of the data, modeling reality using continuous distributions and choosing the correct distribution that fits our data is a challenging task.

Â 21:35

Hence, in real life, simulation is often used, and that will be our focus in week four. Anyway, congrats on finishing week three, and best wishes for week four. I'm Senthil Veeraraghavan, a faculty member in the Operations, Information and Decisions Department. You can follow me @senthil_veer. We've just completed week three of the course.

Â