Let's introduce some notation and

mathematical facts about the process of sampling and its error.

There's a profound theory behind this, but don't worry,

we won't go deep now,

you'll get just enough statistics to use when sampling big data.

Let's start with the simplest case.

Say you have a binary feature,

one that can take only two values.

For example, in a database of criminal records,

it might be an indicator of whether the committed crime is violent or non-violent.

In a housing data set,

a binary feature might describe whether the apartment has a canal view,

or whether it has a roof terrace.

All these features can take only two values.

The canal is either in front of the house or it isn't.

For convenience, we'll encode those values with zeros and ones.

Now, when you have a feature like that,

you might be interested in the proportion of its values.

What is the percentage of the apartments with roof terraces?

What proportion of crimes are violent?

If you have the whole data set,

you could just calculate that proportion,

but could the same quantity be estimated from just a small piece of data?

Sure. Let's take N items from our data set.

Denote by X with a superscript N

a sample of size N. Denote the elements of the sample as X with

subscripts from one to N. What is the proportion of ones in this sample?

It is actually the same as the average of all Xs.

It seems logical to use this average p-hat as a sample estimate of

the unknown proportion of ones in the whole data set P. Indeed,

if we had the whole data set, we would have used the same average to calculate P itself.

It seems reasonable to apply the same operation to the sample to calculate the estimate.

This logic does not always work,

but in this case it does.
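As a minimal sketch of that logic (the sample values here are made up), the point estimate is just the sample mean of the zeros and ones:

```python
import numpy as np

# A hypothetical binary sample of size N = 10
sample = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

# The point estimate p-hat of the proportion of ones
# is simply the sample mean
p_hat = sample.mean()
print(p_hat)  # 0.6
```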

By the way, an estimate like that,

the one that estimates the value of the unknown parameter

with just one number is called a point estimate.

Could an estimate possibly be something other than one number?

You'll see very soon.

Let's get back to our sample of 100 taxi trips,

and the question of what percentage of passengers leaves tips.

There is a column called tip amount in the data.

Let's create a binary vector, is_tipped, with

ones indicating where this column's values are above zero.

The mean of this vector and our estimate

of the proportion of the tipping customers in the whole data set will be 0.66,

about 13 times more than I expected.
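In code, that step might look like this sketch; the tip values below are made up stand-ins for the real column:

```python
import numpy as np

# Stand-in for the tip amount column of the taxi data (made-up values)
tip_amount = np.array([0.0, 1.5, 2.0, 0.0, 3.0])

# Binary vector: 1 where a tip was left, 0 otherwise
is_tipped = (tip_amount > 0).astype(int)

# Its mean is the point estimate of the proportion of tippers
p_hat = is_tipped.mean()
print(p_hat)  # 0.6
```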

Now, is that estimate P Hat good?

Meaning is it close to the unknown value of P,

the proportion of ones over the whole data set?

Well, that actually depends on the sample itself.

First of all it must be random.

That's why we need to shuffle the rows.

Second, of course, the bigger the sample, the better the estimate,

the closer it is to the true value of the parameter

P. If you have just one object in the sample,

your estimate could only possibly be zero or one.

Given two objects you could also get 0.5 and so on.

The set of all possible values of P Hat grows with N. This is just common sense.

Bigger sample, better estimates.
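A quick way to see this: the possible values of P Hat for a sample of size N are just k divided by N, for k from zero to N, so the set of values grows with N:

```python
# Possible values of p-hat for a few small sample sizes n:
# they are k / n for k = 0, 1, ..., n
for n in [1, 2, 4]:
    values = [k / n for k in range(n + 1)]
    print(n, values)
```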

But can we make this statement more precise?

Can we actually quantify the accuracy of the estimate?

There is a way to do that.

First of all, for the estimator you could

calculate the quantity called standard deviation.

It is a measure of the spread of the values your estimate could

take, across all possible samples, around its mean value.

For the proportion estimate P Hat,

the standard deviation is approximately equal to the square root of

P Hat times one minus P Hat, divided by N. In our taxi tipping sample,

the standard deviation is equal to 0.05 approximately.

But is it a lot,

or is it a little?

To eliminate that vagueness we need one more concept: a confidence interval.

For the parameter P, it is a pair of functions of the sample, CL and CU,

such that the interval from CL to CU covers

P with probability not less than one minus alpha.

Usually Alpha is set to 0.05,

and then the confidence interval is called a 95% confidence interval.

It means that if we calculate such intervals on a hundred different random samples,

about 95% of them will cover the true unknown value of the parameter.
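This coverage property can be checked with a small simulation; the sketch below assumes numpy and a made-up "true" proportion:

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.66       # pretend this is the unknown true proportion
n, z = 100, 1.96    # sample size and the 95% multiplier

trials, covered = 1000, 0
for _ in range(trials):
    sample = rng.binomial(1, p_true, size=n)
    p_hat = sample.mean()
    sd = np.sqrt(p_hat * (1 - p_hat) / n)
    # Does the interval p_hat +/- z * sd cover the true value?
    if p_hat - z * sd <= p_true <= p_hat + z * sd:
        covered += 1

# The fraction of covering intervals should be close to 0.95
print(covered / trials)
```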

Confidence interval estimates the value of the parameter not with just one number,

but with a whole interval.

So it's not a point estimate,

it's an interval estimate.

For the proportion, the confidence interval is given by the formula P Hat plus or minus Z

times the standard deviation of the estimator.

Z depends on the confidence level Alpha,

and for the standard Alpha equal to 0.05,

it is approximately equal to 1.96, or, even more approximately, two.

This formula is a more precise version of the two sigma rule.

You might have heard of it. For our sample of 100 taxi trips,

the 95% confidence interval for the proportion of tippers can be

calculated with the function proportion_confint from the module statsmodels.

It gives us the interval from 0.567 to 0.753,

and it's pretty wide.

It turns out that we are not so sure about our estimate of 66% tippers:

with 95% confidence, the percentage of tippers might be as low as 57% or as high as 75%.
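Here is how that call might look, assuming statsmodels is installed (66 tippers out of 100 trips, with the normal-approximation method):

```python
from statsmodels.stats.proportion import proportion_confint

# 66 tippers out of 100 sampled taxi trips, 95% confidence
ci_low, ci_upp = proportion_confint(count=66, nobs=100,
                                    alpha=0.05, method='normal')
print(round(ci_low, 3), round(ci_upp, 3))  # 0.567 0.753
```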

Can we get a more precise estimate?

Sure we can. We just need a bigger sample.

Just like point estimates get more precise with growing sample sizes,

confidence intervals get narrower,

but how big do we need N to be exactly?

There is a special function, samplesize_confint_proportion,

that, given your guess of

the true proportion and the desired precision,

which is the half-width of the confidence interval,

returns the required sample size.

If we want our confidence interval to be 2% wide, as you can see,

we might need a sample of at least 9,108 taxi trips.
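The call might look like this sketch, again assuming statsmodels; note that the exact number returned depends on your guess of the true proportion, so it may differ slightly from the figure quoted above:

```python
import math
from statsmodels.stats.proportion import samplesize_confint_proportion

# A half-width of 0.01 gives a 2%-wide interval;
# 0.66 is our guess of the true proportion from the pilot sample
n_required = samplesize_confint_proportion(proportion=0.66,
                                           half_length=0.01)
print(math.ceil(n_required))  # on the order of nine thousand trips
```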

Let's take a sample of 10,000.

In that sample, we have 61% tippers, with a 95% confidence interval from 60.3% to 62.2%.

Indeed, the width of this interval is about 2% just like we wanted.

From this video, you learned how to estimate the proportion of ones in a binary feature.

A point estimate is just a proportion of ones in a sample.

We have also talked about the concept of confidence intervals which

allow you to explicitly quantify the degree of uncertainty in the estimate.

We also learned to apply the function proportion_confint,

which calculates the confidence interval for the proportion, and

samplesize_confint_proportion, which helps you choose

a sample size big enough to obtain a confidence interval as narrow as you like.