So this is what we're basically querying against.
What have we done here? We said go ahead and select the departure_delay,
and count the number of flights.
So this is the number of flights at
a specific departure_delay, because we are grouping by departure_delay.
So for example, if the departure_delay is negative 37,
in other words, the flight departed 37 minutes early,
how many flights were there?
There are 107 such flights in the dataset,
and these are the quantiles.
So, this is each 20th percentile, right?
Because we divide by five.
Like 80 percent of those flights
arrived 66 minutes or more early,
and 60 to 80 percent of flights arrived between 41 minutes and 66 minutes early, and so on.
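As a sketch of what that query computes, here is a pandas analogue of grouping by departure_delay, counting flights, and taking five quantiles of arrival delay (the data below is a hypothetical toy stand-in for the real flights table, not actual values):

```python
import pandas as pd

# Toy stand-in for the BigQuery flights table (hypothetical values).
flights = pd.DataFrame({
    "departure_delay": [-37, -37, -37, -37, -37, 0, 0, 0],
    "arrival_delay":   [-80, -66, -50, -41, -20, -5, 0, 10],
})

# For each departure_delay: count flights and compute five-quantile
# boundaries of arrival_delay, analogous to
#   SELECT departure_delay, COUNT(1) AS num_flights,
#          APPROX_QUANTILES(arrival_delay, 5) AS quantiles
#   FROM <table> GROUP BY departure_delay
summary = flights.groupby("departure_delay").agg(
    num_flights=("arrival_delay", "size"),
    quantiles=("arrival_delay",
               lambda s: list(s.quantile([0, .2, .4, .6, .8, 1]))),
)
print(summary.loc[-37, "num_flights"])  # flights that departed 37 min early
```

The quantile boundaries step in 20-percent increments because we asked for five quantiles.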
So we had a question that I asked you:
if the departure_delay is 35 minutes early,
what is the median value?
And the median value
would be the value in the middle,
right? So, 28 minutes.
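The idea is that once you have the quantile boundaries, the median is just the middle entry of that sorted list. A minimal sketch, with illustrative numbers chosen so the middle value matches the 28 minutes from the example:

```python
# Quantile boundaries for one departure_delay bucket
# (illustrative numbers, not real data).
quantiles = [-5, 10, 28, 45, 66]

# The median is the middle element of the sorted quantile list.
median = quantiles[len(quantiles) // 2]
print(median)  # 28
```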
So, if we go back to our console,
we now see that Datalab asks us whether we want to continue; say "Yes,"
and accept all of the prompts.
So let's go ahead and run this other query
to find the airport pair,
meaning a specific departure airport and
a specific arrival airport, that has the maximum number of flights between them.
So this is again from the same table,
but now I'm selecting the departure_airport
and the arrival_airport, and counting the number of flights,
but grouping by both the arrival_airport and the departure_airport.
And ordering by the number of flights descending, which means
that the airport pair with the maximum number of flights will be first.
And I'm limiting to 10,
so I'm going to get the first 10,
the 10 most common pairs.
So notice that this is a query that processed 70 million records.
And when I did it,
it took me 2.3 seconds.
How is that possible? Well, it's because
the 70 million records weren't processed on this one machine that I'm running on, right?
It's run on thousands of machines.
It's run at scale.
And this is what we mean when we say we launch services on the Cloud:
we do these things in a serverless way.
But anyway, going back here,
it turns out that if the departure_airport is LAX
and the arrival_airport is SAN,
there are 133,000 flights.
So that's the airport pair with the maximum number of flights between them.
So at this point, when we go back to Cloud Shell,
we see that we can click on Web preview
and change the port to 8081 to start using Datalab.
That is this item here, Web preview:
select that, and change the port to 8081.
And at this point,
we are now inside Datalab.
Everything that you've done in BigQuery so far has been great.
We have been able to go ahead and run SQL queries on millions of rows of data,
get our answers back in seconds.
That's great, but what we really want,
in addition to getting those answers is to do things like drawing graphs, et cetera.
We want to be able to visualize the data.
And visualization is one of those things that you can't do in the BigQuery Console.
We want to use a custom visualization tool.
In this case, we're going to use Datalab,
which has full access to all of
the Python goodness to go ahead and do all of our graphing.
So what we're going to do here is that we're going to run one of our queries,
but we're going to do this not from the BigQuery Console,
but from within Datalab.
So here we are in Datalab;
I'll go ahead and start a new notebook.
And in this notebook,
what we have here is a code cell,
so I can go and paste the code in that cell,
and hit "Run" to run the code.
So, all of this is being executed by BigQuery.
So, on the same order of seconds,
we're going to be analyzing these millions of flights,
and what we're now doing is getting the result back as a pandas DataFrame.
So to_dataframe here gives us a pandas DataFrame.
So, it basically shows you the first few rows of that dataframe, and as before,
we have a departure_delay,
we have the number of flights,
and we have the deciles because in this case,
I'm doing the quantiles as 10.
So there are 10 of them,
and we get them back as a Python list.
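The exact client call in Datalab depends on the library version, so rather than the BigQuery call itself, here is a sketch of the shape of the DataFrame that comes back: one row per departure_delay, with the deciles arriving as a plain Python list in a single column (the delay and decile numbers below are illustrative, except the 107 flight count from the earlier example):

```python
import pandas as pd

# Illustrative shape of the query result after to_dataframe():
# one row per departure_delay, deciles as a Python list per cell.
df = pd.DataFrame({
    "departure_delay": [-37.0, -36.0],
    "num_flights":     [107, 112],
    "deciles": [
        [-80, -66, -55, -48, -41, -35, -30, -26, -21, -10],
        [-78, -64, -53, -46, -40, -34, -29, -25, -20, -9],
    ],
})
print(type(df.loc[0, "deciles"]))  # each deciles cell is a plain list
```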
If we now go ahead and take the same dataframe
and basically do a quick rename,
what we now have is that we've taken this deciles data,
broken it up,
and gotten 0 percent,
10 percent, 20 percent, 30 percent,
et cetera, as separate columns.
Why am I doing that? Having separate columns
allows me to do the next thing that I want to do.
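One way to sketch that break-up step in pandas, turning the single list-valued deciles column into one column per percentile (hypothetical values again; the real lecture code may differ in detail):

```python
import pandas as pd

# Illustrative frame with a deciles list column (hypothetical values).
df = pd.DataFrame({
    "departure_delay": [-37.0, -36.0],
    "deciles": [
        [-80, -66, -55, -48, -41, -35, -30, -26, -21, -10],
        [-78, -64, -53, -46, -40, -34, -29, -25, -20, -9],
    ],
})

# Break the list out into one column per percentile: 0%, 10%, ..., 90%.
labels = [f"{p}%" for p in range(0, 100, 10)]
percentiles = pd.DataFrame(df["deciles"].tolist(),
                           columns=labels, index=df.index)
df = pd.concat([df["departure_delay"], percentiles], axis=1)
print(list(df.columns))  # departure_delay plus a column per percentile
```

With the deciles as separate numeric columns, each percentile can be plotted or filtered directly.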
So, let's go ahead, and