0:04

In this video, I'm going to show two things: first, data manipulation, and second, I'm going to introduce you to the airlines data set that we're going to be using for the next few videos.

This is the help page for the Python data manipulation functions. I found it by googling for "H2O Python docs"; I think it was the third or fourth link. It's the h2o-py docs, and you want to be in the Frame section. I mention that because I'm not going to go exhaustively through all the functions; I'm just going to pick out a few.

So let's start H2O, just as we normally do.

0:58

And this is the data set: airlines, allyears2k_headers, a zip file. This is why we love you, H2O: it will take care of that and find the CSV file inside. It's described as small data, but I think it's got... oh well, let's find out how big it is. Give it a moment. There we go.

What have we got? Lots of columns, and it doesn't tell me how many rows. Let's ask: data.nrow.

1:47

Okay, 44,000 rows. So a couple of orders of magnitude more than iris. Dates, departure time, arrival time: these are absolute numbers, not so useful. What we're interested in: unique carrier. I'll just mention it because we're going to come back to that in a moment. Flight numbers. Elapsed time is interesting: this is what it should have been, and this is how long that flight actually took. How long it was in the air. How late it was arriving; I believe this is in minutes, so a negative number would mean it arrived early. And how late it was departing. Some airport names: where it came from, where it went to. How far it was flying, that kind of stuff.

And these, if we do a binomial classification, are most likely what we will be trying to learn. Was it late? Is the arrival delayed? Yes or no. If we do a regression, chances are we will be trying to predict either the arrival delay or the departure delay.

Come down a bit more. First useful function. H2O generally does a good job of detecting your column types. For instance, this one, full of yeses and nos: it detected it was an enum, also called a factor, also called a categorical. A lot of the other columns it detected as integer.

3:50

There are no floating point numbers in this particular data set; they're all integers. Unique carrier is detected as enum. So I am commenting these out because I don't actually have to do anything. But if it gets it wrong and you want to convert a numeric column to a factor, you use this command. And if it's the other way around, which is less likely, and it's made a column a factor when it should have been numeric, you use this function.
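
The two commands pointed at on screen are not captured in the transcript. As a rough sketch of what the conversion might look like from Python, assuming the `asfactor()` and `asnumeric()` column methods of `H2OFrame`: the helper name `convert_column_types` and the column name in the usage comment are illustrative only, not taken from the video.

```python
def convert_column_types(frame, to_factor=(), to_numeric=()):
    """Convert the named columns of an H2OFrame-like object in place.

    `frame` is assumed to behave like h2o.H2OFrame: frame[name] returns a
    one-column frame offering asfactor()/asnumeric(), and assigning to
    frame[name] replaces that column.
    """
    for name in to_factor:
        frame[name] = frame[name].asfactor()    # numeric -> enum (factor)
    for name in to_numeric:
        frame[name] = frame[name].asnumeric()   # enum -> numeric
    return frame


# Hypothetical usage against a running H2O cluster (not executed here):
#   import h2o
#   h2o.init()
#   data = h2o.import_file("allyears2k_headers.zip")  # local path assumed
#   convert_column_types(data, to_factor=["FlightNum"])
```

Because the helper only relies on indexing and the two conversion methods, it can be tried against a small stand-in object before pointing it at a real frame.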

Let's just hop over to R and take a look. Get the summary: a different layout to the way Python shows it. Sometimes I find the Python way easier to understand, sometimes the other way.

4:51

When you're converting data with the R API, it's almost the same. You're going to need this comma there and the comma there. And as.factor is a function, a global function I should say, and the data column you want to convert is the argument to it.

Sticking with R, we'll just run this line, trying to get the mean of the airtime column. I already know it from the summary, of course. It is, drum roll please... In the summary, airtime, there we go: 114.3. But there are 16,649 NAs, missing data. That's why we got not a number. To get around that, we say we want to ignore the NAs. This is the mean of the remaining, whatever it is, 27,000 or so rows.
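
The effect of missing values on the mean can be reproduced without H2O. This is a plain-Python sketch with made-up numbers, where `float("nan")` stands in for NA and the helper `mean_ignoring_na` is hypothetical; it only illustrates why the first call came back as not a number, and is not the H2O implementation.

```python
import math

air_time = [110.0, float("nan"), 95.0, 140.0, float("nan")]  # made-up values

# A NaN anywhere in the sum makes the whole mean NaN.
naive_mean = sum(air_time) / len(air_time)


def mean_ignoring_na(values):
    """Mean of the non-NaN values, or NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


clean_mean = mean_ignoring_na(air_time)
print(naive_mean, clean_mean)  # nan 115.0
```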

We can use the function mean; it's a synonym for h2o.mean, so this function does exactly the same. We've also got the range function. I should say these calculations are happening on your H2O server, in the cluster. If your data is really big, it's not being downloaded to the R client and the calculations done there; they're all happening remotely.

Let's just jump back to Python and see those commands. It's more object-oriented, so we select our column and run the mean function on it. And I believe that's identical. I couldn't find the range function, but you can use summary and get the min and the max.

7:12

We can see most flights are a little bit late, with a very long tail. They wouldn't have drawn the histogram out to here if there wasn't something here. We'll jump back and see that in R. There we go. Oh, this is a different field, airtime. Here we've got two bumps, telling us most flights are short and some are long flights.

7:55

We can do more than one column at a time. So, arrival delay and departure delay. Ignore this error message; that's just RStudio complaining about the plot. So the mean arrival delay was 9.3 minutes, and the average departure was 10 minutes late.

8:30

We can also ask some logical questions. This creates a logical vector, the same length as the data, the same number of rows. It will be a one if the flight was delayed more than six hours, 360 minutes; it will be a zero if less. And then we ask any: were any of the flights delayed more than six hours? A one means yes. Or, rephrasing: were all the flights delayed no more than eight hours? And we get false. Read the comment: the problem is we have NAs in this. If we get rid of the NAs, we get true. None of the flights were delayed more than eight hours.
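
The same trap shows up in plain Python, where any comparison against NaN is false. A small sketch with made-up delays in minutes, with `float("nan")` standing in for NA; it mirrors the false-then-true result described here but says nothing about how H2O implements it.

```python
import math

arr_delay = [23.0, 14.0, float("nan"), 475.0, -5.0]  # made-up, in minutes

# One flag per row: was the flight more than six hours (360 minutes) late?
over_six_hours = [d > 360 for d in arr_delay]
any_over_six = any(over_six_hours)

# NaN <= 480 evaluates to False, so the missing row makes all() fail.
all_within_eight = all(d <= 480 for d in arr_delay)

# Dropping the missing rows first gives the answer we actually wanted.
known = [d for d in arr_delay if not math.isnan(d)]
all_within_eight_no_na = all(d <= 480 for d in known)

print(any_over_six, all_within_eight, all_within_eight_no_na)  # True False True
```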

9:37

Always bear in mind your NAs, because they can give you the wrong result. I was nearly tricked, but I remembered seeing 475 for arrival delay; the maximum was 475. So I knew I was expecting a true.

Let's just say what this does. Cumulative sum: it's adding the numbers up. The first row was 23 minutes late, the next one must have been 14, 23 plus 14 is 37, and so on. Let's keep this moving.
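
The running total just described can be checked in a few lines of plain Python. The first two delays (23 and 14) are the ones read out above; the remaining values are made up for illustration, and this uses the standard library rather than H2O.

```python
from itertools import accumulate

# First two values from the video; the last two are made up.
arr_delay = [23, 14, 29, -2]

# Each entry is the sum of all delays up to and including that row.
running_total = list(accumulate(arr_delay))
print(running_total)  # [23, 37, 66, 64]
```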

You'll find this file somewhere. Come and play with it yourself afterwards. This is how to do a correlation between two columns.

10:37

And again, we need to get rid of the NAs. Arrival delay and departure delay: highly correlated. Come back to Python. That's how you do the same example there. And this time I'm doing three columns: rather than specifying two arguments, I'm giving the correlation three columns of the same data frame. It gives me this nice table.

11:11

We can see arrival delay and departure delay are correlated, but there's only a very low positive correlation with the length of a flight.
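
To make the shape of that table concrete, here is a plain-Python sketch of a three-column correlation table that first drops any row containing a missing value. The numbers are made up, the column names `ArrDelay`, `DepDelay` and `Distance` are assumptions about the data set, and the `pearson` helper is written out by hand rather than taken from H2O.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, NaN-free sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


nan = float("nan")
columns = {  # made-up values: delays in minutes, distance in miles
    "ArrDelay": [23.0, 14.0, nan, 40.0, -5.0, 2.0],
    "DepDelay": [20.0, 10.0, 5.0, 45.0, 0.0, nan],
    "Distance": [300.0, 1200.0, 450.0, 800.0, 650.0, 500.0],
}

# Keep only the rows where every column has a value.
names = list(columns)
rows = [r for r in zip(*columns.values()) if not any(math.isnan(v) for v in r)]
clean = {name: [r[i] for r in rows] for i, name in enumerate(names)}

# Every column against every other column, as a nested dict.
table = {a: {b: pearson(clean[a], clean[b]) for b in names} for a in names}
for a in names:
    print(a, [round(table[a][b], 2) for b in names])
```

The diagonal is always 1, and the table is symmetric, so only the three off-diagonal pairs carry information.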

Okay, I think that's enough. Study the manuals and get familiar with the functions, or just go hunting for a function when you find you need something.
