Cleaning a dataset is the first step to truly understanding all the possibilities it holds.
You can't hope to get to advanced insights, like building machine learning models,
without first ensuring that you're getting an accurate and complete picture of your data.
And as we discussed, exploring your dataset can also uncover data skew,
which, when discovered early,
can help prevent performance problems you could face in the future.
Lastly, we compared two ways of transforming our dataset.
For the first, you can build normal SQL statements as part of your extract,
transform, and load (ETL) process. Or, you can use Cloud Dataprep flows,
wranglers, and recipes to automatically create those pipelines for you through a UI.
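To make that first approach concrete, here is a minimal sketch of a SQL transform run as one step of an ETL job through the BigQuery Python client. This is just an illustration: the dataset, table, and column names are hypothetical stand-ins, not from the lab.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Clean and filter in plain SQL, writing the result to a new table.
    # All names below (my_dataset, filings_raw, totrevenue) are hypothetical.
    sql = """
    CREATE OR REPLACE TABLE my_dataset.filings_clean AS
    SELECT ein, SAFE_CAST(totrevenue AS INT64) AS total_revenue
    FROM my_dataset.filings_raw
    WHERE ein IS NOT NULL
    """
    client.query(sql).result()  # block until the transform job finishes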
For our next lab, let's add some transformations to that IRS data we started earlier.
Okay. Here we are with the first of the two Cloud Dataprep labs.
So again, pause this video if you haven't attempted
this lab yourself yet, and come back for this walkthrough.
So again, to review: the Cloud Dataprep tool itself is
a great pre-processing web UI that allows you to create
those pipelines, which actually run on Cloud Dataflow behind the scenes.
So let's go ahead and follow along with this lab,
whose main objective is to get data from
Google Cloud Storage ingested into Cloud Dataprep.
In preparation for the next lab,
we're going to be doing some cool transformations on that data.
And again, all of this data is from our IRS public dataset.
So let's go ahead and open up the Cloud Console, which we have
open here, and we're going to be creating a bucket.
I'm going to go ahead and copy this name that we have here.
And to do that, what we want you to do is open the Products & Services menu
here on the left and find Storage;
Cloud Storage is right here.
Let's go and see what buckets we have.
So we have the default bucket, but I'm going to find that name hard to remember,
so I'm going to go ahead and create
a brand new bucket and paste in the name that I have here.
Bucket names have to be globally unique,
so just in case there are other Evans out there who want to take that bucket name,
feel free to use your own first and last name, and keep the rest of
the defaults that show up there as they are.
Create that bucket.
Boom. That bucket is created. That's excellent!
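As an aside, if you'd rather script that step than click through the console, a minimal sketch with the Cloud Storage Python client could look like the following. The bucket name is a placeholder; remember, yours has to be globally unique.

    from google.cloud import storage

    client = storage.Client()  # uses your default project and credentials

    # Placeholder name - bucket names are globally unique across all of GCS.
    bucket = client.create_bucket("firstname-lastname-dataprep-staging")
    print("Created bucket:", bucket.name)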
And again, to review: a bucket is an amazing staging area on Google Cloud Storage,
where you can throw in as much of your raw data as you
want and then ingest it into BigQuery or Cloud Dataprep.
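For example, ingesting a CSV from a bucket into a BigQuery table can be done with a load job. This is just a sketch under assumptions: the gs:// URI and the destination table ID are hypothetical placeholders, not the lab's actual paths.

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # treat the first row as a header
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(
        "gs://your-bucket/irs_990_2014.csv",  # hypothetical path
        "your-project.irs_990.irs_990_2014",  # hypothetical table ID
        job_config=job_config,
    )
    load_job.result()  # block until the load completes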
So in this particular case,
we're going to be using this bucket to actually save
some Cloud Dataprep metadata as you're going to see in just a minute.
So we've got our bucket, fantastic.
Next step: create a new BigQuery dataset to store Cloud Dataprep output.
So we've got BigQuery already open, and what we want to do here is create a new dataset.
So let's go ahead and create a dataset, if you haven't already.
And for that dataset that we want to create under our project,
we're going to call it irs_990.
And one of the reasons why we want to create
our own personal dataset for the irs_990 data
is so we can ingest the public data tables,
do our own manipulations and transformations on our own,
in our own working space, and save those for our own personal use.
So we have that dataset, fantastic.
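If you wanted to create that same dataset programmatically instead of through the UI, a minimal sketch with the BigQuery Python client might look like this; the US location is an assumption, chosen to match where the public IRS data lives.

    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = bigquery.Dataset(f"{client.project}.irs_990")
    dataset.location = "US"  # assumed location, matching the public dataset
    client.create_dataset(dataset, exists_ok=True)  # no-op if it exists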
So now, what we want to do is launch Cloud Dataprep.
So, in the Products & Services menu, go all the way to the bottom to your big data tools;
Dataprep is all the way there at the bottom.
Let's go ahead and launch that.
Agree to the Terms of Service.
Allow and allow. And boom,
you're actually inside of Cloud Dataprep.
So, the bucket that we created earlier,
the one with the name that's a little bit easier to remember:
go ahead and specify that as the folder where you're going
to store that Cloud Dataprep metadata.
Once you've done that,
the tool itself will begin to initialize.
I'm going to zoom it out just a little bit,
and this is 100 percent zoom.
We have our project. This is the Cloud Dataprep tool.
So now we're actually getting ready to wrangle some data.
So first and foremost,
let's see where we're going to collect our data from.
So we need to create a dataprep flow.
A flow is going to govern all the steps that we need to take.
So I'm going to call this (I'm just going to
copy and paste directly from the lab here)
the IRS 990 Data Cleanup exercise.
Grab the description as well,
where we deduplicate some data,
filter it, and do some joins and merges.
Let's go ahead and create this flow, and it's going to ask you,
"Hey, do you want to import or add any new datasets?"
And absolutely we do.
So there are three different options: upload them from local files on your computer,
import them from BigQuery tables,
or import them from Google Cloud Storage.
And for the purposes of this lab,
we actually have some data for you to import from Google Cloud Storage.
So here is your Google Cloud Storage path, or your gs:// path;
let's go ahead and copy that. It's in CSV file format.
This little pencil icon up here is actually where we can specify the path,
overwriting that first gs:// that's there.
Click Go on that,
and then you'll be able to see the dataset that's ready to import here.
So this irs_990_2014, go ahead and import that.
That's the first of three, boom.
As you can see, we already have the CSV file as part of our flow.
Already, Dataprep is beginning to do a little bit of
a recipe, or wrangling, of the data; we're going to go into that in a minute.
But what we want to do is add a little bit more data.
So going to datasets,
clicking on that at the top,
we have the 2014 data.
We need to add the 2015 data as well.
Clicking on import more data,
same process as before.
Click on the pencil to edit.
Here, we're going to copy in 2015 data,
one at a time.
Verifying this is the CSV file that we want to import, clicking on import.
Now I've got 2014 and 2015. For the last of the three files,
clicking on import, you want to copy
over the organizational details table, the irs_990_ein CSV.
Same process here, editing that path,
verifying that that CSV looks correct.
Boom. Now we have all three in here.
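If you want to double-check that all three CSVs really are sitting in Cloud Storage before you wire them into the flow, a quick sketch like this lists them; the bucket name and prefix are hypothetical placeholders.

    from google.cloud import storage

    client = storage.Client()
    # Placeholder bucket and prefix - substitute the lab's actual values.
    for blob in client.list_blobs("your-lab-bucket", prefix="irs_990"):
        print(blob.name, blob.size)  # expect the 2014, 2015, and ein CSVs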
Clicking back on our flows.
You see the one we created earlier is now present.
We have the 2014 data.
We added the 2015 and the ein to our datasets but we have not yet added them to our flow.
So let's go ahead and click on Add Datasets.
I'm going to add both the organizational details or the ein
CSV and then those annual tax filings for 2015.
Selecting both those checkboxes and clicking on add.
Excellent! So now that you've gotten to this point, let's recap.
We've got three files in Google Cloud Storage.
We've specified them to actually load into this Dataprep flow that we have here.
You can add many more datasets to Cloud Dataprep,
but you may not want to add all of them to one particular flow.
So, for this flow, we just want the 2014 and 2015 data and the
organizational details, and we're going to get ready to process them in the next lab.
But just to give you a sense of what Dataprep is doing,
click on that recipe. By default, for free, on a CSV file,
it automatically performs the operations of breaking up
the rows using the newline character as
the delimiter and then splitting column one into 245 columns.
So again, this is taking a normal CSV and getting it ready to store in
a structured data format, and then it's marking row 1 as your header row as well.
So it automatically recognizes the CSV
and does some additional pre-processing even before we touch these files.
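If it helps to see what those first recipe steps amount to, here is a rough pandas equivalent, not Dataprep's actual implementation: break the file into rows on newlines, split each row on commas into columns, and promote row 1 to the header. The file path is a placeholder.

    import pandas as pd

    # read_csv splits rows on newlines and columns on the comma delimiter,
    # mirroring the recipe's first two steps.
    df = pd.read_csv("irs_990_2014.csv", sep=",", header=0)  # row 1 -> header
    print(df.shape)  # should report 245 columns if it matches the recipe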
And what we're going to be doing, as a preview of the next lab,
is editing this existing recipe and adding
a lot more data wranglers and data transformations on top of it.
Okay. So those three datasets are there and present.
And now, we're going to continue on with
the rest of the lecture and introduce you to a little bit more of
the feature set and then jump back into
Cloud Dataprep to perform the rest of the transformations.