0:16

Now, you can actually download this data. This is real data that Yelp makes available for academic purposes, and you can download it from the website they have set up for their dataset challenge. The data will be in JSON format, something that we haven't dealt with before. So we want to convert it into CSV, so that we can use the tools that we already know. To do that, I've provided a script called json2csv_business.py. This is a Python script that you can download with this course, and on your command line you can run the command python, followed by the script name and the name of the JSON file that you downloaded from Yelp.

1:15

Now keep in mind that this is a large file, so it's going to take some time to have that conversion done. All right, but you can do this outside of this environment and have the data ready in CSV format. So that's what will happen once that script is run.

Okay, now given that you've finished this, we're going to switch to RStudio, and here we'll first load some of the packages to start working on this dataset. Because we're doing this step by step, the first thing we need to do is just look at the data. To do that we need the ggplot2 library. We've done this before; this is the library for doing different kinds of visualization. Now, the next thing to do is load up the data.

2:16

So we're going to load the data, and this is the business data. Yelp has a lot of businesses, and there is data about the reviews of those businesses by customers. We're going to use the read.csv function, which allows us to load a CSV file.

3:06

Enter that. Now this is going to take a little bit of time, because it's a large amount of data, and so, again depending on how fast your machine is and how much memory you have available, don't be surprised if this takes several seconds.
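
As a rough sketch of this loading step: the tiny CSV written below and its column names (name, state, stars) are stand-ins assumed for illustration, not the real converted Yelp file. On your machine you would pass the full path of your own converted file to read.csv instead.

```r
library(ggplot2)  # plotting library used throughout this session

# Stand-in for the converted Yelp file: a tiny CSV written to a temporary
# location. The column names (name, state, stars) are assumptions.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,state,stars",
             "Cafe A,AZ,4.0",
             "Diner B,NV,3.5",
             "Shop C,AZ,4.5"), csv_path)

# read.csv() loads the whole file into a data frame held in memory,
# which is why the real, much larger file takes a while.
business_data <- read.csv(file = csv_path)

str(business_data)   # column names and types
nrow(business_data)  # number of businesses loaded
```

Checking str() right after loading is a cheap way to confirm that the columns you expect are actually there before plotting anything.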

3:22

Now the data is loaded in this business_data, so let's just go ahead and plot it. It's very easy to do that in R: we can just tell it that we want to plot this business data. And how do we want to do it? Well, let's get a bar chart where the x-axis has the state the business data is from, that is, where the business is located. And we're going to fill it using gray.
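
One way this bar chart could be written is sketched here. The toy data frame stands in for the real business_data, and the name state_plot is just an illustrative choice; the exact command typed in the session is not captured in this transcript.

```r
library(ggplot2)

# Toy stand-in for business_data (assumption: the real table has a `state` column).
business_data <- data.frame(
  state = c("AZ", "AZ", "AZ", "NV", "NV", "WI")
)

# geom_bar() counts the rows for each state and draws one gray bar per state.
state_plot <- ggplot(data = business_data) +
  geom_bar(aes(x = state), fill = "gray")

print(state_plot)
```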

4:09

And there you have it: now we have a visualization of the data that's available. Let me just expand this a little bit. If you look here on the x-axis, you have the states this data comes from.

4:26

Now, as it so happens, this particular data that Yelp has made available is mostly from Arizona. So don't be surprised that most of the businesses are in Arizona. This is not representative of Yelp as a whole; Yelp has data from all kinds of states. But this particular dataset has most of its businesses located in Arizona and Nevada. So, as we were saying, there's nothing really all that interesting here. This is simply to practice our skills with R, and it's very easy to see where things are located.

5:31

And so here's the visualization of that. One could give stars from 1 to 5, and you can see that on the x-axis we have the number of stars, and on the y-axis we have the number of businesses with that rating. All right, so there are thousands of businesses represented, and here's the distribution of stars. Maybe this is not surprising, maybe it just confirms what we already thought, but it's nice to see how easy it is, once you have the data loaded, to do these quick visualizations.
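
The command behind this plot is not spoken in the session, so the following is only a guess at its shape: a histogram of the stars column with one bin per half-star step. The toy ratings and the name stars_plot are assumptions.

```r
library(ggplot2)

# Toy ratings standing in for the `stars` column of the business data.
business_data <- data.frame(
  stars = c(1, 2.5, 3, 3.5, 3.5, 4, 4, 4, 4.5, 5)
)

# x is the rating, y is the number of businesses with that rating;
# binwidth = 0.5 gives one bar per half-star step.
stars_plot <- ggplot(data = business_data) +
  geom_histogram(aes(x = stars), binwidth = 0.5, fill = "gray")

print(stars_plot)
```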

6:34

In this case, we're going to use an expanded command for ggplot. We'll say the data is in business_data, and we're going to use x = factor(1). I'll explain this in a second. We're going to fill it with factor(stars), and so this is our more expanded ggplot command. And we're going to create a bar chart with width = 1.

7:28

And coord_polar(theta = "y"), and again I'll explain this in a second. But let's first see what happens, okay?

So what we created here is a pie chart that uses the counts for this factor. The factor in this case happens to be stars. What that means is that it's looking at this particular variable, stars, the stars assigned to each business, using that as a factor, and then counting how many businesses fall under each value of that factor: stars being 1, 1.5, 2, and so on up to 5. Those are the possible values for this factor, and what you see here are the actual counts. But of course this is a pie chart, so everything needs to fit in the circle, and everything is shown as a proportion. So here's the blue one, the largest, with the rating 4, then the light blue one with 3.5, and again this just confirms what we saw before with the bar chart. So this is just a different visualization to show similar things. Again, one of the reasons we are doing this is to practice our skills with the different visualizations available to us. As we saw before, often just being able to see the data can be very informative. And all it takes is just one command.
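
Putting the pieces of that command together, a sketch on toy data (the ratings below are made up, and the name pie is only for illustration) might look like this. The comments spell out what x = factor(1) and coord_polar(theta = "y") each contribute.

```r
library(ggplot2)

# Toy ratings standing in for the `stars` column of the business data.
business_data <- data.frame(stars = c(3.5, 4, 4, 4, 4.5, 3.5, 5, 2, 4, 3.5))

# x = factor(1): every row lands in the same single bar.
# fill = factor(stars): that bar is split into one colored segment per rating.
# coord_polar(theta = "y"): the stacked counts become angles, so the bar turns into a pie.
pie <- ggplot(data = business_data,
              aes(x = factor(1), fill = factor(stars))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

print(pie)

# The counts behind the slices.
table(business_data$stars)
```

Wrapping stars in factor() is what makes each rating its own discrete category with its own color, rather than a point on a continuous scale.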

9:19

Okay, so next we're going to look at the user data. Just as we did before, we're going to look at another file that you should have downloaded from Yelp, which is the user data. What we just played with was the business data; now we want to see things that are user related. Again, you'll find the data in JSON format, and I've provided a script called json2csv_user.py. That script can be used just as before: on the command line, run a python command with that script and the name of the JSON file.

10:08

The resulting file will be a CSV file that we can load in R just like we did before. All right, so at this point, I'm assuming that you've been able to convert the JSON file to a CSV file, and now you can load it in R, okay?

So we're going to load it as a different dataset.

10:34

We're going to call it user_data. Loading it will be similar to what we did before: the read.csv function, with file equal to the full path to where the file is, a file whose name ends in dataset_user.json.csv.

11:02

This is the file that has the user-related data, and remember, we converted it from JSON to CSV. Once we give this command, it's going to load it up. Again, it may take a few moments, because this is a large file and R is trying to load it into the R environment. So don't be surprised if it takes a little longer. And this is all we're doing here: loading a CSV file into R. Okay, now we have it loaded, so we can start playing with it just like we did before.

11:54

Okay, so let's see, let's extract some information from this whole dataset. It's a huge table, but we don't need all of it. Some of the fields that it has are things like the number of cool votes. A customer can say that a business is cool, or funny, or useful. So let's extract those user votes. The entire data is in the user_data table. Remember, this is a table, so it's got rows and it's got columns; user_data has two dimensions, and inside the brackets the dimensions are separated by a comma. If I leave both sides empty, that means everything, comma, everything, so this will get the whole table. But we don't want the whole table; we want only specific columns, though we do want all the rows. So I'm going to leave the part before the comma empty, which means get me everything. Then after the comma I want specific columns, so I'm going to use the c() function. Let's say I want cool votes, I want funny votes, and I want useful votes. These three columns and all the rows. So this is what it means: give me all the rows with these three columns from the user_data table, and put them into user_votes, okay? So now we have this subset of the data.
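
The row, comma, column indexing described here can be sketched on a toy table. The column names cool_votes, funny_votes, and useful_votes are assumed to match the converted CSV, and the values are made up.

```r
# Toy stand-in for the user table; column names are assumed to match the CSV.
user_data <- data.frame(
  user_id      = c("u1", "u2", "u3"),
  name         = c("Ann", "Bob", "Cy"),
  review_count = c(12, 3, 150),
  fans         = c(1, 0, 25),
  cool_votes   = c(4, 0, 90),
  funny_votes  = c(2, 1, 70),
  useful_votes = c(10, 2, 300)
)

# Empty before and after the comma: all rows, all columns (the whole table).
whole_table <- user_data[, ]

# Empty before the comma, c(...) after it: all rows, only the named columns.
user_votes <- user_data[, c("cool_votes", "funny_votes", "useful_votes")]

dim(user_votes)  # rows, then columns
```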

13:42

Now let's ask some questions using this. Does a user who has more fans get more useful votes? One of the fields that the user data has is the number of fans for a given user, so we can find a correlation.

14:12

We can just use user_data right away, and take the correlation of user_data$funny_votes and user_data$fans. So yes, there is a high correlation, and it's positive, between funny votes and fans. So a user who has more fans tends to have more funny votes, okay? That's probably not very surprising, but it's very easy to find that kind of correlation.
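
A minimal sketch of this correlation check, assuming base R's cor() is the function being used (the function name is not spoken in the session) and using made-up numbers in place of the Yelp columns:

```r
# Toy stand-in for two columns of the user table.
user_data <- data.frame(
  fans        = c(0, 1, 2, 5, 10, 40),
  funny_votes = c(0, 3, 1, 12, 30, 90)
)

# Pearson correlation between funny votes and fans:
# values near +1 mean the two rise together, values near 0 mean little linear relation.
fan_cor <- cor(user_data$funny_votes, user_data$fans)
print(fan_cor)
```

The same call with user_data$useful_votes in place of funny_votes would answer the question about useful votes directly.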

14:42

Okay, so we're going to do something more with this. We're going to actually look at how different things are related. We're going to create a linear model, a regression model, to do some regression analysis, some prediction, and things like that. Okay, so let's create a linear model. I'm using linear model and regression model interchangeably; they are the same thing in this case. R has a function called lm that creates a linear model. What we want to do is see if the useful_votes that somebody has are related to things like review_count and fans. And, well, actually, I already have review_count, so just this much, and the data is in user_data. So now, how do I know these names? These are columns, and if you like, you can open the CSV file; just be careful, because it could take up a lot of memory, as it is actually pretty big. The other option is to just go here and look at user_data. You can see that this is the data that you loaded, that it has half a million entries, and here are the columns. So that's how we know what those things are, and that's what we are actually working with, okay? Let's go back to where we were.

16:54

We just ran a regression model where useful_votes is our outcome, the dependent variable, and review_count and fans are our independent variables, okay? And that's done on this whole data. So this is similar to the regression that we did in Python. Now of course, we need to get the actual coefficients from this.

17:26

Well, let's do that: coefficients from my linear model. And you can list them, so this is our model. This is the coefficient that you multiply with review_count, this is the coefficient that you multiply with fans, and this is the intercept, the constant that you add in this equation. In other words, useful_votes is equal to 1.41 times review_count, plus 22.68 times fans, minus 18.25.

18:22

Before, we have seen this with only one variable; in this case, we have two variables. But this is still a linear regression, so you can have several terms all added together to create the regression equation. So this is our regression equation, and here we have the coefficient information. So if you know the review count and you know the fans, you can predict useful_votes using this.
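
To make the model-fitting and prediction steps concrete, here is a sketch on toy data. The toy outcome is built from a known equation (intercept 1, slopes 2 and 3) purely so the fitted coefficients are easy to check; those numbers are not the lecture's. The names my_lm and predict_useful_votes are assumptions, while the 1.41, 22.68, and -18.25 in the helper are the values quoted above.

```r
# Toy predictors standing in for the user table.
user_data <- data.frame(
  review_count = c(1, 5, 10, 20, 50, 100),
  fans         = c(0, 1, 0, 3, 2, 10)
)
# Toy outcome built from a known equation so the fit is easy to verify.
user_data$useful_votes <- 1 + 2 * user_data$review_count + 3 * user_data$fans

# Outcome on the left of ~, independent variables on the right.
my_lm <- lm(useful_votes ~ review_count + fans, data = user_data)

# Intercept first, then one coefficient per independent variable.
coeffs <- coefficients(my_lm)
print(coeffs)

# The regression equation quoted in the lecture, written as a function.
predict_useful_votes <- function(review_count, fans) {
  1.41 * review_count + 22.68 * fans - 18.25
}

# Predicted useful votes for a user with 10 reviews and 2 fans.
predict_useful_votes(review_count = 10, fans = 2)
```

Plugging values into the equation by hand like this is the same calculation R's predict() would perform with the fitted model.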

Now, let's do something more with this: let's actually visualize this.

19:42

So we can see this is a very, very skewed distribution, if you can even see it, showing how many reviews people write. Well, not many: almost everybody writes a very small number of reviews, so most of your data is concentrated here. So again, it is not very surprising that people write very few reviews, in the single digits or certainly fewer than 100, which is a very small part of the range you are seeing.
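
The plotting command for this step is not spoken, so this is only a sketch of how such a skewed distribution could be drawn and checked. The review counts are synthetic (drawn from a geometric distribution as a stand-in), and the names review_plot and share_under_100 are assumptions.

```r
library(ggplot2)

set.seed(42)  # make the synthetic data repeatable

# Synthetic, heavily right-skewed review counts; a stand-in, not the Yelp data.
user_data <- data.frame(review_count = rgeom(1000, prob = 0.05))

# Histogram of how many reviews each user has written.
review_plot <- ggplot(data = user_data) +
  geom_histogram(aes(x = review_count), bins = 30, fill = "gray")

print(review_plot)

# Share of users with fewer than 100 reviews, as a numeric check on the picture.
share_under_100 <- mean(user_data$review_count < 100)
print(share_under_100)
```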

20:19

Now, we want to go a little deeper into this to see how people are distributed in terms of their review_count, their fans, and so on, and it's not very clear how to analyze that. Do people have a lot of fans? Do they have different numbers of reviews? One thing to do in that kind of case is something called clustering. Clustering is an unsupervised method, which means that we don't know the labels. We don't know how people should be distributed. We just have all this user data with a lot of people in it, and these people do different things like voting on things, rating things, and writing reviews. Some people are more active, some people are less active, but we don't know exactly what to call them, we don't know how many categories we can have, or how to divide them up. So it's an unsupervised method: we're not trying to predict or put people under specific labels.

21:32

We're just trying to see how they're distributed. Clustering allows us to see how the data is distributed or how it's organized. Maybe there is some underlying organization that we're not seeing but perhaps could. We're going to use a technique called k-means. It's a popular clustering technique where k represents the number of clusters, okay? And R gives us a very easy way to do that.

22:00

So that's what we're going to work on right now. Actually, I'm going to clean this up for now. R has a function called kmeans, and you can give it the user_data that we have.

22:21

We're not interested in all the columns, all the fields; we're interested in only some of them, so let's actually look at the data to see which columns we are interested in.

22:39

This is our data. Name is not all that useful. We're looking at numbers, because numbers are what's useful here: review_count, the average, maybe the date, and so on; numerical things are useful. So 3, 4, 5, 6, 7, 8, 9, 10, 11, so we have a total of 11 columns. And actually you can even see it here: there are 11 columns and 552,000 rows. Okay, so the user ID is just a unique random ID, and name is name, so we can eliminate the first two columns. So we leave the left side of the comma empty and just take columns 3 to 11, and we'll ask for 3 clusters. I'm not sure if that's good enough, but normally three is a good starting point. And I'll put this in a variable, userCluster.
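
A sketch of this kmeans call on synthetic data follows. The toy table has only three numeric columns (3 to 5) rather than the 3 to 11 of the lecture's table, the values are randomly generated, and set.seed is added here only so the random starting centers are repeatable.

```r
set.seed(123)  # k-means starts from random centers; fix the seed for repeatability

# Synthetic users in three rough activity levels; a stand-in for the Yelp table.
n <- 20
user_data <- data.frame(
  user_id      = paste0("u", 1:(3 * n)),
  name         = paste("User", 1:(3 * n)),
  review_count = c(rpois(n, 5),  rpois(n, 200), rpois(n, 1000)),
  fans         = c(rpois(n, 1),  rpois(n, 20),  rpois(n, 150)),
  useful_votes = c(rpois(n, 10), rpois(n, 400), rpois(n, 3000))
)

# Left of the comma empty: all rows. Right of the comma: numeric columns only.
# The lecture's table uses columns 3 to 11; this toy table only has 3 to 5.
userCluster <- kmeans(user_data[, 3:5], centers = 3)

table(userCluster$cluster)  # how many users fell into each cluster
userCluster$centers         # the average user of each cluster
```

The cluster numbers themselves (1, 2, 3) carry no meaning; looking at the centers is what tells you which group is the heavy reviewers and which is the occasional ones.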

23:53

Run it, and it's finished running. What we're going to do next is visualize these clusters, and once that's done, it'll be easier to explain what's going on. So we're going to look at all the user_data, with review_count as the x-axis and fans as the y-axis. And we're going to use [INAUDIBLE]; again, this will be easier to explain once we have things in front of us, so I'm just going to write it for now.

24:46

Sorry, it's going to take a little bit of time because there's a large amount of data here. It's not just loading the data; it's also creating a visualization of that data. So again, don't be surprised that this takes some time. In the meantime, I just want to show you what really happened here. What we're doing is, first off, we run this kmeans.

25:17

We ask it to take all this data, but only columns 3 to 11, because the first two columns are not really useful in terms of representing a user. They're the ID and the name, and those are not very representative of the user. So we take all the rows, but only these columns, and we ask for three clusters.

25:51

The clustering information is stored here in userCluster, and then we take that to plot it. Now, what we're doing with the plotting is saying that all the data is to be plotted. We're going to do a two-dimensional plot with review_count as the x dimension and fans as the y dimension, and we're going to do a point-based, or scatter plot-based, chart.
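
The part of the plotting command that is inaudible above is not recoverable, so this is one plausible way to write it, not necessarily the one typed in the session: a scatter plot of review_count against fans, with each point colored by its cluster number. The data is synthetic, and storing the cluster in a new column named cluster is a choice made here for clarity.

```r
library(ggplot2)

set.seed(123)  # repeatable synthetic data and k-means start

# Synthetic users in three rough activity levels; a stand-in for the Yelp table.
n <- 20
user_data <- data.frame(
  user_id      = paste0("u", 1:(3 * n)),
  name         = paste("User", 1:(3 * n)),
  review_count = c(rpois(n, 5), rpois(n, 200), rpois(n, 1000)),
  fans         = c(rpois(n, 1), rpois(n, 20),  rpois(n, 150))
)

# Cluster on the numeric columns only, asking for three clusters.
userCluster <- kmeans(user_data[, 3:4], centers = 3)

# Attach each user's cluster number (1, 2, or 3) to the table.
user_data$cluster <- userCluster$cluster

# One point per user; the numeric cluster id drives a continuous color scale,
# which is why the groups show up as shades of a single color.
cluster_plot <- ggplot(data = user_data,
                       aes(x = review_count, y = fans, colour = cluster)) +
  geom_point()

print(cluster_plot)
```

Mapping colour to factor(cluster) instead would give three clearly distinct colors rather than three shades.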

26:27

And so here's the visualization of those clusters. I know it took a little while, and if it takes too long for you, or if for any reason it fails, chances are you don't have enough memory or enough processing power on your computer. In that case, what I would suggest is, instead of using this whole user_data, take a sample of it. So create a subset, and I'll leave that as homework for you. We've already seen how to create a subset, but this time create a subset based on some condition that will get you a smaller part of this whole dataset. There are half a million rows, so that's a lot of data points, and unfortunately this is something where you need more processing power and more memory on your computer. If you don't have that, then I recommend taking the subset.
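
As a starting point for that homework, here are two possible ways to shrink the table, sketched on synthetic data. The threshold of 10 reviews and the sample size of 1,000 are arbitrary choices for illustration, and the names active_users and user_sample are assumptions.

```r
set.seed(1)  # repeatable synthetic data and sample

# Synthetic stand-in for the large user table.
user_data <- data.frame(
  review_count = rgeom(10000, prob = 0.05),
  fans         = rpois(10000, 2)
)

# Option 1: a subset based on a condition. The condition goes before the comma
# (which rows to keep); after the comma stays empty (keep all columns).
active_users <- user_data[user_data$review_count >= 10, ]

# Option 2: a random sample of 1,000 rows, which keeps the overall shape of the data.
user_sample <- user_data[sample(nrow(user_data), 1000), ]

nrow(active_users)
nrow(user_sample)
```

A condition-based subset changes which users you are describing, while a random sample keeps the same population at a smaller size, so the two can lead to different clusters.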

27:27

But whatever you do, hopefully you will have something like this. Now that we have this clustering visualization, I can explain what we did. Okay, so we have this two-dimensional plane. On the x-axis we have review_count, and on the y-axis we have fans. You know that this is a scatter plot, so each point shows us a user and the corresponding review count and number of fans for that user. Now of course, what we've done in addition to that is actually create a clustering. We did the clustering, and if you remember, we ran k-means and asked for three clusters, okay?

28:12

We asked for the data to be organized, somehow separated, into three groups. That's what k-means did, and now, when you're doing the visualization, this is what it means. We said this is the x-axis, this is the y-axis, and then we said to color each dot using the clustering information, okay? userCluster is where our whole clustering data is. That's where the whole thing was generated, and cluster represents a number. Because we asked for three clusters, we have three different numbers, and that means three different values for the color. Those three different values are used as an ID for a color, and that's why we see three different colors here. So that's a very nice, easy way to visualize our clustering. You can see that there's a light blue, a dark blue, and a medium blue color, and those indicate the three different groups that this clustering has organized. And while it's kind of difficult to see, you can imagine that this medium blue cluster is very sparse: it's people with very high review count and fan count. Then we have things in the middle with moderate review count and moderate fan count, and then we have the light blue with very low review count and fan count. Okay, so that's what we did. Let's go back to our clustering. This is the clustering that we just talked about. And with that, we conclude this session.

What we saw here is how to use the read.csv function to load CSV data in R. Now, we're not always lucky enough to have CSV data.

30:20

In this case we got the data in JSON, but we also had help converting that JSON into CSV. That's another thing that could happen: if you get data from somewhere else in some other format, you will either have to write a program yourself or, in most cases, you can find an existing program or script that will convert it into CSV that you understand.

30:39

We saw how to use the ggplot library to plot the data; it's very easy to create a bar chart or a histogram. There are a lot of functions, and a lot of options for those functions. We didn't look at all of them, but at least you know where to start. We did correlation analysis to see if variables are related somehow.

31:09

And once we found that there is some correlation, we did regression analysis to see how exactly those variables are related. Remember, regression gives us the coefficients and the constant that make up what's called the regression model, or the regression line, and using that information you can then calculate the outcome values.

31:33

And there are times when we just want to see if there's any underlying organization in the data. For that, clustering is a great approach. It's an unsupervised learning technique, and the algorithm that we used was k-means. What it does is simply provide us with some kind of organization. In some cases it's very clear, in other cases it's not, but it can become a starting point from which we can start formulating some hypotheses, right? So with that, we end this session on using R for social media data analysis.
