0:03

In this case study, I'm going to talk more about exploratory data analysis techniques, and how to use them on a data set that involves using smartphones to predict human activities. Remember, with any exploratory data analysis, you have to have a sense of what you're looking for and what the key priorities are that you want to get out of your data set. That will help guide what you look at and how you approach it.

Remember that the basic idea of exploratory data analysis is to produce a rough cut of the kind of analysis that you ultimately want to do. It isn't going to be perfect, and it's not going to have all the bells and whistles. But it's going to give you a rough idea of what kind of information you'll be able to extract from your data set, which questions you can feasibly answer, and which questions might not be possible to answer with the given data set.

So exploratory data analysis is really important because it rules out certain questions and pushes you in other directions. It gives you that rough-cut analysis that can take you to the next step. So let's take a look at the Samsung data set in this example and see what we can find.

1:19

The data set here comes from the University of California, Irvine (UCI) machine learning archive, and it's based on predicting people's movements from Samsung Galaxy phones. Here's a picture of the Samsung Galaxy S3. The actual data set was produced using the Galaxy S2, but the idea is basically the same. In each of these phones there's an accelerometer and a gyroscope, which help you understand the three-dimensional position and acceleration of a person, assuming that they are holding their phone.

1:57

So this is where the data set comes from: the UCI machine learning repository. You can go to the link to learn a little more about the data set, how it was collected, and what is available on the website. I've downloaded a subset of the data, which is just the training data set, for the purposes of this lecture.

2:18

The data have been processed a little bit to make them easier to use. Basically you get a matrix that has the observations on the rows and the various features on the columns. You can see at the bottom here that I've got the activity label, which for each row tells you what the person was doing at that time. There are six possible activities: laying, sitting, standing, walking, walking down and walking up.

2:49

The idea is that you want to be able to separate out these six activities based on the many features that are collected by the accelerometer and the gyroscope. I've listed the first 12 features here. You can see that they cover body acceleration: the mean, the standard deviation, the mean absolute deviation, and the maximum.

3:15

One thing we can do really quickly is look at the average acceleration for the first subject. The first thing I'm going to do is convert the activity variable into a factor variable using the transform function. Then I'm going to subset out the first subject, so subject equals one, and for the rest of this presentation I'm going to ignore the other subjects.

If I plot the first subject, I can look at the first column, which is the mean body acceleration in the x direction. Acceleration is divided into three dimensions: x, y, and z.
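The lecture does this step in R with transform and a subset. As a rough sketch of the same idea in plain Python (not the lecture's code), the block below encodes the activity labels as integer codes, much like a factor, and keeps only the rows for subject one. The rows and their values are made up for illustration.

```python
# Hypothetical rows: (subject id, activity label, mean body acceleration in x).
# The values are invented for illustration; they are not from the real data set.
rows = [
    (1, "standing", 0.28),
    (1, "walk", 0.31),
    (2, "sitting", 0.27),
    (1, "laying", 0.22),
    (3, "walk", 0.35),
    (1, "walk", 0.19),
]

# Factor-like encoding: each distinct label gets an integer code, with the
# levels in sorted order.
levels = sorted({activity for _, activity, _ in rows})
codes = {label: i for i, label in enumerate(levels)}

# Keep only the first subject, as in the lecture.
sub1 = [(codes[activity], x) for subject, activity, x in rows if subject == 1]

print(levels)
print(sub1)
```

The integer codes are what make color coding by activity easy later on, since each code can index into a list of colors.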

3:52

The second plot here is the mean body acceleration in the y direction, and I've color coded each of the activities. You can see, for example, on the left-hand plot that there's green, red, black, blue and some other activities. Part of the problem with the left-hand plot is that you can't tell which activity is which, so on the right-hand plot I added a legend, using the legend function, so you can figure out which activities correspond to which color. You can see that green is standing, red is sitting, black is laying, etc.

You can also see that the mean body acceleration is relatively uninteresting for things like standing, sitting and laying. But for things like walking, walking down and walking up, there's much more variability in the mean body acceleration in the x direction.

4:50

We can try to cluster the data just on the average acceleration. I've taken just the first three columns of this matrix and calculated a distance matrix using the dist function, with Euclidean distance as the default. Then I call the hclust function to do a hierarchical clustering of these data, and I've called this myplclust function just to visualize it. You can see that the clustering is a little bit messy, and there isn't any clear pattern going on. All the colors are jumbled together at the bottom. So we might need to look a little further to try to extract more information out of here.
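To make the distance-matrix-then-cluster step concrete, here is a small sketch in plain Python rather than the R dist and hclust calls used in the lecture. It assumes complete linkage, which is the default method of R's hclust; the transcript itself does not name the linkage. The three-column rows are made-up stand-ins for the mean acceleration columns.

```python
import math

def dist_matrix(points):
    """Pairwise Euclidean distances between all rows."""
    return [[math.dist(p, q) for q in points] for p in points]

def hclust_complete(points, n_clusters):
    """Agglomerative clustering with complete linkage, stopped at n_clusters."""
    d = dist_matrix(points)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest pair of members.
                link = max(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Made-up three-feature rows; not values from the real data set.
points = [
    (0.10, 0.00, 0.05),
    (0.12, 0.01, 0.04),
    (0.90, 0.80, 0.70),
    (0.95, 0.82, 0.74),
    (0.11, 0.02, 0.06),
]
print(hclust_complete(points, 2))  # [[0, 1, 4], [2, 3]]
```

On toy data like this the two groups are obvious. The point in the lecture is that with only the mean acceleration columns, the real activities do not separate this cleanly.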

5:26

Another thing we can look at is the maximum acceleration for the first subject. Here I'm plotting columns 10 and 11. Column 10 is the maximum body acceleration in the x direction, and column 11 is the maximum body acceleration in the y direction. You can see that again, for things like laying, standing and sitting, there's not a lot of interesting things going on, but for walking, walking up and walking down, the maximum acceleration shows a lot of variability. So that may be a predictor of those kinds of activities. But maybe it's only really separating not moving from moving, which might be kind of obvious in retrospect.

If you cluster the data based on maximum acceleration, you can see that there are two very clear clusters. On the left-hand side you've got the various walking activities, and on the right-hand side you've got the various non-moving activities: laying, standing and sitting. Beyond that, things are a little bit jumbled together. There's a lot of turquoise on the left, so that's clearly one activity, but the blue and the magenta are mixed together.

6:35

So clustering based on maximum acceleration seems to separate out moving from non-moving. But once you get within those clusters, for example within the moving cluster or within the non-moving cluster, it's a little hard to tell what is what based just on maximum acceleration.

6:54

We can try a singular value decomposition on this data, just to explore what's going on. I'm going to do the SVD on the entire matrix, which has 560-something columns, but before I do, I'm going to remove the last two columns. They are just the activity identifier and the subject identifier, which are not really interesting data. So I get rid of columns 562 and 563, and then I run the SVD on the data.

7:19

I'll take a look at the first and second left singular vectors and color code them by activity. Again, you can see there's a similar type of pattern. The first singular vector really seems to separate out the moving from the non-moving activities: you can see green, red and black on the bottom, and blue, turquoise and magenta on the top.
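As a rough illustration of why the first left singular vector can split moving from non-moving rows, the sketch below approximates it with power iteration in plain Python. This is a stand-in for R's svd, not how svd computes its result. The rows are made up, and centering each column before the decomposition is an assumption here, since the transcript does not say how the data were preprocessed.

```python
import math

def first_left_singular_vector(X, iters=100):
    """Approximate the leading left singular vector of X by power iteration."""
    n, p = len(X), len(X[0])
    u = [float(i + 1) for i in range(n)]  # arbitrary fixed starting vector
    for _ in range(iters):
        v = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]  # X^T u
        u = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]  # X v
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
    return u

# Made-up rows: three features, then the activity label and the subject id.
rows = [
    [0.1, 0.2, 0.1, "standing", 1],
    [0.2, 0.1, 0.2, "sitting", 1],
    [0.1, 0.1, 0.1, "laying", 1],
    [2.0, 1.8, 2.2, "walk", 1],
    [2.1, 2.0, 1.9, "walkup", 1],
    [1.9, 2.2, 2.0, "walkdown", 1],
]

# Drop the last two columns (activity and subject) before decomposing.
features = [r[:-2] for r in rows]

# Assumption: center each column first.
means = [sum(col) / len(col) for col in zip(*features)]
centered = [[x - m for x, m in zip(r, means)] for r in features]

u1 = first_left_singular_vector(centered)
for r, score in zip(rows, u1):
    print(f"{r[-2]:9s} {score:+.3f}")
```

In this toy case the non-moving rows get scores of one sign and the walking rows get the other, which is the same kind of split described for the first singular vector above. The overall sign of a singular vector is arbitrary, so only the grouping matters.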

7:41

With the second singular vector, it's a little more vague what it's looking at. It seems to be separating out the magenta color from all the other clusters, and I think this is walking down or walking up, one of those two. It's not clear what is different about that activity that gets highlighted on the second singular vector here.

8:19

... is producing the most variation, or is contributing the most to the variation between the different observations. We can use the which.max function to figure out which of the 500 or so features contributes most to the variation across observations, and I save that to an object called maxContrib. Then I'll cluster based on the maximum acceleration plus this extra feature: I calculate the distance matrix and run the hclust function.

You can see now that the various activities seem to be separating out a little bit more. At least the three movement activities have clearly been separated: we've got the magenta, the dark blue and the turquoise all separated out. The various non-moving activities still seem to be mixed together, so whatever this maximum contributor happened to be, it didn't really help to separate out the non-moving activities. But it seemed to help a lot in terms of separating out the movement activities.
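The which.max step can be sketched in plain Python as picking the index of the largest weight and then adding that column to the ones already used for clustering. The transcript does not say which vector the weights come from, so the weights below are a hypothetical stand-in, as are the feature names and the choice of columns 0 and 1 to play the role of the maximum acceleration columns.

```python
# Hypothetical feature names and weights, one weight per feature.
# None of these numbers come from the real data set.
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"]
weights = [0.02, -0.10, 0.45, 0.07, -0.30]

# Like R's which.max: index of the largest value (the first one on ties).
max_contrib = max(range(len(weights)), key=lambda j: weights[j])

# Columns to cluster on: the ones chosen earlier plus the extra feature.
selected = [0, 1, max_contrib]

# Made-up data matrix with one column per feature.
X = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0, 10.0],
]
subset = [[row[j] for j in selected] for row in X]

print(feature_names[max_contrib])
print(subset)
```

Note that this takes the largest signed weight, as which.max does. Ranking by absolute value instead would be a different choice and would pick up large negative weights as well.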

9:25

So this max contributor was the mean body acceleration in the frequency domain for the z direction. This was the body acceleration for the z direction where they applied a Fourier transform, which gives you the frequency components. So that's kind of interesting.

We can try another clustering technique here, which is k-means clustering. One of the things you have to be a little careful about with k-means clustering is that you can get different answers depending on how many starting values you've tried and how often you run it. When you start k-means, it has to choose a starting point for where the cluster centers are, and most algorithms will choose a random starting point. If you choose a random starting point, you may get to a solution that is suboptimal; if you choose a different starting point, you may get to an even better solution. So it's usually good to set the nstart argument to be more than one, so you start at many different starting points and can get the optimal, or a more optimal, solution.
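The effect of trying several starting points can be sketched with a small k-means written in plain Python. This is an illustration of the idea behind the nstart argument, not R's kmeans implementation: each run starts from randomly chosen rows, and the run with the smallest total within-cluster sum of squares is kept. The function names, the seed parameter and the toy points are all assumptions made for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def one_run(points, k, rng, iters=50):
    """One k-means run from random starting centers: (wss, labels, centers)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return wss, labels, centers

def kmeans(points, k, nstart=1, seed=0):
    """Run k-means nstart times and keep the run with the smallest wss."""
    rng = random.Random(seed)
    return min((one_run(points, k, rng) for _ in range(nstart)),
               key=lambda r: r[0])

# Made-up points forming three well-separated groups.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),
    (0.0, 5.0), (0.1, 5.2), (0.2, 5.1),
]
one = kmeans(points, 3, nstart=1)
many = kmeans(points, 3, nstart=50)
print(round(one[0], 3), round(many[0], 3))
```

With the same seed, the first of the 50 runs is the same as the single run, so the result with more starts can never be worse than the result with one.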

So here is one clustering that we've done with k-means. I've specified six centers: I know that there are six clusters, so I'll just specify them right away. You can see that some of the clusters are jumbled together. Cluster three is a combination of laying, sitting and standing, whereas cluster one is clearly walking, cluster two is walking down, cluster four is walking up, and cluster five is just walking. And again, cluster six is a mixture of laying, sitting and standing. So you can see that k-means here also had a little trouble separating laying, sitting and standing from one another in the clusters.
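One way to read off which activities land in which cluster is a cross-tabulation of cluster number against activity label. The sketch below does that in plain Python with a Counter; the cluster assignments and labels are made up, and they are not the results shown in the lecture.

```python
from collections import Counter

# Hypothetical cluster assignments and true activity labels for eight rows.
clusters = [1, 1, 2, 3, 3, 3, 2, 3]
activity = ["walk", "walk", "walkdown", "laying",
            "sitting", "standing", "walkdown", "sitting"]

# Count how often each (cluster, activity) pair occurs.
table = Counter(zip(clusters, activity))

for cluster in sorted(set(clusters)):
    row = {a: n for (c, a), n in table.items() if c == cluster}
    print(cluster, row)
```

A cluster whose row has a single non-zero count corresponds cleanly to one activity, while a row spread over several activities is one of the jumbled clusters.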

11:13

If you try it again, you can see the arrangement is a little different. But again, cluster two, for example, is a mixture of laying, sitting and standing, and cluster five is similarly a mixture of sitting and standing, while the other activities seem to cluster out very easily.

11:38

You see that things seem to separate out a little bit better, though not much better than last time. Cluster one is again a mixture of laying, sitting and standing. Cluster two is clearly laying, cluster three is clearly walking, and cluster four is walking down, so you can see how these things cluster together. I'll do a second try with 100 starting values, and this is probably going to be our best effort. Cluster six is still a mixture of three activities, and cluster five is a mixture of two.

We can also see where the cluster centers are. The idea is that each of the clusters has a mean value, or center, in this 500-dimensional space. So we can see which of these 500 features seem to drive the location of the center for a given cluster. That will give us some idea of which features seem to be important for classifying observations in that cluster. In the first cluster here, which seems to correspond to laying, you can see that the center has relatively high, positive values for ...

13:05

... corresponds a little bit more, has some more interesting values for other features: there's mean acceleration, and there's also max acceleration, which seems to have some interesting values. So one of the things you can do by looking at the cluster centers is see which features have interesting values that drive the location of that center, which could give you a hint about which features will be most useful for predicting that activity.
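Looking at a cluster center can be sketched as taking the column-wise mean of the rows in that cluster and then ranking features by how far the center sits from zero. This is a plain Python illustration with made-up rows and hypothetical feature names; ranking by absolute value is one possible reading of "interesting values", not something the lecture specifies.

```python
def cluster_center(rows):
    """Column-wise mean of the rows assigned to one cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def top_features(center, names, n=2):
    """Names of the n features with the largest absolute center values."""
    order = sorted(range(len(center)), key=lambda j: abs(center[j]), reverse=True)
    return [names[j] for j in order[:n]]

# Hypothetical feature names and rows belonging to a single cluster.
names = ["mean_acc_x", "mean_acc_y", "max_acc_x", "max_acc_y"]
cluster_rows = [
    [0.2, -0.1, 0.9, 0.1],
    [0.4, 0.1, 0.7, -0.1],
]

center = cluster_center(cluster_rows)
print(top_features(center, names))  # ['max_acc_x', 'mean_acc_x']
```

Ranking like this assumes the features are on comparable scales; otherwise a feature can dominate the center simply because of its units.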

So this is just a short demonstration to show how you can take a large data set with lots of features and lots of observations and start to explore it with various clustering techniques. We used hierarchical clustering and k-means clustering, and we used the singular value decomposition to look at various features of this data set.

Given what we've learned here, we may be interested in following up on what separates out the various non-movement activities. In terms of laying, sitting and standing, we seem to have some difficulty, at least at first glance, separating those three activities out. The movement activities, walking, walking up and walking down, we seem to be able to separate into separate clusters using just a few variables, most of them max acceleration variables. But the non-movement activities seem to be harder to separate out.

So the nice thing about exploratory data analysis is that it gives you this rough cut that tells you where to spend your energy. You may not have to spend too much energy on the movement activities, but maybe you need to dig a little deeper into the non-movement activities. I hope you find this useful in terms of how to get started using clustering techniques, how to get a look at the data, and how to further your analysis and get going on a more formal analysis.
