0:15

Hello. Welcome to Lesson Two in Module 16.

This lesson is going to introduce how to perform statistical anomaly detection.

The idea here is that you can use visual or descriptive statistical techniques to find anomalies in a dataset. We're going to look at some of the ways you can do this, so that by the end of this lesson you should be able to explain how visual and statistical methods can be used to find anomalies, for instance fraud in a large dataset. Imagine credit card transactions where you're trying to find fraud. You'll also learn how to develop visualizations in Python that can be used to identify anomalies, and how to apply statistical techniques in Python to find anomalies.

This lesson only has the course notebook. There are three different things that I want you to work on. The first is visual analysis. The second is the actual types of outliers you might see. And the last is statistical approaches to finding anomalies.

So the first thing is actually using visualizations. Now typically, what you do with visualizations is explore the data and try to understand what the data is telling you. Often we look for clusters, or clumps of data, or modes in a histogram, to understand where most of the data is. But with outlier detection or anomaly detection, we're typically looking not for where most of the data is, but for where most of the data isn't: where those small features are that sit off by themselves.

So we could look at this and say, look, there's clearly most of the data right here if we put a box around it. We're going to be able to take that data out and then look at the low density regions, where there might be outliers.

So we do this with a histogram. We see our histogram, and we can of course apply a KDE to it. That's what we're seeing here: a histogram of the four different Iris dataset features. You can see the histograms here. Notice that some of them are bimodal. You might be tempted to think these are outliers, but remember, outliers are data that sit off by themselves. A second mode is still quite a bit of data, probably not an outlier.
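To make the "look where the data isn't" idea concrete, here is a small sketch of the histogram approach. This is my own synthetic example, not the notebook's Iris code: we bin a feature and then flag any observation that lands in a nearly empty bin. The bin count and the low-density threshold of 2 are judgment calls, not fixed rules.

```python
import numpy as np

rng = np.random.default_rng(42)

# A feature with a dense central clump plus a few stray points.
data = np.concatenate([rng.normal(5.0, 0.5, 500),   # bulk of the data
                       np.array([9.5, 9.8, 0.4])])  # isolated extremes

# Bin the data; bins with very few counts are the low-density regions.
counts, edges = np.histogram(data, bins=20)
low_density = counts <= 2  # threshold is a judgment call

# Flag any observation that falls in a low-density bin.
bin_index = np.clip(np.digitize(data, edges) - 1, 0, len(counts) - 1)
flagged = data[low_density[bin_index]]

print(sorted(flagged))
```

Note that a few genuine tail points may also be flagged; as the lecture says, an extreme value is not automatically a bad data point.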

Now we can look at this in two dimensions as well, and here we're seeing the two-dimensional distribution. There's a big clump of data here and a nice clump of data here in the sepal features. In the petal features it's the same thing: a big clump, but a nice relationship between them.
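The same low-density idea carries over to two dimensions. Here is a sketch (again my own synthetic data, standing in for a pair of Iris features) that bins the plane with `np.histogram2d` and flags points sitting in cells that hold almost nothing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated features (think petal length vs. petal width)
# plus a couple of points far from the main clump.
x = rng.normal(4.0, 0.6, 400)
y = 0.4 * x + rng.normal(0.0, 0.15, 400)
x = np.append(x, [8.5, 0.5])
y = np.append(y, [0.2, 3.5])

# Bin the plane; cells with very few counts are the low-density
# regions where anomalies tend to live.
counts, xedges, yedges = np.histogram2d(x, y, bins=12)
ix = np.clip(np.digitize(x, xedges) - 1, 0, counts.shape[0] - 1)
iy = np.clip(np.digitize(y, yedges) - 1, 0, counts.shape[1] - 1)
suspicious = counts[ix, iy] <= 1  # cells holding a single point

print(x[suspicious], y[suspicious])
```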

So what about outlier types? There are a lot of ways you can get outliers, and not thinking just about fraud but in general, there are many ways you may get data that are outliers. An outlier doesn't have to mean it's a bad data point; I want to emphasize that. It could simply be that you have an extreme value. Perhaps the data follow an expected distribution, but they just have a lot of scatter for some reason. Maybe the process by which the data were measured was a high noise process. It's like when you're on your cell phone and you get into an area where the signal is not very good: the call gets very noisy, and you could still be talking, but it's very hard to hear. That's a high noise environment, and that's what we mean by extreme values.

Sometimes humans make errors. You may have somebody entering data into a spreadsheet who puts the values in the wrong column, and so we get transcription errors. We could also have incorrect measurements: somebody simply makes the measurement wrong, or calibrates a machine wrong.
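To make these error types concrete, here is a sketch of how each one might show up in a single feature. This is my own illustration with made-up numbers, not the notebook's code: we corrupt a clean sample with high-noise measurements, a column mix-up, and a wrong-units error.

```python
import numpy as np

rng = np.random.default_rng(1)

# A clean feature: e.g. lengths measured in centimeters.
clean = rng.normal(5.0, 0.4, 200)

# 1. Extreme values from a high-noise measurement process:
#    same mean, but much larger scatter.
high_noise = rng.normal(5.0, 2.0, 10)

# 2. Transcription error: values from an unrelated column
#    (say, a weight in grams) pasted into this one.
wrong_column = rng.normal(150.0, 10.0, 5)

# 3. Incorrect measurement: right process, wrong units
#    (millimeters instead of centimeters -> 10x too large).
wrong_units = rng.normal(5.0, 0.4, 5) * 10.0

contaminated = np.concatenate([clean, high_noise, wrong_column, wrong_units])
print(contaminated.std(), clean.std())
```

A handful of bad rows is enough to inflate the standard deviation dramatically, which is exactly why the robust statistics later in the lesson matter.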

So what I've done here is show the visualizations that we saw before, but now we've added different types of bad data: the high noise data, the incorrect column data, and the wrong units data. When you look at this, you can see the regular data here, and then this new data, the high noise data, is out by itself, which makes it a little easier to detect. You might say, well, the wrong units data is all out here. Sometimes it's easier to find these when you look at them in a joint distribution. So if we were to look at sepal width against sepal length, we might see that, and we'll explore that more in the future.

The last thing I wanted to emphasize here was statistical detection. One way you might do this is to say, let's take the basic statistics of our features and see what they are. Then, if we have outliers, we can compute trimmed statistics. The idea is that if we remove the instances that are at the edges of our distribution, and compute a robust mean and a robust standard deviation, it may be easier to identify outliers.

So that's what we've done here. We calculate the median, the mean, and the standard deviation for our data, and then we calculate the trimmed mean and the trimmed standard deviation. Then we add the noisy data into our sample and compute the same values. You can see that the mean and median are still pretty consistent, but the standard deviation is much higher. And when we trim the noisy data, the mean doesn't change that much, but the standard deviation drops a lot, although it's still higher than it was for the original data. So this gives you a handle on how noise may be impacting your statistical measurements.
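Here is a sketch of that comparison, on synthetic data rather than the notebook's sample. `scipy.stats.trim_mean` drops a fraction from each tail before averaging; the trimmed standard deviation is computed with a small hypothetical helper, since the trimming fraction (10% here) is a choice, not a standard.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
clean = rng.normal(5.0, 0.4, 500)
noisy = np.concatenate([clean, rng.normal(5.0, 4.0, 25)])  # add high-noise points

def trimmed_std(a, frac=0.1):
    """Std. dev. after dropping the lowest and highest `frac` of values."""
    a = np.sort(a)
    k = int(len(a) * frac)
    return a[k:len(a) - k].std()

for label, data in [("clean", clean), ("noisy", noisy)]:
    print(label,
          np.median(data),
          data.mean(), data.std(),
          stats.trim_mean(data, 0.1),  # drop 10% from each tail
          trimmed_std(data))
```

As in the lecture: the mean and median barely move when noise is added, the raw standard deviation jumps, and the trimmed statistics pull it back down.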

So we can then make a plot, and we can see our dataset, that same distribution. We can apply our trimmed statistics with two sigma and three sigma lines, and we can say, let's do a three sigma cut and throw out any data outside that. That would be an example of trying to remove what we think are bad data. So hopefully that gives you a bit of a feel for how that works.
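As a sketch of the cut itself (synthetic data again, and a 10% trim chosen for illustration): estimate a robust center and scale from the middle of the distribution, then keep only the points within three of those sigmas. Note that trimming shrinks the sigma estimate somewhat, so a cut like this is on the aggressive side.

```python
import numpy as np

rng = np.random.default_rng(3)
clean = rng.normal(5.0, 0.4, 500)
data = np.concatenate([clean, [12.0, -2.0, 11.5]])  # a few bad points

# Robust center/scale from the middle of the distribution only,
# so the bad points don't inflate our estimate of sigma.
s = np.sort(data)
k = int(len(s) * 0.1)
core = s[k:len(s) - k]
mu, sigma = core.mean(), core.std()

# Three-sigma cut: keep points within mu +/- 3*sigma.
keep = np.abs(data - mu) <= 3 * sigma
cleaned = data[keep]
print(len(data), len(cleaned))
```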

But now we can look at this in a different dimensionality: we can look at it in two dimensions. And when we do it in two dimensions, you can see that the bad data that were really hard to pull out before are now really easy to see, and the high noise data, same thing. So you can imagine trying to build a representation of what this data typically looks like; it would mostly be right here. If we were to draw a circle and pull out that high density region, we would get rid of a lot of the anomalies, the different types of outliers.
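The "draw a circle" idea can be sketched very simply in code. This is my own toy version with made-up points: take a robust center for the two features and keep everything within a fixed radius of it. The radius here is a hypothetical value that, in a real analysis, you would choose by looking at the plot.

```python
import numpy as np

rng = np.random.default_rng(5)

# Main clump of normal data plus a few scattered anomalies.
good = rng.normal([5.0, 1.7], [0.4, 0.2], size=(300, 2))
bad = np.array([[9.0, 0.2], [0.5, 3.0], [8.5, 3.5]])
pts = np.vstack([good, bad])

# "Draw a circle" around the high-density region: keep points
# within a fixed distance of a robust (median-based) center.
center = np.median(pts, axis=0)
dist = np.linalg.norm(pts - center, axis=1)
keep = dist < 2.0  # radius chosen by eye in a real analysis

print(keep.sum(), len(pts))
```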

Now keep in mind, when you're looking at real data, you don't have the outliers marked for you. But you would still see data out here and say, these probably aren't right. And this is suspicious because it's all down here in this clump by itself. So you probably would come in and say, these are the good data. You'd still have some anomalies, but you'd get rid of most of them.

So hopefully I've given you a feel for how statistical and visual detection can be used to try to identify outliers or anomalies, and you've gotten a feel for how you might be able to apply that in a more general setting.

If you have any questions, let us know. And of course, good luck.