0:08

A typical data science project will be structured in a few different phases

that I'll talk about separately in this lecture.

So there are roughly five different phases that we can think about

in a data science project.

The first phase is the most important phase,

and that's the phase where you ask the question and

you specify what it is that you're interested in learning from data.

Now, specifying the question and refining it over time is really

important because it will ultimately guide the data that you obtain and

the type of analysis that you do.

Part of specifying the question is also determining the type of question that

you are going to be asking.

There are roughly six types of questions that you can

ask, going from descriptive, to exploratory, to inferential,

to causal, to predictive, and mechanistic.

And so figuring out what type of question you're asking and

what exactly the question is, is really influential.

And so you should spend a lot of time thinking about this.

Once you've figured out what your question is,

typically you'll get some data.

Now, either you'll have the data or you'll have to go out and get it somewhere, or

maybe someone will provide it to you, but the data will come to you.

And then the next phase will be exploratory data analysis.

So this is the second phase, and there are two main goals to exploratory data analysis.

The first is you want to know if the data that you have is suitable for

answering the question that you have.

This will depend on a variety of factors, ranging from very basic things

like is there enough data, are there too many missing values, things like that,

to more fundamental ones, like are you missing certain variables or

do you need to collect more data to get those variables, etc.
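
As a rough illustration of the basic suitability checks just described (enough data, too many missing values, missing variables), here is a minimal Python sketch. The function name, the thresholds, and the example columns `dose`, `response`, and `age` are all hypothetical and not from the lecture; what counts as "enough" or "too many" depends on the question being asked.

```python
from typing import Any


def suitability_report(
    rows: list[dict[str, Any]],
    required_columns: list[str],
    min_rows: int = 30,                 # assumed threshold, purely illustrative
    max_missing_fraction: float = 0.2,  # assumed threshold, purely illustrative
) -> dict[str, Any]:
    """Run basic checks on whether a dataset could support the question."""
    n_rows = len(rows)

    # Which variables appear anywhere in the data?
    present: set[str] = set()
    for row in rows:
        present.update(row.keys())
    missing_columns = [c for c in required_columns if c not in present]

    # Fraction of missing (None) values for each required variable we do have.
    missing_fraction = {}
    for col in required_columns:
        if col in present:
            n_missing = sum(1 for row in rows if row.get(col) is None)
            missing_fraction[col] = n_missing / n_rows

    too_sparse = [c for c, f in missing_fraction.items() if f > max_missing_fraction]
    enough_rows = n_rows >= min_rows

    return {
        "n_rows": n_rows,
        "enough_rows": enough_rows,
        "missing_columns": missing_columns,
        "missing_fraction": missing_fraction,
        "too_sparse": too_sparse,
        "suitable": enough_rows and not missing_columns and not too_sparse,
    }


# Hypothetical toy data: "age" was never collected, "response" is half missing.
rows = [
    {"dose": 1.0, "response": 2.1},
    {"dose": 2.0, "response": None},
    {"dose": 3.0, "response": 6.2},
    {"dose": 4.0, "response": None},
]
report = suitability_report(
    rows, ["dose", "response", "age"], min_rows=3, max_missing_fraction=0.25
)
print(report)
```

A report like this only flags the mechanical problems; whether the variables you do have are the right ones for the question is still a judgment call.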

The second goal of exploratory data analysis is

to start to develop a sketch of the solution.

And so if the data are appropriate for

answering your question, you can start using them to sketch

out what the answer might be, to get a sense of what it'll look like.

This can be done without any formal modeling or any kind of statistical

testing or things like that, just to get a good picture of what it might be.

The next stage, the third stage, is formal modeling.

So if your sketch works out,

you've got the right data and it seems appropriate to move on,

the formal modeling phase is the way to specifically write

down what questions you're asking and what parameters you're trying to estimate.

And it also provides a framework for challenging your results.

So just because you've come up with an answer in the exploratory data analysis

phase doesn't mean that it's necessarily going to be the right answer, and

you need to be able to challenge your results through a variety of

approaches, whether that's sensitivity analysis or other types of analysis.

So challenging your model and developing a formal framework is really

important to making sure that you can develop robust evidence for

answering your question.
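
To make the idea of challenging a result a bit more concrete, here is one possible form of sensitivity analysis, sketched in Python: re-estimate a parameter with each observation left out in turn and see how much the estimate moves. The lecture does not prescribe this particular technique; the leave-one-out approach, the simple least-squares slope, and the toy numbers are all assumptions chosen for illustration.

```python
def ols_slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of y on x for a simple linear model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def leave_one_out_slopes(xs: list[float], ys: list[float]) -> list[float]:
    """Re-estimate the slope once per observation, each time dropping that one."""
    return [
        ols_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Hypothetical data: the last observation is far off the trend of the others.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

full_slope = ols_slope(xs, ys)
loo_slopes = leave_one_out_slopes(xs, ys)

print(f"slope using all data: {full_slope:.2f}")
print(f"leave-one-out range:  {min(loo_slopes):.2f} to {max(loo_slopes):.2f}")
```

If the estimate changes a lot when a single observation is removed, as it does here, that is a sign the result rests on very little of the data and deserves a closer look before it is interpreted.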

The next phase is interpretation. So once you've done your analysis and your

formal modeling, you want to think about how to interpret your results, and

there are a variety of things to think about in the interpretation phase

of the data science project.

The first is to think about how your results jibe with what

you expected to find when you were first asking the question.

And also you want to think about the totality of the evidence

that you've developed.

At this point, you've probably done many different analyses,

you've probably fit many different models.

And so you have many different bits of information to think about, and

part of the interpretation phase is to

assemble all that information and weigh the different pieces of evidence,

so that you know

which are more reliable, which are more important than others, and to get a sense

of the totality of evidence with respect to answering the question.

3:45

The last phase is the communication phase.

Any data science project that is successful will want to

communicate its findings to some sort of audience.

Now that audience may be internal to your organization, it may be external,

it may be a large audience or even just a few people.

But communicating your findings is an essential part of data science

because it informs the data analysis process and

it translates your findings into action.

So that's the last part, which is not a formal part of a data science project

necessarily, but often there will be some decision that needs to be made or

some action that needs to be taken.

And the data science project will have been conducted in support

of making a decision or taking an action.

So that last phase will depend on more than just the results of

the data analysis, and may require

inputs from many different parts of an organization or from other stakeholders.

So ultimately, if the decision is made,

the data analysis that was done will inform that decision, and

the evidence that was collected will support that decision.

So these are roughly the five phases of a data science project.

There's the question, there's exploratory data analysis, there's formal modeling,

there's interpretation, and there's communication.

4:59

Now, there is another approach that can be taken,

and it's very often taken in data science projects.

And that is to really start with the data and

to start with an exploratory data analysis.

So often there will be a data set available, but

it won't be immediately clear what the data set will be useful for.

So it can be useful to do some exploratory data analysis, to look at

the data, to summarize it a little bit, make some plots, and see what's there,

and to generate some interesting questions based on the data.

So this is sometimes called hypothesis generating because it produces

questions that weren't already there.

Once you've produced the questions that you want to ask,

based on your initial exploratory data analysis,

it may be useful to get more data or other data

and do an exploratory data analysis that's specific to your question.

And then continue with the formal modeling, interpretation, and

communication.

One thing that you have to be wary of is doing the exploratory data analysis in one

data set, developing the question, and then going back to the same data set

and pretending like you hadn't done the exploratory data analysis before,

coming at it with, say, a fresh question

that goes on to the rest of the stages.

This can often be a recipe for bias in your analysis,

because the question and the results were derived from the same data set.

So it's important to be careful about doing that and to try to obtain other data

when you're using the data to generate the questions in the first place.
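
The lecture's advice is to obtain other data. When that isn't possible, one commonly used alternative, which is not something the lecture itself describes, is to set part of the data aside before exploring and to use only that untouched part for the later stages. The Python sketch below shows that idea under stated assumptions; the function name, the 50/50 split, and the seed are hypothetical choices.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_for_exploration(
    records: Sequence[T],
    explore_fraction: float = 0.5,  # assumed split, purely illustrative
    seed: int = 0,
) -> tuple[list[T], list[T]]:
    """Randomly split records into an exploration set and a held-out set."""
    if not 0.0 < explore_fraction < 1.0:
        raise ValueError("explore_fraction must be strictly between 0 and 1")

    # Shuffle indices with a fixed seed so the split can be reproduced.
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)

    cut = int(len(indices) * explore_fraction)
    explore = [records[i] for i in indices[:cut]]
    confirm = [records[i] for i in indices[cut:]]
    return explore, confirm


# Hypothetical records; in practice these would be rows of the data set.
records = list(range(10))
explore, confirm = split_for_exploration(records, explore_fraction=0.5, seed=42)

# Generate questions using `explore` only; leave `confirm` untouched until
# the formal modeling stage.
print(len(explore), len(confirm))
```

The split has to happen before any plots or summaries are made; once the held-out half has been looked at, it no longer provides an independent check.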

So this is the second approach to data science that can be very useful and can

often result in many interesting questions that are generated from the data.

Data science projects have a variety of phases, and it's important to

understand which phase you're in so that you know how to progress and

how to move forward with any data science project.
