0:08

actually decides how to execute your data analysis pipeline using a directed acyclic graph scheduler. And I think this is really important for understanding how Spark works, and it is one of the most fascinating features of Spark. So let's first start with the definition: what exactly is a directed acyclic graph?

So first of all, it's a graph. So it's a collection of nodes, which you see here denoted by letters A, B, C and so on, connected by edges. And in particular this is a directed graph, so these edges have a definite direction. They are actually arrows, and with this you can define the dependencies between one node and another node. And also, those graphs are acyclic: if you start from a node and then follow the arrows down, you can never go back to a previous node. Okay, so there cannot be a circular dependency in this.
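The acyclic property can be sketched with a tiny example. This is a toy illustration, not Spark code; the node letters follow the slide, but the exact edge list is an assumption:

```python
# A toy directed graph as an adjacency list: an arrow X -> Y means
# "X flows into Y" (equivalently, Y depends on X), as in the slide.
edges = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
    "G": ["D"],
}

def is_acyclic(graph):
    """Return True if following the arrows can never lead back to a node."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back-edge: we found a cycle
                return False
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in graph if color[n] == WHITE)

print(is_acyclic(edges))      # True: this graph is a DAG
edges["F"].append("A")        # add an arrow that closes a loop
print(is_acyclic(edges))      # False: now there is a circular dependency
```

The same depth-first check is what makes "follow the arrows down, never come back" precise: a cycle exists exactly when we revisit a node that is still on the current path.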

In fact, this is used a lot for dependency tracking. In particular, if you think about this as a dependency flow, you can think about, for example, B depending on A, and F depending on E. And you can think about a data analysis pipeline where A and G are your sources, then you go through several transformation steps, and then you get the final result, which is F. And so this is used for dependency tracking in other systems too, not just in Spark, and this is also known as keeping the lineage of your transformations, or keeping provenance.

And so in Spark the DAG is used like this: the nodes are your RDDs, and the arrows are your transformations. Okay? So you can describe your workflow as a, possibly complicated, graph of transformations between different RDDs.
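The idea that each RDD records which transformation and which parent it came from can be sketched in a few lines of plain Python. This is not the real Spark API, just a toy model of lazy transformations building a lineage graph (class and method names are invented):

```python
# A pure-Python sketch (not real Spark) of an RDD that remembers its
# parent and the transformation used to derive it: a lineage DAG.
class ToyRDD:
    def __init__(self, data=None, parent=None, transform=None, name="source"):
        self._data = data          # only set for source RDDs
        self.parent = parent       # the RDD this one was derived from
        self.transform = transform # function applied to the parent's elements
        self.name = name

    def map(self, fn):
        # Nothing is computed here: we only extend the graph (lazy evaluation).
        return ToyRDD(parent=self, transform=lambda xs: [fn(x) for x in xs],
                      name=f"map({self.name})")

    def flat_map(self, fn):
        return ToyRDD(parent=self,
                      transform=lambda xs: [y for x in xs for y in fn(x)],
                      name=f"flatMap({self.name})")

    def collect(self):
        # An action: walk the lineage back to the source, then replay it.
        if self.parent is None:
            return list(self._data)
        return self.transform(self.parent.collect())

lines = ToyRDD(data=["to be", "or not"])
words = lines.flat_map(str.split)         # graph so far: source -> flatMap
pairs = words.map(lambda w: (w, 1))       # source -> flatMap -> map
print(pairs.name)       # map(flatMap(source))
print(pairs.collect())  # [('to', 1), ('be', 1), ('or', 1), ('not', 1)]
```

Notice that calling `map` or `flat_map` does no work at all; only the action `collect` walks the arrows of the graph and executes them. That is exactly the laziness that lets Spark see the whole DAG before deciding how to run it.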

2:32

So what you see here, you remember we were comparing narrow against wide transformations. And here you have one element, one partition, which is depending on another partition. And this system is also used by Spark to recover lost partitions. So if for some reason, maybe one node fails, or some process goes out of memory, or there are some disk issues,

3:06

Spark knows exactly how to recover your data, by going back through the execution graph and re-executing whatever is needed to recreate your lost data. So, you know, a narrow operation is pretty easy, because it's a one-to-one connection between different partitions. On a wide transformation, instead, the dependency is of course more complex. So if we were to lose the first partition, in this case the one that contains A, 1, and 2, we would need to recreate the two source partitions.
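The recovery of a narrow dependency can be sketched like this. Again this is a toy simulation, not Spark; partition names and data are invented:

```python
# Hedged sketch: recovering a lost partition by replaying its lineage.
source_partitions = {"p0": [1, 2, 3], "p1": [4, 5, 6]}

# A narrow transformation: output partition i depends only on
# input partition i (a one-to-one connection between partitions).
def square_partition(part):
    return [x * x for x in part]

derived = {name: square_partition(part)
           for name, part in source_partitions.items()}

# Simulate losing one derived partition (node failure, out of memory, ...).
del derived["p1"]

# Recovery for a narrow dependency is cheap: recompute only the lost
# partition from its single parent; the other partitions are untouched.
derived["p1"] = square_partition(source_partitions["p1"])
print(derived["p1"])   # [16, 25, 36]
```

For a wide transformation the same replay would instead have to touch every parent partition that shuffled data into the lost one, which is why wide dependencies make recovery more expensive.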

So now let's look at a little more complicated example. Let's go back to our usual word count task.

4:02

Here we are chaining, so we are using our transformations to tell Spark how we want to process our data. So here, we want to apply first a flatMap, and then a map operation. Okay? And this is the result we want to obtain. So let's write this as a transformation with our diagrams. So you see here, the left RDD is our initial text RDD, where each element is a line. And then flatMap transforms this into another RDD where each element now instead is a word. And then we have a map that transforms each word into a key-value pair. And then, finally, groupByKey, that takes the sum of the counts for each word.
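That chain can be sketched in plain Python (not Spark; the input line is invented), with each step matching one arrow of the diagram:

```python
# The word-count chain: flatMap -> map -> group by key and sum.
lines = ["to be or not to be"]

words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = [(w, 1) for w in words]                       # map: word -> (word, 1)

counts = {}
for key, value in pairs:                              # group by key, sum values
    counts[key] = counts.get(key, 0) + value

print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The first two steps are local to each partition (narrow), while the grouping step needs all the pairs with the same key brought together, which is what makes it a wide transformation in Spark.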

5:15

In case we lose a partition, Spark knows the lineage, so it knows how the dependencies go, and it needs to recover everything that has been lost from the beginning. And the same for the second node: it's going to go back down to HDFS, read the relevant part of the data that has been lost, and then reprocess everything through to the final output.

So let's now take a look at a little more complicated example, where we have two different data sets being read from disk. So you see, the bottom two nodes are exactly the same operations that we've been doing before, but we can assume that there is, for example, another RDD which is being joined with this RDD. Join is another wide transformation that takes all the keys from the first RDD with their values, and joins them with the values in the second RDD that have the same key.

And so, here, you see that if you follow the colors, you can see that if a partition is lost at the very last RDD, then we can track all of its dependencies back. And there is also another very important feature which is accomplished by this DAG, and that is the execution order. So, by building this graph of dependencies, Spark can understand which parts of your pipeline can run in parallel. So here we see that the two sections, the two processings of the RDDs, are independent of one another. And so they can run independently, in parallel. And then, when they are both completed, the join operation can happen.
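The scheduling idea can be sketched with ordinary Python threads. This is only an analogy for what the DAG scheduler does; the branch contents are stand-ins for the two independent pipelines in the diagram:

```python
# The two branches of the DAG are independent, so they can run
# concurrently; the join step waits for both to complete.
from concurrent.futures import ThreadPoolExecutor

def branch_one():
    # stands in for: read dataset 1, flatMap, map, group by key
    return {"spark": 2, "dag": 1}

def branch_two():
    # stands in for: read dataset 2 and build (key, value) pairs
    return {"spark": "engine", "dag": "graph"}

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(branch_one)    # both branches are scheduled at once
    f2 = pool.submit(branch_two)
    left, right = f1.result(), f2.result()

# The join runs only after both branches finished: match up equal keys.
joined = {k: (left[k], right[k]) for k in left if k in right}
print(joined)   # {'spark': (2, 'engine'), 'dag': (1, 'graph')}
```

The DAG makes this safe to do automatically: since no arrow connects the two branches, the scheduler knows neither one can depend on the other's output.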

And also, a chain of local operations can be optimized by Spark and run together. For example, for our data set at the bottom, we have a flatMap and then a map operation; these two operations are both local, so they can be executed at the same time by the same process, without even actually