Apache Spark – A Basic Understanding


Before diving deep into how Apache Spark works, let's go over the jargon of Apache Spark.

Job: A piece of code that reads some input from HDFS or the local file system, performs some computation on the data, and writes some output data.
Stages: Jobs are divided into stages. Stages are classified as map or reduce stages (this is easier to understand if you have worked with Hadoop and want to correlate the two). Stages are divided at computational boundaries: not all computations (operators) can be executed in a single stage, so a job runs over many stages.
Tasks: Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor.
DAG: DAG stands for Directed Acyclic Graph; in the present context it's a DAG of operators.
Executor: The process responsible for executing a task.
Driver: The program/process responsible for running the job over the Spark engine.
Master: The machine on which the Driver program runs.
Slave: The machine on which the Executor program runs (also called a worker).
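
To make the relationship between jobs, stages, and tasks concrete, here is a toy sketch in plain Python. It is not Spark code and uses no Spark API: the Operator class, the needs_shuffle flag, and both helper functions are hypothetical names invented for this illustration. It assumes a stage boundary falls wherever an operator has to regroup data across partitions (a reduce-like step), and that every stage runs over the same number of partitions, which is a simplification.

```python
# Toy model only: none of these names are part of the Spark API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    needs_shuffle: bool  # True for reduce-like operators that regroup data


def split_into_stages(operators):
    """Start a new stage whenever an operator needs a shuffle."""
    stages = [[]]
    for op in operators:
        # A shuffle cannot be pipelined with the work before it,
        # so it opens a new stage (unless the current stage is still empty).
        if op.needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


def tasks_for(stages, num_partitions):
    """One task per partition for each stage (simplifying assumption)."""
    return [
        (stage_id, partition)
        for stage_id in range(len(stages))
        for partition in range(num_partitions)
    ]


# A hypothetical job: a chain of operators from input to output.
job = [
    Operator("read", False),
    Operator("map", False),
    Operator("filter", False),
    Operator("reduceByKey", True),
    Operator("map", False),
    Operator("write", False),
]

stages = split_into_stages(job)
tasks = tasks_for(stages, num_partitions=4)

print(stages)      # two stages, split at reduceByKey
print(len(tasks))  # 2 stages x 4 partitions
```

In this model the job above yields two stages and eight tasks. A real Spark job can use a different number of partitions in each stage, so treat the task count as illustrative only.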

All jobs in Spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible. …
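
The idea of a DAG of operators, and of combining operators, can be sketched in plain Python. This is an illustration under stated assumptions, not how Spark implements its optimizer: the operator names and the fuse helper are hypothetical, and the only library used is graphlib from the Python standard library (Python 3.9+). The first part derives a valid execution order from the graph; the second combines two per-record operators into one function so the data is traversed once instead of twice.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each operator maps to the set of operators whose output it consumes.
# The operator names are hypothetical.
dag = {
    "read": set(),
    "parse": {"read"},
    "filter": {"parse"},
    "count": {"filter"},
}

# A topological order runs every operator after the ones it depends on.
order = list(TopologicalSorter(dag).static_order())
print(order)


def fuse(*funcs):
    """Combine per-record functions into one, applied left to right."""
    def fused(record):
        for f in funcs:
            record = f(record)
        return record
    return fused


def double(x):
    return x * 2


# Two separate passes over the data...
two_passes = [double(x) for x in [int(r) for r in ["1", "2", "3"]]]

# ...versus one pass with the operators combined.
parse_and_double = fuse(int, double)
one_pass = [parse_and_double(r) for r in ["1", "2", "3"]]

print(one_pass == two_passes)
```

Both versions produce the same values; the combined form simply avoids materializing the intermediate list.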

Read More on Datafloq
