Before looking at what Apache Spark is and how it helps in data pipelines, let's review the history behind big data and MapReduce. MapReduce introduced the ability to process vast amounts of data on commodity hardware. MapReduce implementations inherently provided fault tolerance, locality-aware scheduling, and load balancing. The programming model supported an acyclic flow of data through the system. This works well for most kinds of applications; however, there are two big classes of applications that were not completely and satisfactorily served by the existing programming models and implementations of MapReduce. Those application classes are:
- Iterative applications
Applications where the output of one stage is fed back as input to an earlier stage to refine results. Machine learning applications are usually of this nature: the same dataset is processed again and again to refine a model. Hadoop, the prevalent MapReduce implementation, was inefficient for this use case because in Hadoop communication between two stages of a pipeline happens through the HDFS file system, that is, through disk. A stage has to process its data and then write it to disk, where the next stage picks it up again. This leads to delays in processing, and there is no optimization to support iterative applications.
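The cost of that per-stage disk round-trip can be sketched with a toy Python loop. This is purely illustrative (the file layout and the `refine` step are invented, not Hadoop APIs): both functions compute the same result, but the first pays a disk read and write on every iteration, the way chained Hadoop jobs do.

```python
import json
import os
import tempfile

def refine(values):
    """Toy 'refinement' step: move each value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]

def iterate_via_disk(values, iterations):
    """Hadoop-style: every iteration round-trips through the file system."""
    path = os.path.join(tempfile.mkdtemp(), "stage_output.json")
    with open(path, "w") as f:
        json.dump(values, f)
    for _ in range(iterations):
        with open(path) as f:          # next stage reads its input from disk
            values = json.load(f)
        values = refine(values)
        with open(path, "w") as f:     # ...and writes its output back to disk
            json.dump(values, f)
    with open(path) as f:
        return json.load(f)

def iterate_in_memory(values, iterations):
    """Spark-style: the same computation, keeping data in memory throughout."""
    for _ in range(iterations):
        values = refine(values)
    return values
```

Both return identical results; only the cost per iteration differs, and that cost is paid once per iteration, which is exactly where iterative workloads suffer.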
- Interactive applications
These are applications that are ETL in nature, with a console where data is presented. A user can change the input and then ask for a different set of data based on that input. Again, in Hadoop, each such request is treated independently, and every request fetches data from disk. This leads to delays, and the application no longer feels interactive.
To address the challenges of these applications, different tools were introduced, such as Apache Storm for stream processing and Hive and Pig for interactive use of Hadoop, but none of these tools solved both problems at the same time. They also lacked an abstraction general enough to be reused for arbitrary workloads.
How does Apache Spark solve the challenges of these types of applications?
Apache Spark: In-memory data storage and processing
From the above analysis, we understood that the basic problem with Hadoop's implementation of MapReduce is disk access. If data that would otherwise be read from disk can be read from memory, the system becomes far more efficient. Apache Spark does exactly that: data is kept in memory as abstract entities called Resilient Distributed Datasets (RDDs). More on that later.
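As a rough mental model (not Spark's actual implementation, and the class and names below are invented), an RDD can be pictured as an immutable in-memory dataset that records which transformation of which parent produced it:

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable data plus a recorded transformation."""

    def __init__(self, data, parent=None, transform=None):
        self._data = list(data)
        self.parent = parent          # the dataset this one was derived from
        self.transform = transform    # the step that produced it

    def map(self, fn):
        return MiniRDD((fn(x) for x in self._data),
                       parent=self, transform=("map", fn))

    def filter(self, pred):
        return MiniRDD((x for x in self._data if pred(x)),
                       parent=self, transform=("filter", pred))

    def collect(self):
        return list(self._data)

base = MiniRDD([1, 2, 3, 4])
evens_doubled = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
```

Real RDDs differ in important ways — they are lazily evaluated and partitioned across a cluster rather than computed eagerly on one machine — but the chain-of-immutable-datasets picture is the useful part of the sketch.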
Two questions come to mind when we say data is kept in memory. First, how does fault tolerance work, since data may be lost if a node goes down? Second, what if the data is larger than the memory available on a node? Fault tolerance is handled by maintaining metadata (the steps that led to the creation of each RDD) describing how the RDDs on a node were created; if the node goes down, Spark can re-run those steps and recover the data. More on how RDDs are created and used in Spark later.
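That recovery idea can be sketched in a few lines of Python (again a toy model with invented names, not Spark's real classes): the dataset caches its data in memory but also keeps the steps that built it, so losing the cache is survivable.

```python
class LineageRDD:
    """Toy sketch of lineage-based fault tolerance.

    Data is cached in memory, but the object also records how it was built
    (its lineage), so the data can be recomputed if the cache is lost,
    e.g. when the node holding it goes down.
    """

    def __init__(self, source, steps=()):
        self.source = source          # the original input records
        self.steps = steps            # ("map"/"filter", fn) pairs: the lineage
        self._cache = self._compute()

    def _compute(self):
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:                     # "filter"
                data = [x for x in data if fn(x)]
        return data

    def map(self, fn):
        return LineageRDD(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return LineageRDD(self.source, self.steps + (("filter", pred),))

    def lose_cache(self):
        """Simulate the node holding this data dying."""
        self._cache = None

    def collect(self):
        if self._cache is None:       # recovery: replay the lineage
            self._cache = self._compute()
        return self._cache
```

For example, after `rdd = LineageRDD([1, 2, 3]).map(lambda x: x + 1)`, calling `rdd.lose_cache()` and then `rdd.collect()` replays the recorded `map` step and still yields the right result. The notable design point, which Spark shares, is that recovery stores only the (small) lineage, not a full replica of the data.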
Second, if the data is larger than the available memory, it is written to disk. This affects performance, but it is an edge case rather than the normal working condition.
This was a brief introduction to Apache Spark. We will continue this thread in a coming post. Stay tuned.
If you see something missing or wrong, please share it and we will fix it.