Over the past few years we have all been enthralled with the buzz generated by IoT. Now it looks like it’s time for Apache Spark to take its place in the lexicon of Big Data buzzwords. While researching trends on Google, I was surprised to find that the way people use the term IoT today has been creeping closer and closer to the expression Big Data (but that’s a discussion for another day). For this article we will limit our discussion to Apache Spark.
What is Apache Spark
Apache Spark is a highly scalable, open source cluster computing framework and data processing engine. Originally developed at UC Berkeley’s AMPLab in 2009, it was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation (ASF) in 2013 and is now distributed under the Apache License 2.0.
Spark provides a unified and comprehensive framework that can handle the various requirements of processing large datasets. It offers high-level APIs in Java, Scala, Python and R, and it also provides a rich set of higher-level tools referred to as libraries. Here are some of the libraries included in the Spark ecosystem:
- Spark Streaming – Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Unlike the MapReduce framework, Apache Spark looks beyond batch processing. Using Spark Streaming, you can ingest data from various sources such as Kafka, Flume, Twitter, Kinesis or TCP sockets. It works as a micro-batch processing method: the input data is divided into small batches that form a discretized stream, or DStream. A DStream is represented as a sequence of RDDs.
- Spark SQL & DataFrames
  - SQL – Spark SQL is the Spark module for processing structured data. It allows you to query datasets using SQL-like queries, and BI tools can connect to it over the JDBC API. Spark SQL can also be used to read data from an existing Hive installation.
  - DataFrames – When you run Spark SQL from another programming language, the results are returned as a DataFrame. A DataFrame is a distributed collection of data organized into named columns. You can relate this to a table in the traditional SQL world or a data frame in Python or R, but with richer features under the hood.
- Spark MLlib – MLlib is Spark’s machine learning library. It is scalable and consists of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. It is divided into two packages: spark.mllib, which contains the API built on top of RDDs, and spark.ml, which contains a higher-level API built on top of DataFrames. Using spark.ml is now recommended, since the DataFrames API provides more flexibility and versatility.
- Spark GraphX – GraphX is the Spark component for graphs and graph-parallel computation. It extends the Spark RDD by introducing a Graph abstraction with properties attached to each vertex and edge.
There are a few more libraries, including SparkR and Bagel, along with several other applications that leverage Spark.
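The micro-batch idea behind Spark Streaming can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark's API: the function `micro_batches` and the sample events are hypothetical, and each small list merely stands in for one RDD of a DStream.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly endless) stream into small batches.

    Assumption: Spark Streaming cuts batches by time interval; a record
    count is used here only to keep the toy example deterministic.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical incoming events, e.g. words read from a socket.
events = ["spark", "kafka", "spark", "flume", "spark", "kafka", "kinesis"]

# Each batch plays the role of one RDD; the list of batches plays the DStream.
dstream = list(micro_batches(events, batch_size=3))

# The same batch-style computation (a word count) is applied to every batch.
counts_per_batch = []
for batch in dstream:
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    counts_per_batch.append(counts)

print(counts_per_batch)
```

The point of the sketch is that the per-batch logic is ordinary batch code; only the slicing of the stream is specific to streaming.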
Apache Spark – RDD
A Resilient Distributed Dataset, or RDD, is the basic abstraction in Spark. An RDD is an immutable, partitioned collection of elements that can be operated on in parallel. These properties make an RDD both efficient and resilient (i.e. fault-tolerant).
The efficiency comes from the ability to process the partitions of an RDD in parallel across the cluster. The resilience comes from tracking the data lineage: because Spark records how each RDD was derived, a lost partition can be recomputed from the data it was derived from.
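To make the lineage idea concrete, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not how Spark is implemented: the `ToyRDD` class and its methods are hypothetical, partitions are plain tuples, and everything runs in one process.

```python
class ToyRDD:
    """A tiny stand-in for an RDD: immutable partitions plus lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # list of tuples, set only for the root dataset
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the per-element function that was applied
        self.cache = {}       # partition index -> computed tuple

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def map(self, fn):
        # A transformation never changes data in place; it returns a new dataset.
        return ToyRDD(parent=self, fn=fn)

    def partition(self, index):
        if self.parent is None:
            return self.source[index]
        if index not in self.cache:
            # Rebuild only this partition by replaying the recorded lineage.
            self.cache[index] = tuple(self.fn(x) for x in self.parent.partition(index))
        return self.cache[index]

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.partition(i)]


numbers = ToyRDD(source=[(1, 2), (3, 4), (5, 6)])
squares = numbers.map(lambda x: x * x)

before = squares.collect()
del squares.cache[1]        # simulate losing one computed partition
after = squares.collect()   # the missing partition is recomputed from its parent

print(before == after)
```

Because each partition is computed independently of the others, a real engine is free to hand them to different machines; the toy version simply loops over them.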
Apache Spark – Sparking Interest
Everyone working in the Big Data vertical has heard the name Apache Spark more than once. It has become considerably more popular than Apache Hadoop itself.
Below is a comparison of the two trends from 2013 (when Spark was donated to the ASF) until April 2016.
Apache Spark – Fact Check
As with any hot new topic, there are quite a lot of misconceptions about the Apache Spark framework. To bust those myths we need to understand what Spark isn’t, as much as what it is.
Apache Spark is NOT an In-memory Computation Framework
The biggest misconception today is that Spark is just an in-memory “thing”. This is not at all true. Rather than being a purely in-memory engine, Spark uses the available memory to cache data, and this cached data isn’t stored there forever. For a computation engine to be called “in-memory”, it should hold the data in memory for a longer period of time. Spark manages its cache with the Least Recently Used (LRU) algorithm: cached data is never modified in place, and it can be evicted when it is the least recently used.
If the data you are processing fits into the memory available to Spark, you can treat it as an in-memory processing framework (but this is just an assumption). In reality, other technologies already use LRU for memory caching, and they aren’t referred to as in-memory technologies.
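For readers unfamiliar with LRU eviction, the policy itself fits in a few lines of Python. This is a generic sketch of the algorithm, not Spark's cache: the `LRUCache` class and the block names are hypothetical, and capacity is counted in entries purely to keep the example small.

```python
from collections import OrderedDict


class LRUCache:
    """A minimal least-recently-used cache with a fixed number of slots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest access first, newest access last

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # reading a block marks it as recently used
        return self.blocks[key]

    def put(self, key, value):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block


cache = LRUCache(capacity=2)
cache.put("block-a", [1, 2, 3])
cache.put("block-b", [4, 5, 6])
cache.get("block-a")             # touch block-a, so block-b is now the oldest
cache.put("block-c", [7, 8, 9])  # over capacity: block-b is evicted

remaining = list(cache.blocks)
print(remaining)
```

Data that keeps being read stays in the cache, while data that is not touched eventually makes room for newer entries, which is why cached data cannot be assumed to live in memory forever.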
Apache Spark Provides a Unified Platform
Spark brings multiple options in a single package. It does not limit itself to batch processing like MapReduce does; it scales beyond it. It can process real-time streams as well as structured and unstructured data. The possibilities also include graph processing and machine learning using commonly used algorithms.
You no longer have to combine different frameworks or processing engines, such as MapReduce for batch processing, Storm or Spring XD for real-time data processing, and Giraph for graph processing. Part of the beauty of Spark is that it is a single solution for most of today’s Big Data requirements.
Spark is NOT a replacement for MapReduce
Another misconception surrounding Spark is that it can be used as a replacement for MapReduce or Hadoop.
Don’t be confused by the previous fact: Spark was never meant to replace MapReduce. It was designed to address the challenges and limitations that MapReduce faces. There are still use cases that are well suited to MapReduce, but not all data processing fits into the Map and Reduce pattern. This is where Spark can help us out. It can co-exist with, grow alongside and even outshine MapReduce, but not as a replacement.
There is more to Hadoop than just MapReduce. How can we forget HDFS, the distributed file system that provides cheap and reliable storage? Spark is not going to replace Hadoop either. What it is going to do is run on top of Hadoop, or access data from HDFS for processing.
Spark is NOT Limited to Hadoop
Spark is not limited to running on top of a Hadoop cluster. It is designed to run on a variety of platforms and distributed systems. Outside of Apache Hadoop, Spark can run as a standalone cluster by itself. You can also use a different cluster manager such as Apache Mesos.
Even within Hadoop, Spark can run on Hadoop V1 (Spark Inside MapReduce) and Hadoop V2 (over YARN).
Spark Provides a Single Programming Abstraction
Spark provides a single programming abstraction called the Resilient Distributed Dataset (RDD). This abstraction is understood by Spark’s APIs as well as its libraries. The advantage of using the RDD as a single abstraction is that the data does not have to be re-formatted. Stream data is a series of RDDs that can be repurposed for batch processing, for running real-time SQL queries, or even for applying machine learning algorithms. This saves the time that would otherwise be spent converting data from one abstraction to another.
Spark is Fast
Apache Spark is fast, but the speed depends on the kind of operation you are performing. Figures quoted on the Internet range from 10 times to 100 times faster than MapReduce. I believe, however, that the actual speed-up depends on the type of computation performed.
Take, for instance, a job that iterates over the same set of data. This kind of job will be considerably faster in Spark, easily outstripping MapReduce, because of Spark’s ability to cache the data in memory. This data will be “hot” since it is iterated over and over. All in all, this means the chances of the cached data being evicted (under the LRU algorithm) are slim.
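The effect of caching on an iterative job can be shown by simply counting how often the source data is read. This is a toy illustration in plain Python, not a benchmark and not Spark code: `load_dataset` is a hypothetical stand-in for an expensive read from disk or HDFS.

```python
reads = {"count": 0}


def load_dataset():
    """Hypothetical stand-in for an expensive read from disk or HDFS."""
    reads["count"] += 1
    return list(range(1, 6))


def iterate_without_cache(iterations):
    # Every pass goes back to the source, as a chain of independent jobs would.
    total = 0
    for _ in range(iterations):
        total += sum(load_dataset())
    return total


def iterate_with_cache(iterations):
    # The data is loaded once and kept in memory for all later passes.
    data = load_dataset()
    total = 0
    for _ in range(iterations):
        total += sum(data)
    return total


reads["count"] = 0
uncached_total = iterate_without_cache(10)
uncached_reads = reads["count"]

reads["count"] = 0
cached_total = iterate_with_cache(10)
cached_reads = reads["count"]

print(uncached_total, uncached_reads)
print(cached_total, cached_reads)
```

Both versions compute the same result; the only difference is how many times the expensive load runs, and that difference grows with the number of iterations.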
The other area where Spark proves to be faster than MapReduce is the shuffle. Thanks to Spark’s design, shuffles are carried out in memory.
Lastly, each executor has an efficient way of starting the tasks assigned to it. MapReduce starts up a completely new JVM each time it has to execute a task. Spark’s executors, meanwhile, fork a new thread to execute a task. We could spend a whole new article discussing Spark’s executors and tasks. To put it simply, data processing in Spark is carried out on the worker nodes that run the executors. Each executor is responsible for a certain part, known as a task, of the entire processing job.
What this means is that Spark can be fast for certain kinds of processing operations and has some shuffle-related advantages. The way it forks threads to execute tasks is Spark’s big advantage over MapReduce here.
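The thread-per-task model can be sketched with Python's standard thread pool. This is only an analogy under stated assumptions: the pool stands in for one long-running executor, `run_task` and the sample partitions are hypothetical, and Python threads are used to illustrate the scheduling model rather than to demonstrate a speed-up.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """One task: process one slice of the data and report where it ran."""
    return sum(partition), os.getpid()


partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# The pool plays the role of one long-running executor process; each task is
# handed to a thread inside it instead of to a freshly started process.
with ThreadPoolExecutor(max_workers=2) as executor:
    outcomes = list(executor.map(run_task, partitions))

partial_sums = [total for total, _ in outcomes]
process_ids = {pid for _, pid in outcomes}

print(partial_sums)
print(len(process_ids))
```

All four tasks report the same process ID, which is the point: the cost of starting the process is paid once, and each further task only costs a thread.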
Spark can Access Data from Multiple Sources
Spark can access data not just from HDFS but also from a number of different storage solutions. The following list gives you an idea of its range.
- Apache HDFS
- Apache HBase
- Apache Hive
- Apache Cassandra
- Amazon S3
- Alluxio (formerly known as Tachyon)
In conclusion, Spark is faster in comparison to the current frameworks, but the speed varies depending on a number of factors. MapReduce still has its use cases and can’t be replaced by Spark, and Spark is certainly NOT an in-memory processing technology.