Earlier this month DataBricks provided an overview of Apache Spark’s next major release, Spark 2.0. The following post shows some of the changes in the abstraction, API and Libraries. Spark 2.0 is expected to be released in early June 2016.
What is Apache Spark
Apache Spark is a highly scalable open source cluster computing framework and data processing engine. Originally developed at UC Berkeley’s AMPLab in 2009, it went open source in 2010 under a BSD license. It was ultimately donated to ASF in 2013. It is now distributed under Apache License 2.0.
Spark provides a unified and comprehensive framework. This framework can capably handle the various requirements for processing large datasets. Spark provides you with high-level APIs in Java, Scala, Python and R. It is also provides a higher-level rich set of tools referred to as Libraries.
Spark 2.0 – What’s New
With the upcoming release of Spark 2.0 there has been some significant improvements in the API, Libraries and Abstraction layers. Spark 2.0 attempts to improve on these three components and is said to be 10X faster than Spark 1.x.
Let’s take a look at some of the changes in Spark 2.0.
More SQL Friendly – SQL 2003 Compliant
SQL is one of the primary interfaces Spark applications use. Spark 2.0 introduces a new ANSI SQL parser. The new parser provides good error reporting. Spark 2.0 will have the ability of subqueries (both correlated & uncorrelated). Spark 2.0 can run all the 99 TPC-DS queries.
This is a major improvement which can encourage moving of applications from the traditional SQL Engines to Spark.
Unified API – DataFrames & Datasets
DataFrames is a higher level structured data API introduced in Spark 1.3 in 2015. In a nutshell, DataFrame is a collection of rows with a schema. It provides better performance, ease-of-use and flexibility in comparison with RDD (Resilient Distributed Data) API.
For the users who prefer to use type safety a new API was introduced in Spark 1.6 called DataSets. DataSet is an attempt to provide type safety on top of DataFrame.
In Spark 2.0 the two APIs will be unified together into a single API. Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. The new Dataset API includes typed methods and untyped methods.
SparkSession – Single Entry Point
Spark 1.6 provided SparkContext API to connect to Spark cluster. There were several different context provided for different APIs. For instance to connect to SQL we required SQLContext and StreamContext for Streaming. While using DataFrames API a common confusion is to decide which “context” to use.
Spark 2.0 introduces SparkSession. SparkSession provides a single entry point for DataFrame and DataSet API for Spark. For now SparkSession will cover SQLContext & HiveContext. It will be extended to StreamContext as well.
Please note that the SQLContext & HiveContext will be present in Spark 2.0 for backward compatibility.
Spark as a Compiler – Faster Spark
Spark is known for its performance and speed. Spark 2.0 attempts to take this performance a step further. Spark 1.x – like many other modern data engines – uses the compilers which uses of various function calls and CPU cycles. These CPU cycles are pretty much spent on unwanted work.
Spark 2.0 includes the second generation Tungsten engine. This new engine works by taking the query plan and collapsing it into a single function, which eliminates all the unwanted function calls. The engine uses the CPU register for storing the intermediate data (unlike the traditional method of using memory for storing intermediate data). This method promises around 10X improvement in the performance, depending on the data you are executing.
Structured Streaming – Continous Applications
The current Spark streaming API called DStream was introduced in Spark 0.7. It provides the ability to stream real-time data and process it. Spark 2.0 introduces Structured Streaming.
Spark Structured Streaming is a declarative API that extends DataFrames & DataSets. Spark Structured Streaming is largely built on Spark SQL and also includes ideas from Spark Streaming. It is based on the Datasets API.
Spark Streaming, which uses what’s been called a “micro-batch” architecture for streaming applications, is among the most popular Spark engines. The new Structured Streaming engine will represents Spark’s second attempt at solving some of the tough problems that developers face when building real-time applications.
Essentially, Structured Streaming enables Spark developers to run the same type of DataFrame queries against data streams as they had previously been running against static queries. Thanks to the Catalyst optimizer, the framework figures out the best way to make this all work in an efficient fashion, freeing the developer from worrying about the underlying plumbing.
Upcoming releases of Spark 2.x will include more features and improvements in Spark Structured Streaming.
DataFrame based ML API
In Spark 2.0 Machine Learning “Pipeline” DataFrame-based API will become the primary Machine Learning API.
Spark has already made a mark by providing an easy-to-use, unified and fast data framework. With Spark 2.0 we can expect further improvements in the performance of Spark overall. We can look forward to the GA release of Apache Spark 2.0 in the upcoming days.