Hadoop has been around for a little over 10 years now. It provides you a scale-out and cost-effective solution to store and process large amount of data – which we loosely refer to as “Big Data”. More enterprises are adopting Hadoop with an objective to utilize the advantages of the framework. With this increase in adoption comes a rise in challenges such as – Using the right set of tools to derive maximum value out of the Hadoop cluster or choosing the right SQL-on-Hadoop solution. In this article, I will focus on the requirements you might consider before selecting your SQL-on-Hadoop solution.
Why are we really talking about SQL-on-Hadoop today? Especially in Hadoop where NoSQL is the recommended way to go. Schemas and Structures on write are not mandatory. Instead Schema-on-Read is the data analysis strategy.
Before the appearance of SQL-on-Hadoop a few years ago, access to Big Data was privilege for a few elite. Someone who has a firm grip on MapReduce, Pig, HBase and HDFS could give you all the answers.
What does SQL-on-Hadoop Offer?
SQL-on-Hadoop allows the ability to utilize the existing SQL familiar resources. You can keep using your existing visualization and BI reporting tools to analyze and report big data.
This means that SQL-on-Hadoop allows a bigger set of users to query, analyze and study the data using traditional and familiar SQL Queries. This ultimately results in a higher ROI to enterprise on the big data architecture.
Here we are looking at a combination of SQL robustness with expandable, inexpensive and reliable storage in comparison to costly EDW solutions.
Which SQL-on-Hadoop Solution?
Choosing a SQL-on-Hadoop solution today can be quiet overwhelming. For all familiar with the Hadoop ecosystem, you have a wide range of solutions. Every vendor has something of their own to offer. The number of options available today can be a big challenge for CIO or CTO for an enterprise.
The first name which comes to my mind when we talk about SQL-on-Hadoop is Apache Hive. This was one of the first engines which provided the ability to users familiar with SQL to query the data on a Hadoop cluster.
Today we have SQL engines available for Hadoop which can be overwhelming to mention in a single slide. Newer SQL-on-Hadoop solutions are showing up frequently and the existing ones are maturing at a rapid pace.
A few of the solutions currently available are:
- Apache HAWQ (incubating) aka Pivotal HDB
- Apache Drill
- Apache Impala
- Apache Phoenix
- Spark SQL
How do we choose the right SQL-on-Hadoop solution?
The answer cannot be simple since all of these tools are designed keeping SQL in mind. But they differ from within. You may want to zero-in your requirement and be sure to ask yourself the necessary questions. A few of the common requirements you need to consider are mentioned going forward in this article.
SQL-on-Hadoop – Ask Yourself
True SQL or SQL Like
The first and foremost question I would ask myself – Do I want SQL-like or True SQL?
In simple terms do I want a solution which is ANSI SQL standards compatible or a solution with limited or partial SQL capabilities. Currently almost all the SQL-on-Hadoop solutions around are SQL-like and NOT True SQL.
Does the SQL-on-Hadoop solution understand the common data storage formats available in Hadoop today?
Hadoop is more than a simple text storage. Today Hadoop has evolved in a number of different storage formats which are commonly available. A few of them to mention are ORCFile, Parquet and Avro. More the ability of the SQL-on-Hadoop solution to read the data stored in the native Hadoop format, lesser the need to have data redundancy. The redundancy will happen if you need to convert the storage format for the data so as the SQL-on-Hadoop solution can interact with it.
Is the solution accessible through REST API?
To have a RESTful API is a perfect environment for your developers. It increases the range of different connectivity options you can have with your solution. One common use-case today is having web based applications talk to your Big Data over REST APIs.
Custom Functions or User-defined functions
Does the solution provide you with the ability to create custom functions?
User Defined Functions or UDFs provides you with the ability to create custom functions for your database. For an efficient SQL engine you should have the ability to write these custom functions in several different programming languages. At the same time it is critical the SQL engine should be able to parallelize the execution of UDFs across all the nodes.
How efficiently does the solution handle complex joins?
When it comes to JOIN in RDBMS things can be easier to deal with. This is not always the case with Big Data, especially if we are talking about joining billions of rows together. Such joins would results in either massive I/O contention or memory exhaustion or both. Besides we don’t want to wait for hours for the results to show up. The SQL-on-Hadoop solution should be efficient enough to handle JOIN on massive data.
Hadoop allows the data is be distributed across the cluster, joining large data sets with different join strategies needs to be handled efficiently.
How does the query execution happen in the SQL engine?
A bare minimum Rule-Based Optimizer (RBO) cannot be sufficient. Especially if you are running the engine on a distributed data. RBO does not take the data distribution into account. The answer to this problem was Cost-Based Optimizer (CBO). CBO takes into account the data distribution, indexes and statistics of the table. CBO is also able to determine the cost involved in execution of a query which is helpful in optimizing the query.
Is the solution capable to determine the location of data?
On a large distributed system it is important to take processing to the data instead of moving the data to the processing. When evaluating a solution, you should be able to deliver high-performance by avoiding latency in moving the data around the cluster.
Capable of OLAP and OLTP
Is the engine suited for my OLAP/OLTP requirements?
Many engines may claim ability to execute OLAP as well as OLTP queries simultaneously. But in reality there may arise resource contentions when you try to execute an OLAP query on a large dataset.
Other Random but Important Considerations
Below are a handful of other considerations to keep in mind:
- Multi-User Workload or Resource Management or Resource Queues
- Parallel Data Import/Ingestion
- Primary and Secondary Indexes
- Data Compression
- Concurrency and Locking
SQL-on-Hadoop – Hard Choice
Making the right choice for your SQL-on-Hadoop solution is difficult. But SQL-on-Hadoop is a path that all the enterprises will eventually take. This article should provide you with a starting point to start considering the best options for you on your SQL-on-Hadoop journey.