HDFS has proven to be a scalable, fault-tolerant, distributed storage solution that is quickly being adopted across industries. Distributed storage, along with the ability to scale out linearly, makes the entire Hadoop framework very cost effective. With constant improvements in the Hadoop ecosystem, new and exciting features keep being added to the components running on top of Hadoop, HDFS Heterogeneous Storage being one of them.

What is HDFS

HDFS is a Java-based distributed file system that provides high-throughput access to application data. It provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks.

HDFS combined with YARN provides an ideal data and compute platform in the form of Apache Hadoop, a framework that can reliably be used for enterprise data. Add to this the ability to grow storage linearly, simply by adding more “commodity” hardware to the cluster.

Apache Hadoop – Brief Journey

Apache Hadoop recently had its 10-year anniversary! It has been a journey of tremendous improvement in the reliability and overall performance of the framework. Hadoop was specifically designed to run on commodity-grade hardware, keeping in mind that failures are bound to happen in a cluster. The distributed processing power combined with robust, durable storage generated a lot of interest.

For several reasons, the framework has traditionally been deployed on homogeneous computing nodes, one of them being the problems caused by uneven computing capacity in a heterogeneous cluster. Over time, improvements have made it possible for Hadoop to run seamlessly in a heterogeneous cluster as well.

Hadoop today is moving fast to meet enterprise-grade requirements, and one of those requirements is support for enterprise hardware.

Apache Hadoop – Storage Model

The hardware horizon is also changing quickly, and we are now seeing a new class of nodes with higher storage density and lower processing power. These nodes are designed with just one goal: more storage at a lower cost, with processing power as the trade-off. On the other hand, we are seeing faster drives, such as SSDs, that offer lower latency. Applications with random I/O patterns can benefit greatly if they can use such media.

Another important aspect of a storage medium to consider is throughput. The performance of ad-hoc and batch-processing applications is directly affected by the throughput of the storage medium they read from and write to.

Until Hadoop 2.3, HDFS used a single storage model: a DataNode treated all of its data directories as one uniform storage, with no way to distinguish the kind of medium behind them.

HDFS DataNode Single Storage Model

Apache Hadoop – Heterogeneous Storage Model

With the release of Hadoop 2.3, HDFS started supporting a heterogeneous storage model. So what does the HDFS heterogeneous storage model really do? It allows you to define the type of storage medium used on the nodes in the cluster. The notion of a Storage Type is used to identify the kind of underlying storage medium.

How does this fit into the whole picture? I previously mentioned the dense storage nodes with less processing power and large storage; I can now add these nodes to my cluster. Since these nodes are not really meant for any computing, I will use them for storage only, holding infrequently accessed or unused data that I can term archive data.

The HDFS heterogeneous storage model allows me to define the kind of data directory available on the nodes in my cluster. For the dense storage nodes described above, I can declare the storage type on the node as ARCHIVE.
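
As a rough sketch of how that declaration looks (the mount points below are hypothetical), each data directory listed in dfs.datanode.data.dir in hdfs-site.xml can be prefixed with its storage type; directories without a prefix default to DISK:

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>[DISK]/grid/0/hdfs/data,[ARCHIVE]/grid/1/hdfs/archive</value>
    </property>

On a storage-only node, every directory could be tagged [ARCHIVE], telling the NameNode that the node’s capacity should only hold archival replicas.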

The Hadoop 2.6 release made further improvements to the heterogeneous storage model with its phase 2 implementation, adding support for additional storage types such as SSD and RAM_DISK. The HDFS heterogeneous storage model currently supports the Storage Types ARCHIVE, DISK, SSD and RAM_DISK, with DISK being the default.

Also, starting with the Hadoop 2.6 release, you can attach a Storage Policy to a directory or a file in HDFS. A Storage Policy defines how many replicas are to be placed on each tier, or storage type. You can remove or change the policy on a directory or file; however, at this time you cannot modify existing policies or define custom Storage Policies.
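
As a quick sketch of what this looks like from the command line (the path is hypothetical, and the exact syntax varies slightly between releases; some 2.x releases expose the same operations through hdfs dfsadmin):

    # list the Storage Policies the cluster supports
    hdfs storagepolicies -listPolicies

    # place the replicas under /data/archive according to the COLD policy
    hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD

    # verify which policy is in effect
    hdfs storagepolicies -getStoragePolicy -path /data/archive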

HDFS DataNode with Multiple Storage Media

Heterogeneous Storage Model – Use cases

Using the HDFS heterogeneous storage model opens up some interesting avenues. You can now use Storage Policies such as Hot, Cold, Warm and All_SSD to control the placement of replicas. For example, you can set the All_SSD policy on the HDFS directory that holds the HBase data, and HBase applications can see a tremendous boost in performance from the improved IOPS. Or think of “cold” data that persists for a long time and is rarely accessed; its replicas can be pushed down to cheaper archival storage. You could even use RAM storage to keep a copy in memory for higher throughput and IOPS. A couple of these scenarios are sketched below.
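
A minimal sketch of two of these scenarios, assuming hypothetical paths and DataNodes that already expose SSD and ARCHIVE volumes:

    # keep all replicas of the HBase data on SSD for better IOPS
    hdfs storagepolicies -setStoragePolicy -path /apps/hbase/data -policy ALL_SSD

    # push rarely accessed data down to ARCHIVE storage
    hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD

    # existing blocks are not relocated automatically; the HDFS Mover migrates them
    hdfs mover -p /apps/hbase/data /data/cold

Changing a policy only affects where new blocks are written, which is why the Mover (or rewriting the data) is needed to bring existing blocks in line with the new policy.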

Conclusion

With the ever-increasing adoption of Hadoop and HDFS as a reliable and durable storage solution, the heterogeneous storage model gives us the ability to utilize the cluster more intelligently.