HDFS is an integral module of Apache Hadoop and one of the core components that make up the Hadoop ecosystem. Apache Hadoop has proven to be a robust framework with exceptional distributed computing capability, and the pace at which enterprises have adopted it is remarkable.
What is HDFS
HDFS is a Java-based distributed file system which provides high-throughput access to application data. It provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
HDFS combined with YARN provides an ideal data platform and computing solution in the form of Apache Hadoop, a framework that can reliably be used for enterprise data. Add to this linear storage growth, which can be accomplished simply by adding more “commodity” hardware to the cluster.
HDFS works in a Master/Slave architecture. An HDFS cluster consists of a NameNode, which is the master server; it manages the file system namespace and controls access to files. Along with the NameNode, there are a number of DataNodes, which are responsible for storing the actual data on their disks. HDFS provides a namespace where users can store files. Each file is split into one or more blocks, and the DataNodes hold these blocks. The NameNode handles namespace operations such as opening, closing, renaming and moving files or directories, and it stores the block-to-DataNode mapping for each and every file. When a client requests a file, the NameNode provides the locations of its blocks on the DataNodes. The DataNodes serve the read and write requests, and they also create, delete and replicate blocks upon instruction from the NameNode.
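To make the division of labor concrete, here is a minimal sketch (not Hadoop code) of the two mappings the NameNode maintains: file-to-blocks and block-to-DataNodes. The class and method names, the round-robin placement, and the 128 MB block size are assumptions for illustration; real HDFS uses rack-aware placement.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size in recent versions

class NameNodeSketch:
    def __init__(self):
        self.file_to_blocks = {}   # namespace: file path -> ordered block ids
        self.block_locations = {}  # block id -> set of DataNode ids

    def create_file(self, path, size_bytes, datanodes, replication=3):
        """Split a file into blocks and assign each block to DataNodes."""
        num_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
        blocks = [f"{path}#blk_{i}" for i in range(num_blocks)]
        self.file_to_blocks[path] = blocks
        for i, blk in enumerate(blocks):
            # round-robin placement for the sketch; real HDFS is rack-aware
            targets = {datanodes[(i + r) % len(datanodes)]
                       for r in range(replication)}
            self.block_locations[blk] = targets

    def get_block_locations(self, path):
        """What a client asks the NameNode for before reading from DataNodes."""
        return [(blk, sorted(self.block_locations[blk]))
                for blk in self.file_to_blocks[path]]

nn = NameNodeSketch()
nn.create_file("/logs/app.log", 300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
# A 300 MB file spans 3 blocks of up to 128 MB each
print(len(nn.file_to_blocks["/logs/app.log"]))  # 3
```

Note that the NameNode only hands out block locations; the actual data transfer happens directly between the client and the DataNodes.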
NameNode High Availability
Looking at the architecture above, it is clear that the NameNode is the heart of HDFS. All file requests from clients pass through the NameNode, which provides the locations of a file’s blocks on the different DataNodes.
Prior to Hadoop 2.0, the NameNode was a Single Point of Failure (SPOF) in an HDFS cluster. If the NameNode went down or became unavailable for any reason, the entire cluster was unavailable until the NameNode was brought back or recovered on different hardware.
There were earlier attempts to mitigate this with a Secondary NameNode. Under this approach, the namespace that the NameNode holds in memory is saved to disk as an FSImage file at regular intervals (a process known as checkpointing). Each transaction on the NameNode is recorded in an edits file, which is then rolled up into a single new FSImage. If the NameNode went down, the Secondary NameNode held the last known FSImage. However, it still lacked real-time data, since the transactions between the last checkpoint and the time of failure existed only in the edits file. Moreover, having the Secondary NameNode take over required manual intervention. In short, this solution came with significant tradeoffs.
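The checkpointing idea above can be sketched as follows (not Hadoop code): the namespace lives in memory, every mutation is appended to an edits log, and a checkpoint merges the edits into a new FSImage. All class, method and operation names here are assumptions for the example.

```python
class CheckpointSketch:
    def __init__(self):
        self.fsimage = {}   # last checkpointed namespace: path -> metadata
        self.edits = []     # transactions since the last checkpoint

    def apply(self, op, path, value=None):
        """Record a namespace transaction in the edits log."""
        self.edits.append((op, path, value))

    def checkpoint(self):
        """Roll the edits log into the FSImage (the Secondary NameNode's job)."""
        for op, path, value in self.edits:
            if op == "create":
                self.fsimage[path] = value
            elif op == "delete":
                self.fsimage.pop(path, None)
            elif op == "rename":
                self.fsimage[value] = self.fsimage.pop(path)
        self.edits = []  # edits are now reflected in the image

nn = CheckpointSketch()
nn.apply("create", "/a", "meta-a")
nn.apply("create", "/b", "meta-b")
nn.checkpoint()
nn.apply("rename", "/a", "/c")  # lives only in the edits log for now
# If the NameNode dies here, recovery from the last FSImage alone misses the rename
print(sorted(nn.fsimage))  # ['/a', '/b']
```

This is exactly the gap the text describes: the FSImage is always slightly stale, and the freshest transactions exist only in the edits file.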
A comprehensive High Availability framework for HDFS NameNode was designed under HDFS-1623.
Having a highly available NameNode in an HDFS cluster brings a substantial improvement in the reliability, administration and maintenance of the cluster.
NameNode HA Architecture
In a NameNode HA cluster setup, two different machines (ideally with identical hardware configurations) are configured as NameNodes. One NameNode is in the Active state while the other is in the Standby state. The active NameNode performs all client operations, including serving read and write requests. The standby NameNode maintains enough state to ensure a fast failover in the event the active NameNode goes down.
The architecture allows all of the DataNodes to send their block reports and heartbeats to both NameNodes. This is made possible by HDFS configuration in which a logical nameservice URI and its corresponding NameNodes are provided. This design ensures that both NameNodes receive exactly the same block information.
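For reference, the logical nameservice mentioned above is defined in `hdfs-site.xml` roughly as follows. The property names are standard Hadoop HA configuration keys; the nameservice ID (`mycluster`), NameNode IDs (`nn1`, `nn2`) and hostnames are placeholders for the example.

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>host1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>host2.example.com:8020</value>
</property>
```

Clients and DataNodes address the cluster by the nameservice ID rather than a single NameNode host, which is what lets both NameNodes receive block reports and heartbeats.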
The ZooKeeper Failover Controller (ZKFC) is responsible for health monitoring of the NameNode service, and it triggers an automatic failover when the active NameNode becomes unavailable. Each NameNode host runs a ZKFC process. ZKFC uses the ZooKeeper service for coordination, both to determine which NameNode is active and to decide when to fail over to the standby NameNode.
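The coordination role ZooKeeper plays here can be sketched as a single ephemeral lock (this is a simplification of ZooKeeper, not its API): whichever controller acquires the lock makes its NameNode active, and when the holder disappears, the other controller takes over. The class and identifier names are assumptions for the example.

```python
class LockServiceSketch:
    """Stand-in for the ephemeral ZooKeeper lock that ZKFC contends for."""
    def __init__(self):
        self.holder = None

    def try_acquire(self, who):
        """Grant the lock if free; report whether `who` holds it."""
        if self.holder is None:
            self.holder = who
        return self.holder == who

    def release(self, who):
        """Models session expiry when a NameNode or its ZKFC dies."""
        if self.holder == who:
            self.holder = None

zk = LockServiceSketch()
assert zk.try_acquire("zkfc-nn1")      # nn1 wins: its NameNode becomes active
assert not zk.try_acquire("zkfc-nn2")  # nn2 stays standby
zk.release("zkfc-nn1")                 # active NameNode (or its ZKFC) dies
assert zk.try_acquire("zkfc-nn2")      # automatic failover: nn2 goes active
```

In the real system, the ephemeral lock vanishes automatically when the holder's ZooKeeper session expires, which is what makes the failover automatic rather than manual.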
There are two major implementations of NameNode HA:
- Quorum-based Storage
- Shared Storage
The quorum-based implementation of NameNode HA uses the Quorum Journal Manager (QJM).
This is currently the most common and reliable implementation for NameNode HA.
The standby NameNode synchronizes with the active NameNode through a group of hosts called JournalNodes. Whenever the active NameNode changes the namespace, it logs a record of the change to a majority of the JournalNodes. The JournalNodes receive these transaction edits, the filesystem journal, from the active NameNode on an ongoing basis. The standby NameNode reads the journal and applies the edits to its own namespace, keeping itself “completely” in sync with the active NameNode.
If the active NameNode stops responding or becomes unavailable, the standby NameNode first reads all remaining edits from the JournalNodes before presenting itself for election as the active NameNode.
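The majority-write rule above is what makes QJM tolerant of JournalNode failures: an edit counts as committed once a majority of JournalNodes acknowledge it, so losing a minority of them loses no committed edits. A minimal sketch (not QJM code; all names are assumptions):

```python
class JournalNodeSketch:
    def __init__(self):
        self.edits = []
        self.up = True

    def write(self, txid, record):
        """Acknowledge and store an edit, unless this JournalNode is down."""
        if not self.up:
            return False
        self.edits.append((txid, record))
        return True

def quorum_write(journals, txid, record):
    """An edit commits only if a majority of JournalNodes acknowledge it."""
    acks = sum(jn.write(txid, record) for jn in journals)
    return acks > len(journals) // 2

journals = [JournalNodeSketch() for _ in range(3)]
assert quorum_write(journals, 1, "mkdir /a")      # 3/3 acks: committed
journals[0].up = False                            # one JournalNode fails
assert quorum_write(journals, 2, "mkdir /b")      # 2/3 acks: still committed
journals[1].up = False                            # majority now lost
assert not quorum_write(journals, 3, "mkdir /c")  # 1/3 acks: not committed
```

This is also why JournalNode ensembles are deployed in odd numbers (typically 3 or 5): a cluster of 2N+1 JournalNodes can tolerate N failures.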
The shared-storage implementation of NameNode HA uses a shared storage space mounted on the NameNodes over the NFS protocol. Both NameNodes must have access to this directory, which can be a network share mounted on each of them.
Any time there is a change in the namespace, the active NameNode writes a log record to this shared directory. The standby NameNode constantly monitors the directory and applies each new edit to its own namespace.
If the active NameNode stops responding or becomes unavailable, the standby NameNode first reads all remaining edits from the shared directory before presenting itself for election as the active NameNode.
This method, however, is less reliable, since a network failure or a failure of the shared directory itself can be catastrophic.
NameNode HA – Important Aspects
Both implementations ensure that there is only one active NameNode at any given point in time. We do not want two NameNodes modifying the same namespace, which could lead to divergence or corruption of data; such a situation is called a “split-brain scenario”. A fencing mechanism must be in place to ensure that during a failover, the previous active NameNode is stopped completely and prevented from making any further edits to the namespace. This ensures that the new active NameNode takes over safely.
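One real fencing idea used by QJM is epoch numbering: each NameNode that takes over as active claims a higher epoch, and JournalNodes reject writes tagged with any older epoch, so a deposed active cannot keep writing even if it is still running. A minimal sketch of that idea (not Hadoop code; all names are assumptions):

```python
class FencedJournalSketch:
    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        """Called by a NameNode taking over as active; claims a higher epoch."""
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def write(self, epoch, record):
        """Accept an edit only from the newest epoch seen so far."""
        if epoch < self.promised_epoch:
            return False  # stale writer: fenced out
        self.edits.append((epoch, record))
        return True

jn = FencedJournalSketch()
jn.new_epoch(1)
assert jn.write(1, "edit from old active")
jn.new_epoch(2)                      # failover: new active claims epoch 2
assert not jn.write(1, "late edit")  # old active is fenced out
assert jn.write(2, "edit from new active")
```

With shared NFS storage there is no such built-in protection, which is one more reason a separate fencing mechanism (for example, forcibly terminating the old active) is mandatory there.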