By now Apache Ambari needs no introduction. If you work in Big Data, Ambari should not be foreign to you. The focus of this article, however, is how to add a JournalNode to an Ambari-managed Hadoop cluster.

Apache Ambari Overview

Apache Ambari is a framework able to provision, manage, and monitor Apache Hadoop deployments. If you need more details, our blog post on Apache Ambari is a good place to familiarize yourself.

Hadoop & Apache Ambari

Ambari is completely open source, which is its biggest advantage: a number of contributors are helping it become a better and more robust tool. It provides a framework you can use to provision, manage, and monitor a Hadoop cluster.

Even with all its prowess in managing a Hadoop cluster, Ambari is still evolving. The current distribution provides a near-complete solution for provisioning a resilient and highly available Hadoop cluster. One of the key steps in making Hadoop highly available is setting up an HDFS High Availability architecture. Ambari provides an interface to set up HDFS High Availability (also known as NameNode HA).

Hadoop HDFS High Availability Architecture

Before we proceed with “How to add a JournalNode”, I would like to provide a brief overview of “What HDFS High Availability Architecture looks like”.

The following diagram provides a layout of HDFS High Availability Architecture.

Figure: Hadoop HDFS High Availability Architecture

In this architecture we have two NameNodes configured. One of the NameNodes is in the Active state while the other is in the Standby state. The active NameNode handles all client operations, including serving read and write requests. The standby NameNode maintains its state in order to ensure a fast failover in the event the active NameNode goes down.

The standby NameNode communicates and synchronizes with the active NameNode through a group of hosts called JournalNodes. JournalNodes receive the filesystem journal (or transactions) from the active NameNode. The standby NameNode reads these journals and updates its namespace to be in sync with the active NameNode.

The architecture allows all of the DataNodes to send their block reports and heartbeats to both NameNodes. This ensures the standby NameNode is “completely” in sync with the active NameNode.

The ZooKeeper Failover Controller (ZKFC) is responsible for HA monitoring of the NameNode service. It also triggers an automatic failover when the active NameNode is unavailable. Each NameNode has a ZKFC process. ZKFC uses the ZooKeeper service for coordination in determining which NameNode is active and when to fail over to the standby NameNode.

The Quorum Journal Manager (QJM) in the NameNode writes file system journal logs to the JournalNodes. A journal log is considered successfully written only when it has been written to a majority of the JournalNodes. At any given point in time only one NameNode can perform the quorum write. In the event of a split-brain scenario, this ensures that the file system metadata will not be corrupted by two active NameNodes.
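To make the majority rule concrete, here is a small sketch of the arithmetic. The function names are made up for illustration; the only assumption is the majority rule described above, where a write needs acknowledgements from more than half of the JournalNodes.

```shell
#!/bin/sh
# Majority quorum for N JournalNodes: floor(N/2) + 1 acknowledgements are
# needed, so at most floor((N-1)/2) JournalNodes can be down.
# (Hypothetical helper names, used only for this illustration.)
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
    echo "$n JournalNodes: quorum=$(quorum_size "$n"), tolerated failures=$(tolerated_failures "$n")"
done
```

Note that going from three to four JournalNodes does not raise the number of tolerated failures, which is why an odd number of JournalNodes is usually recommended.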

Add JournalNode to Ambari

The HDFS High Availability setup wizard gives you the option to assign a few nodes to run as JournalNodes. You generally will not need to add more JournalNodes, and HDFS High Availability works perfectly with the initial set. However, as the cluster grows over time, or to gain more resilience, you may want to add more JournalNodes.

As of the time of writing this article, Ambari does not provide a way to add JournalNodes to HDFS High Availability Architecture in a Hadoop cluster. The good thing is Ambari is RESTful!

Ambari’s RESTful API gives us capabilities that the Ambari web interface does not. The following is a brief overview of the entire process:

  • Identify a node which would be shouldering the responsibility of a JournalNode.
  • Assign the JournalNode role to the host using a POST API call.
  • Install the JournalNode on the new host using a PUT API call.
  • Update the “Shared Edits” location in hdfs-site (dfs.namenode.shared.edits.dir parameter) and add the new JournalNode.
  • Create the JournalNode directory on the host (including the directory for the HDFS NameService). Sync the contents of the current directory (basically all edits, edits_inprogress, epoch, committed txns). Make sure the ownership is set correctly (user hdfs and group hadoop).
  • If the cluster is secure, make sure you create the journalnode principal and HTTP principal (which should be present in SPNEGO) for this host. Ensure the keytab files are present and have the right ownership and permissions.
  • Finally, start the JournalNode on the new host and restart HDFS components.

We will now discuss each of the above steps in detail.

Please note that I will be using “placeholder” values in these commands. You will have to replace them with the values for your setup. The placeholders to replace are:

  • admin:admin (the username and password you’ve set for Ambari; I am using the default username and password in my examples).
  • CLUSTER_NAME (the name you have set for your cluster in Ambari).
  • NEW_JN_NODE (the hostname of the host you have added; use the FQDN or the short name, whichever the host was added to Ambari with).

Once you identify the host which will be your new JournalNode host, you can run the following set of curl commands. I prefer to work on the host where the Ambari server is running, so when I make an API call to Ambari, the host address in the curl command is ‘localhost’.
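Before making any changes, it can help to keep the placeholders in variables and confirm that Ambari already knows the new host. The following is only a sketch: the function and variable names are hypothetical, and it assumes Ambari answers a GET on the host resource with HTTP 200 when the host is part of the cluster.

```shell
#!/bin/sh
# Placeholder values from this article; replace them with your own.
AMBARI_URL="http://localhost:8080"
CLUSTER_NAME="CLUSTER_NAME"
NEW_JN_NODE="NEW_JN_NODE"

# URL of a host resource inside the cluster (hypothetical helper).
host_url() {
    echo "$AMBARI_URL/api/v1/clusters/$CLUSTER_NAME/hosts/$1"
}

# URL of the JOURNALNODE component on a host (hypothetical helper).
jn_component_url() {
    echo "$(host_url "$1")/host_components/JOURNALNODE"
}

# Succeeds if a GET on the host resource returns HTTP 200.
# Not called here; run it yourself against your Ambari server.
host_is_registered() {
    code=$(curl -s -o /dev/null -w '%{http_code}' -u admin:admin \
        -H 'X-Requested-By: Ambari' "$(host_url "$1")")
    [ "$code" = "200" ]
}

# Show the URL that the POST and PUT calls will target.
jn_component_url "$NEW_JN_NODE"
```

A possible use is `host_is_registered "$NEW_JN_NODE" && echo "host is known to Ambari"` before running the POST call.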

Assign JournalNode

Assign the role of JournalNode using the following command:

curl -u admin:admin -H 'X-Requested-By: Ambari' -X POST http://localhost:8080/api/v1/clusters/CLUSTER_NAME/hosts/NEW_JN_NODE/host_components/JOURNALNODE

Install JournalNode

Now go ahead and install the JournalNode.

curl -u admin:admin -H 'X-Requested-By: Ambari' -X PUT -d '{"RequestInfo":{"context":"Install JournalNode"},"Body":{"HostRoles":{"state":"INSTALLED"}}}' http://localhost:8080/api/v1/clusters/CLUSTER_NAME/hosts/NEW_JN_NODE/host_components/JOURNALNODE

Update HDFS Configuration

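Log in to the Ambari Web UI and modify the HDFS configuration. Search for dfs.namenode.shared.edits.dir and add the new JournalNode. Make sure you don’t break the format of the JournalNode list: the hosts are separated by semicolons and followed by the nameservice name. A typical three-JournalNode shared edits definition has the following format (8485 is the default JournalNode port):

qjournal://JN_HOST_1:8485;JN_HOST_2:8485;JN_HOST_3:8485/NAMESERVICE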
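The shared edits value is a qjournal:// URI, and it is easy to get the separators wrong when editing it by hand. Here is a small sketch that builds the value from a host list. The hostnames are hypothetical, “MyLab” is the nameservice used in my examples, and the function name is made up for illustration. It assumes the default JournalNode port 8485; adjust it if your cluster overrides the port.

```shell
#!/bin/sh
# Build a qjournal:// URI from a nameservice ID and a list of JournalNode hosts.
# Assumes every JournalNode listens on the default port 8485.
shared_edits_uri() {
    nameservice=$1
    shift
    hosts=""
    for h in "$@"; do
        # Join host:port pairs with semicolons (no leading separator).
        hosts="${hosts:+$hosts;}$h:8485"
    done
    echo "qjournal://$hosts/$nameservice"
}

# Three existing JournalNodes plus the new one (hypothetical hostnames).
shared_edits_uri MyLab jn1.example.com jn2.example.com jn3.example.com new-jn.example.com
```

The printed string is what you would paste into dfs.namenode.shared.edits.dir after checking it against the value currently shown in Ambari.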



Create JournalNode Directories

Time to create the required directory structure on the new JournalNode. The directory structure depends on your cluster installation. If unsure, look up the value of the dfs.journalnode.edits.dir parameter in the $HADOOP_CONF/hdfs-site.xml file. In my case it happens to be /hadoop/qjournal/namenode/.

Make sure you add the HDFS nameservice directory. Its name is the value of the dfs.nameservices parameter in the $HADOOP_CONF/hdfs-site.xml file. In my example it is “MyLab”, so I will create the directory structure as /hadoop/qjournal/namenode/MyLab.

Copy or sync the ‘current’ directory under the ‘shared edits’ location from an existing JournalNode. Make sure that the ownership of all of these newly created directories and synced files is correct (user hdfs and group hadoop).
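The directory preparation can be sketched as a pair of shell functions. This is an illustration under a few assumptions: the function names are hypothetical, the ‘current’ directory from an existing JournalNode has already been fetched to a local staging path (for example with scp or rsync), and the ownership step runs as root on a host where the hdfs user and hadoop group exist.

```shell
#!/bin/sh
# Create <edits_dir>/<nameservice> and copy a staged 'current' directory into it.
# $1 = staged copy of an existing JournalNode's 'current' directory
# $2 = value of dfs.journalnode.edits.dir, $3 = value of dfs.nameservices
prepare_jn_dir() {
    src_current=$1
    target="$2/$3"

    mkdir -p "$target" || return 1
    # -R copies recursively, -p preserves modes and timestamps.
    cp -Rp "$src_current" "$target/" || return 1
    echo "$target/current"
}

# Hand the whole tree to the hdfs user and hadoop group (needs root).
# $1 = value of dfs.journalnode.edits.dir
fix_jn_ownership() {
    chown -R hdfs:hadoop "$1"
}

# Example with the values used in this article (not executed here):
#   prepare_jn_dir /tmp/jn-staging/current /hadoop/qjournal/namenode MyLab
#   fix_jn_ownership /hadoop/qjournal/namenode
```

If you prefer not to read hdfs-site.xml by hand, `hdfs getconf -confKey dfs.journalnode.edits.dir` and `hdfs getconf -confKey dfs.nameservices` print the two values on a host with the Hadoop client configuration installed.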

Kerberos Enabled Cluster Or Secure Cluster Only

If you don’t have a secure cluster, skip to the next section.

As mentioned above, the process for a secure cluster is the same, with the additional requirement that the right principals and their keytab files are present. Based on the journalnode principal and HTTP principal that you’ve defined for the JournalNode in your cluster, create the principals on the KDC for your cluster. Get the keytabs for the principals.

Verify that the keytab actually works using klist & kinit.
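A keytab check with klist and kinit might look like the sketch below. The keytab path and the jn/HOST@REALM naming are assumptions based on common HDP-style defaults, and the function names are hypothetical; use the principal and keytab values configured for your own cluster.

```shell
#!/bin/sh
# Assumed default location; check your cluster configuration for the real path.
JN_KEYTAB="/etc/security/keytabs/jn.service.keytab"

# Build a JournalNode principal name. $1 = host FQDN, $2 = Kerberos realm.
jn_principal() {
    echo "jn/$1@$2"
}

# Verify a keytab. $1 = keytab file, $2 = principal stored in it.
verify_keytab() {
    klist -kt "$1" || return 1        # list the entries stored in the keytab
    kinit -kt "$1" "$2" || return 1   # obtain a ticket using the keytab
    klist                             # show the resulting ticket cache
}

# Example (not executed here; EXAMPLE.COM is a placeholder realm):
#   verify_keytab "$JN_KEYTAB" "$(jn_principal "$(hostname -f)" EXAMPLE.COM)"
```

The same check can be repeated for the HTTP principal and its keytab.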

Moment of Truth!

Time to fire up the JournalNode on the new host. It’s the moment of truth!

Using Ambari Web UI or API call, start the JournalNode service.

I prefer the Ambari API. I would use something like this.

curl -u admin:admin -H 'X-Requested-By: Ambari' -X PUT -d '{"RequestInfo":{"context":"Start JournalNode"},"Body":{"HostRoles":{"state":"STARTED"}}}' http://localhost:8080/api/v1/clusters/CLUSTER_NAME/hosts/NEW_JN_NODE/host_components/JOURNALNODE

Again, don’t forget the substitutions I described for the curl commands above.
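The PUT call only submits a request, so it is worth confirming the component state afterwards. Here is a minimal sketch, assuming the host component resource reports its state under HostRoles/state. The helper names are hypothetical, and the sample JSON is a trimmed illustration, not captured output.

```shell
#!/bin/sh
# Extract the HostRoles state from a host_components JSON response on stdin.
component_state() {
    # Drop spaces and newlines so the key/value pair has a predictable shape.
    tr -d ' \n' | sed -n 's/.*"state":"\([A-Z_]*\)".*/\1/p'
}

# Ask Ambari for the JournalNode state. $1 = cluster name, $2 = host name.
# Not called here; run it against your own Ambari server.
fetch_jn_state() {
    curl -s -u admin:admin -H 'X-Requested-By: Ambari' \
        "http://localhost:8080/api/v1/clusters/$1/hosts/$2/host_components/JOURNALNODE?fields=HostRoles/state" \
        | component_state
}

# Illustrative response fragment (made up for this example).
sample='{ "HostRoles" : { "component_name" : "JOURNALNODE", "state" : "STARTED" } }'
echo "$sample" | component_state
```

Running `fetch_jn_state CLUSTER_NAME NEW_JN_NODE` a few times after the PUT call should eventually report STARTED if the start succeeded.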

Restart the HDFS components for the changes to take effect. With the HDFS High Availability setup, restarting the NameNodes one after the other should do this with no outage. If you can afford downtime, you can instead restart all HDFS components at once.

Verify and cross-check that the edits, epochs, and committed txns are written across all the JournalNodes.
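One rough way to cross-check is to look at the newest edits segment in each JournalNode’s ‘current’ directory and compare the names across hosts. The sketch below assumes the usual edits_<start>-<end> and edits_inprogress_<start> file naming with zero-padded transaction IDs; the function name is hypothetical.

```shell
#!/bin/sh
# Print the newest edits segment in a JournalNode 'current' directory.
# $1 = path to the 'current' directory
latest_segment() {
    # With byte-wise ordering, finalized segments sort by their zero-padded
    # start txid and an edits_inprogress_* segment sorts after all of them.
    ls "$1" | grep '^edits_' | LC_ALL=C sort | tail -n 1
}

# Example with the path used in this article (not executed here):
#   latest_segment /hadoop/qjournal/namenode/MyLab/current
```

Run it on each JournalNode host; once the new JournalNode is receiving writes, it should report the same in-progress segment as the others.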

Voila! There you have it! A master node with a JournalNode on it.

zData & Apache Ambari

We at zData realize the potential of Ambari. zData is proud to provide the zData Ambari Stack, which includes Pivotal HAWQ and Greenplum. You can find out more about the zData Ambari Stack here.