Apache Ambari Overview

In this article, we’ll cover the features, benefits, and functionality of the Apache Ambari project. By the end, you’ll see just how valuable the Ambari framework is, and hopefully you’ll be convinced to give it a try!

About Ambari

Apache Ambari is a complete open-source tool and framework for provisioning, managing, and monitoring Apache Hadoop clusters[1]. It brings everything in the Hadoop ecosystem under one roof, either through its easy-to-use, web-based user interface or through its collection of RESTful APIs. Ambari’s web interface was built with simplicity in mind: the goal is to make provisioning, managing, and monitoring as easy as possible. Under the hood, the web interface is simply calling the Ambari APIs, which is where the magic really happens. Those same APIs can be used to automate a cluster installation with absolutely zero user interaction.

Ambari is designed with a “server-agent” architecture. A single Ambari server is installed and run on one host; it is the single entry point to the cluster, serves the web user interface, and provides Ambari’s RESTful APIs. Agents are installed during the provisioning step on each specified host in the cluster. The server then talks to the agents to carry out tasks like installing new services and managing the cluster.
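
To make that split concrete, here is roughly what the two daemons look like from the command line. This is just a sketch; the exact setup prompts and service names can vary by Ambari version:

 # On the Ambari server host: set up and start the server daemon
 ambari-server setup
 ambari-server start

 # On a cluster host (normally done for you during provisioning):
 ambari-agent start
 ambari-agent status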

Ambari takes its name from an Indian word for the seat one rides in atop an elephant[2]. You can think of Ambari as the ruler of the Hadoop stack, managing everything from above.

Provisioning

Manually provisioning several servers in a cluster is quite a task. Provisioning typically includes installing the software dependencies needed on each host, setting up user and service accounts, setting and modifying configurations, and starting services, just to name a few steps. That doesn’t sound bad until there are several hundred or thousand hosts to provision. That would take days! A mistyped value or incompatible software versions could push the delivery date out even further while the installer tries to track down why things aren’t working properly. Manual provisioning simply does not scale to an enterprise cluster. That’s where Ambari provisioning comes into play. Ambari’s solution to the manual provisioning problem is a simple step-by-step installation wizard that walks you through the provisioning process of a cluster. Ambari makes what is usually the longest and most tedious step the easiest: pick the hosts you want to initially use for the cluster, select the services to install (HDFS, HBase, Pig, ZooKeeper, etc.), specify which hosts should act as master, client, or slave for those services, review the installation configuration, and launch the install. At each step, Ambari runs several checks to notify you when something doesn’t go as planned.

Install Ambari Agents

Initially, the installer needs to choose whether to provide a private RSA key for either the root account or an account that can run the `sudo` command. The account must exist on each host prior to this step. The Ambari server uses the RSA key to `ssh` into each host and install the Ambari agent. If any problems arise, like a host being unavailable or the account not existing, Ambari’s web interface provides detailed per-host logs explaining why the agent installation couldn’t finish. A “Retry Failed Hosts” button lets the installer try to reinstall the agent on the failed hosts after determining the cause of failure. If for whatever reason the installer cannot provide a private RSA key for the automatic installation of Ambari agents, there is another option: manually installing the agents before moving forward (sketched below). The installer will also need to select a Stack, which is covered further down.
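
For the manual route, the rough shape on a yum-based host looks like this. Treat it as a sketch: the package name, config path, and default hostname value may differ across Ambari versions and operating systems:

 # Install the agent package (assumes the Ambari repository is configured)
 yum install -y ambari-agent

 # Point the agent at the Ambari server (the default is hostname=localhost)
 sed -i 's/hostname=localhost/hostname=your.ambari.server/' \
   /etc/ambari-agent/conf/ambari-agent.ini

 # Start the agent so it registers with the server
 ambari-agent start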

Choose Services, Assign Masters, Select Clients & Slaves

Once the agents are installed on each host, the installer chooses the services to install, assigns master roles to specific hosts, and selects which hosts will be slaves and clients. Services can also be added or removed at any time through the web interface or API calls (see the sketch below), in case the installer doesn’t want to install everything at once or forgets a specific service. Assigning masters, like which host will be the name node or secondary name node, is done with select boxes. Ambari often fills in a default recommendation for which hosts should run certain masters, but this can easily be changed. Slaves and clients are configured last with checkboxes: the installer checks which hosts will run which clients and slaves, for example, running as a data node or running an HDFS client.
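
Adding a service later can be scripted against the API. A rough sketch of the shape of those calls (admin:admin is a placeholder login, the intermediate component and host-assignment calls are omitted, and exact payloads vary by Ambari version):

 # Register a new service (Pig, for example) with the cluster
 curl --user admin:admin -H 'X-Requested-By: ambari' -X POST \
   http://{your.ambari.server}/api/v1/clusters/{cluster-name}/services/PIG

 # After registering its components and host assignments, install it
 curl --user admin:admin -H 'X-Requested-By: ambari' -X PUT \
   -d '{"RequestInfo":{"context":"Install Pig"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
   http://{your.ambari.server}/api/v1/clusters/{cluster-name}/services/PIG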

Customize, Review, and Install

The last step before the installation actually happens is customization. The only required action during this step is to provide an email address and some authentication info for setting up Nagios and Ganglia (the services required for monitoring) before moving on to the installation itself. However, Ambari lets the installer edit any configuration should there be a need to change, fine-tune, or tweak the default settings of any service. A final review of the installation shows the cluster layout, and then the Ambari server begins installing the services across the cluster, configuring everything along the way. Ambari starts the services after they have been installed and runs a final health check to make sure the installed services are working properly.

Scope of Ambari Provisioning

Does Ambari intend to replace the more generic provisioners such as Puppet, Salt, or Chef for the entire cluster? In short, no. The scope of Ambari’s provisioner is strictly to install and configure Hadoop services. The Ambari framework is not equipped for provisioning steps outside the Apache Hadoop ecosystem. The provisioning process within the Hadoop ecosystem can be automated with the Ambari API, but some provisioning steps outside of Ambari need to happen before Ambari can do its Hadoop-specific provisioning magic.

Managing

Let’s face it, there’s a lot of work involved in managing a large Hadoop cluster! Just like provisioning, managing a cluster can sometimes be a pain in the butt and time-intensive. Services can be added or removed, hosts can be added or removed, user authentication is another big one on the list, and there are all the fine configuration changes and tweaks to different services along the way. It just so happens that Ambari excels in this space as well. But before I begin explaining how Ambari’s a rockstar at managing a cluster, it’s important to understand the concept of lifecycle management.

Lifecycle Management

Ambari’s excellent management capability is centered on lifecycle management. Any service that has been integrated to work with Ambari responds to a defined set of lifecycle commands (start, stop, status, install, configure), which gives Ambari the flexibility to add, remove, or reconfigure services at any time. That is what lifecycle management is! The cluster can have services added or removed at any time as its usage adapts. Maybe you want to try a new service out on the cluster but decide you don’t need it and uninstall it, or perhaps you didn’t install all the services you needed during the initial cluster provisioning. Ambari supports managing any of these services through the web interface or through its API, as the sketch below shows.
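
Those lifecycle commands surface directly in the API as state changes. Stopping and starting HDFS looks roughly like this (a sketch; admin:admin and the cluster name c1 are placeholders):

 # Stop HDFS by requesting the INSTALLED (i.e., stopped) state
 curl --user admin:admin -H 'X-Requested-By: ambari' -X PUT \
   -d '{"RequestInfo":{"context":"Stop HDFS"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
   http://{your.ambari.server}/api/v1/clusters/c1/services/HDFS

 # Start it again by requesting the STARTED state
 curl --user admin:admin -H 'X-Requested-By: ambari' -X PUT \
   -d '{"RequestInfo":{"context":"Start HDFS"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' \
   http://{your.ambari.server}/api/v1/clusters/c1/services/HDFS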

Lifecycle management in Ambari was designed out of necessity, since software now changes faster than ever. New services that fit into the Hadoop ecosystem are popping up everywhere, and a Hadoop cluster provisioned a year ago might now be used for something that wasn’t even possible at the time of initial provisioning.

Services, Components, and Hosts

So what can you actually DO with Ambari’s management capabilities? After logging into the web interface, there’s a dashboard where, at a glance, you can see which services are started or stopped, a plethora of reporting widgets and heatmaps for monitoring, and a log of anything Ambari is currently working on in the background. The following list covers some of the most important management capabilities that Ambari provides (an API sketch follows the list):

  • Stop, start, restart, add, or remove services
  • Add hosts to or remove hosts from a cluster
  • Put specific hosts or the entire cluster in maintenance mode
  • Move a name node or secondary name node to a different host
  • Restart the entire cluster using rolling restarts
  • Run service checks to verify services are running and responding correctly
  • Decommission or recommission data nodes
  • Edit service and component configurations
  • Roll back configurations
  • View the history of past configuration changes
  • Restart services after a configuration change
  • Define host configuration groups for easier management
  • Search for specific hosts by name, IP address, hardware specs, or installed services
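
As a taste of how these map onto the API, here is a hedged sketch of one of them, putting a single host into maintenance mode (the login, cluster, and host names are placeholders):

 curl --user admin:admin -H 'X-Requested-By: ambari' -X PUT \
   -d '{"RequestInfo":{"context":"Maintenance"},"Body":{"Hosts":{"maintenance_state":"ON"}}}' \
   http://{your.ambari.server}/api/v1/clusters/c1/hosts/host1.example.com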

Various services also provide more specific management capabilities. For example, for the HDFS service, you can view the name node logs or the thread stacks, or go directly to the HDFS name node web user interface. For the MapReduce service, you can view the job history, job logs, and currently running jobs through its own UI.

Host management is very friendly. New hosts can be added in another easy step-by-step wizard. Ambari also gives a per-host breakdown of disk usage, RAM, load average, installed components, and specific configurations. Much of the host management also ties into the monitoring capability powered by Nagios and Ganglia.

User Authentication

By default, Ambari uses a single-user login for management, but it fully supports integration with LDAP or Active Directory during the setup of the Ambari server. After the initial provisioning, Kerberos can also be set up through the web interface to work with Ambari. A rough sketch of the LDAP route follows.
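
The LDAP setup is driven from the server host. As a sketch (the exact prompts vary by Ambari version):

 # Run on the Ambari server host; prompts for the LDAP URL, base DN, bind user, etc.
 ambari-server setup-ldap

 # Restart the server so the new authentication settings take effect
 ambari-server restart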

Monitoring

Ambari’s monitoring tools build on two other open-source projects: Ganglia and Nagios. Both come preconfigured with Ambari out of the box once a cluster has been provisioned.

Ganglia

Ganglia is used for monitoring, trend analysis, and metrics collection in the cluster. Ambari’s web interface leverages Ganglia to provide metric views in the form of customizable widgets. Custom metrics can also be created through another feature called Ambari Views. Here’s just a sample of some of the widgets provided:

  • HDFS Disk Usage
  • DataNodes Live
  • HDFS Links
  • Memory Usage
  • Network Usage
  • CPU Usage
  • Cluster Load
  • NameNode Heap
  • NameNode RPC
  • NameNode CPU WIO
  • NameNode Uptime

Ambari also utilizes Ganglia for detailed heatmaps, which are a great way to quickly see which hosts are using too many resources or are outside an acceptable threshold for a monitored value (a sketch for pulling these metrics through the API follows the list). Here is a list of some of the heatmaps:

  • Host Disk Space Used
  • Host Memory Used
  • Host CPU WIO
  • HDFS Bytes Read & Written
  • Garbage Collection Time
  • JVM Heap Memory Used
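
The Ganglia-backed numbers behind these widgets and heatmaps can also be pulled over the API using partial responses. A hedged sketch (the metric category names are illustrative and vary by version):

 # Ask for current CPU and memory metrics on one host (names are placeholders)
 curl --user admin:admin \
   'http://{your.ambari.server}/api/v1/clusters/c1/hosts/host1.example.com?fields=metrics/cpu,metrics/memory'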

Nagios

Nagios is primarily used for health checks and alerting. During the Ambari installation wizard, the installer must provide a support email address for Nagios alerts. These alert emails include the notification type, service, host, host address, state, date, and any additional info. The email below is an example of an alert for missing blocks in HDFS.

 Notification Type: PROBLEM
 
 Service: HDFS::Blocks health
 Host: c6402.ambari.apache.org
 Address: c6402.ambari.apache.org
 State: CRITICAL
 
 Date Time: Mon Sept 15 16:26:10 UTC 2014
 
 Additional Info:
 
 CRITICAL: missing_blocks:3, total_blocks:3

The next example is the email sent when the problem is resolved or no longer an issue.

 Notification Type: RECOVERY
 
 Service: HDFS::Blocks health
 Host: c6402.ambari.apache.org
 Address: c6402.ambari.apache.org
 State: OK
 
 Date Time: Mon Sept 15 16:28:10 UTC 2014
 
 Additional Info:
 
 OK: missing_blocks:0, total_blocks:3

Automation & Integration

Ambari was built from the ground up with these last two uses in mind. There are three pieces to automation and integration: Ambari Stacks, Ambari Blueprints, and the Ambari API.

Stacks

Ambari Stacks are a way to define a group of services together: a stack defines the set of available services that can be installed, where the service software packages can be found (repos), and service-specific information (HDFS, ZooKeeper, HBase, etc.). Each service has a set of scripts that adhere to the lifecycle commands (start, stop, install, configure, status) so that Ambari can call them when a request is made for that service from the API or web interface. Services inside a stack also contain meta information, in the form of XML files, that defines which scripts handle which commands, what components make up a service, dependencies, and default parameters. The sketch below shows the rough shape of a stack definition on disk.
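
Roughly, a stack definition is laid out like this (MyStack is a placeholder, and the exact layout varies between Ambari versions):

 stacks/
   MyStack/
     1.0/
       repos/repoinfo.xml        # where the service packages can be found
       services/
         HDFS/
           metainfo.xml          # components, dependencies, command-to-script mapping
           configuration/        # default parameters as XML files
           package/scripts/      # lifecycle scripts (install, start, stop, ...)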

The great thing about stacks is that they are extensible and support versioning. Each stack definition has a version and can inherit everything from a previously defined stack definition. If a base stack needs to be modified, a new stack can be defined that inherits everything else from it, which makes it easier to get started on a custom stack with new or custom services. And since a stack definition carries versions, it’s easy to keep track of incremental changes to services that have been modified or upgraded to a new version.

Anyone can create their own stack definition with their own custom services or custom flavors of the Hadoop services. Currently, Hortonworks provides its stack with Ambari by default, but that stack definition is completely separate from the Ambari project. As long as a new stack definition and its service definitions come with proper metadata and lifecycle scripts, Ambari should be able to provision, manage, and monitor them!

Blueprints

An Ambari Blueprint is a declarative way to describe a cluster installation from scratch so that it can be automated. It is a JSON file describing the configuration of the cluster and its associated services. A blueprint can contain configuration settings for specific services, but the heart of a blueprint is the mapping of all hosts to their associated services.

The relationship between a stack and a blueprint is that a blueprint uses a specific stack version to provision a cluster. A stack definition is just a collection of services, service-related scripts and configurations, and repository information; it contains no information about any specific deployment of itself. A blueprint is the map that describes how a cluster should actually be installed, host by host, service by service, and component by component, so the blueprint needs to know which stack to use.

Blueprints can be written by hand or even exported from an existing cluster and reused later. All it takes to automate a cluster install is registering a valid Ambari Blueprint with the Ambari server and then making an API call to start the installation [4].
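
A hedged sketch of that flow: a stripped-down blueprint saved as ambari-blueprint.json (the stack version, host group names, and component lists are placeholders), followed by the call that registers it under the name small-cluster:

 {
   "Blueprints": { "stack_name": "HDP", "stack_version": "2.1" },
   "host_groups": [
     { "name": "master", "cardinality": "1",
       "components": [ {"name": "NAMENODE"}, {"name": "ZOOKEEPER_SERVER"} ] },
     { "name": "workers", "cardinality": "3",
       "components": [ {"name": "DATANODE"}, {"name": "HDFS_CLIENT"} ] }
   ]
 }

 # Register the blueprint with the Ambari server under the name "small-cluster"
 curl --user admin:admin -H 'X-Requested-By: ambari' -X POST \
   -d @ambari-blueprint.json http://{your.ambari.server}/api/v1/blueprints/small-cluster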

API

The API can be used to automate anything related to provisioning and managing, or to integrate Ambari’s power into other existing systems. The web interface strictly uses Ambari’s APIs for everything you see on the screen and everything happening in the background; the interface simply ties everything together and displays the information in a presentable way. Almost anything, if not everything, a user can do through the web UI can be done using the Ambari API.

Below are some examples of what you can do with the API:

  • Get access to monitoring and metrics information
  • Get resource usage of specific services
  • Create, delete, and update services
  • Start and stop services
  • Delete an entire cluster
  • Query the cluster with parameters

With the API, users can integrate Ambari into other existing software systems such as Microsoft System Center or Teradata Viewpoint, as well as plugins like OpenStack Sahara. This lets Ambari users continue using one software system instead of switching back and forth between multiple tools and systems.

To make calls to the API, send an Authorization: Basic header containing the Base64-encoded username and password along with the rest of the request; curl’s --user flag handles this for you. An example call to the API with curl:

curl --user name:password http://{your.ambari.server}/api/v1/clusters
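
The same pattern extends to the query parameters mentioned above: the API supports partial responses, so you can ask for just the fields you care about. A hedged sketch (c1 and HDFS are placeholders):

 # Ask only for the current state of the HDFS service in cluster "c1"
 curl --user name:password \
   'http://{your.ambari.server}/api/v1/clusters/c1/services/HDFS?fields=ServiceInfo/state'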

The API can be combined with Ambari Stacks and Blueprints to perform a completely automated cluster installation: provide a blueprint file that defines which stack to use, then make an API call to take care of the rest! Below is an example of how to do this:

 curl --user name:password -H 'X-Requested-By: ambari' -X POST \
   -d @ambari-blueprint.json http://{your.ambari.server}/api/v1/clusters/{cluster-name}

In this command, we use curl to POST a request to the Ambari RESTful API, provisioning the cluster from an Ambari Blueprint definition (ambari-blueprint.json).

Summary

We’ve covered the main uses of Ambari and seen the advantages of such a tool. Ambari provides a single point of entry into a Hadoop cluster to provision, manage, monitor, integrate, and automate anything in the Hadoop ecosystem. It simplifies and speeds up provisioning by taking care of the dependencies between software packages and starting services in the correct order. It makes administration a breeze: adding and removing hosts or services, or starting, stopping, and restarting services, takes a click of a button or an API call. Ambari’s cluster monitoring comes predefined, so metrics and alerts work immediately after the install, saving the time and work usually needed to set up monitoring with a separate tool. Lastly, managing, monitoring, and provisioning are all possible through Ambari’s web interface and its powerful API, and the integration and automation capabilities of that RESTful API allow other tools and software systems to leverage the power of Ambari and let developers automate cluster installations.

References