By Brian de la Motte and Jonas Bull


Hadoop administration has historically been labor- and knowledge-intensive. Manual configuration processes made it difficult and time-consuming to reproduce an environment. But you don’t have to sacrifice flexibility for automation when installing a Hadoop cluster. We will show that with a little ingenuity and a few more tools in your tool bag, Hadoop clusters can be fully automated from zero to production-ready while staying flexible enough to swap the underlying node sizes at will. Our approach has been battle-tested and repeated in the field: in the last year we have launched and relaunched Hadoop clusters over five hundred times on different hardware configurations with great success.

Note that this is an advanced topic, and we assume prior knowledge in several places. This is meant as a high-level template for experienced AWS and Hadoop admins to make their Hadoop deployments faster, more flexible, and more secure. Although we talk about AWS a lot, our solution was built to be flexible across any size and type of machine – on-premises, cloud, AWS, Azure, VMs, etc.

The keys to making this work are choosing the right tools, then leveraging and extending them.

Ambari Blueprints – The Good, the Bad, and the Ugly

As many Hadoop admins know, building a secure, production-ready Hadoop cluster can be quite a task. The Apache Ambari project, the open-source provisioner, manager, and monitor of Apache Hadoop clusters, makes the task automatable and reproducible, but not very flexible. Ambari’s cluster blueprint and API functionality let you codify your Hadoop cluster and then install it automatically using API calls, but the blueprint functionality is rigid. To automate a cluster build, the Hadoop admin must install a cluster using Ambari’s wizard-like UI, configure any custom changes manually, and export the blueprint, which then becomes the basis for automation. The blueprint is JSON and contains every configuration value, even the defaults – it is common for an exported blueprint to be 6.5K lines long! If you want to change the location of a file (such as where the Java keystores for SSL live) or change a memory setting, you must change it in multiple places in the blueprint. The blueprint is bloated with default values that don’t need to be there at all, because Ambari’s stack recommender service adds them for you. Yet another roadblock we didn’t like: the memory, CPU, and storage values get baked in and do not adapt to the instances. Even changing the number of instances in the cluster could be painful – and forget about adding or removing a service!
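To make the pain concrete, here is a hypothetical sliver of an exported blueprint (property names are real HDP settings, but the paths, host group, and stack version are illustrative). Note the same keystore path appearing in two different config sections – in a real export this kind of repetition happens dozens of times:

```json
{
  "configurations": [
    {"ssl-server": {"properties": {
      "ssl.server.keystore.location": "/etc/security/serverKeys/keystore.jks"}}},
    {"ranger-admin-site": {"properties": {
      "ranger.https.attrib.keystore.file": "/etc/security/serverKeys/keystore.jks"}}}
  ],
  "host_groups": [
    {"name": "master_1", "components": [{"name": "NAMENODE"}], "cardinality": "1"}
  ],
  "Blueprints": {"stack_name": "HDP", "stack_version": "2.6"}
}
```

Multiply this by every config section Ambari knows about and you get the 6.5K-line monster.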

We wanted to automate a Hadoop cluster, but we also wanted the option to quickly change the size of the AWS EC2 instances. For example, we might build on AWS with i2.8xlarges, find out we really don’t need the horsepower, and rebuild the whole cluster with i2.4xlarges instead. Well, to do that, we would have to reconfigure the memory and CPU settings in dozens of places in the 6.5K-line bloated blueprint and launch again. Sorry, but we don’t have time for that! Who wants to maintain a file that huge when most of it is default values? We don’t like boring work like that (and at least one of us is basically lazy), so we created a workflow that lets us have our cake and eat it too.

The Solution(s) to Our Problems – YAML, Templating, and Stack Recommender!

There are three things we tweaked to get both flexibility and automation. First, we transformed our Ambari blueprint from JSON to YAML, which gave us some very nice benefits. Second, we converted the YAML file into a Jinja2 template to support variables and conditional logic for configuration settings and values. Lastly, we set all node-dependent settings – CPU, memory, and storage values – at install time by leveraging Ambari’s stack-recommender service BEFORE we even launch the blueprint.

Flexible Idea #1 – YAML > JSON

It just doesn’t feel good to check a bloated blueprint file into source control when you know most of it is garbage default values. The sheer size of the file is too much to hold in your mind, and the output is hard to read and impossible to comment. When building our automated cluster, we knew other people would eventually take over the project and have to edit the Ambari blueprint. If we didn’t look forward to editing a huge JSON file, no one else would either. JSON is not too bad to read in small files, but once all the whitespace is gone and there are chunks of large embedded documents as values, it gets hard to add or remove items.

The solution we came up with was a one-time transform of the JSON blueprint into YAML, making that the version-controlled file for the cluster configuration. This gave us greater readability for such a large file, plus one thing we really wanted: in-line commenting in the Ambari blueprint! Now when we set a value in our YAML-based blueprint, we can add a comment that says exactly why it’s set. When you have every possible security knob turned on a Hadoop cluster, it is very nice to see WHY something is set. Also, because we could read the blueprint more easily, we could better identify default or redundant settings, leave them out, and get the blueprint down from 6.5K lines to under 1,000.
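As a sketch of what that self-documenting YAML looks like (the property names are real HDP settings, but the comments, values, and hostname are illustrative):

```yaml
configurations:
  - hdfs-site:
      # SASL-protected DataNode transport, so Kerberized DataNodes do not
      # need to bind privileged ports
      dfs.data.transfer.protection: privacy
  - core-site:
      # Let the Knox gateway impersonate end users (required for Knox + Kerberos)
      hadoop.proxyuser.knox.groups: users
      hadoop.proxyuser.knox.hosts: gateway01.example.com
```

Six months later, nobody has to reverse-engineer why a security knob was turned.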

Of course, Ambari doesn’t support YAML blueprints, so we convert our YAML-based blueprint back to JSON and post it via the Ambari API to begin the automated install.
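A minimal sketch of that convert-and-post step in Python, assuming PyYAML is available; the host name and credentials are made up, and the actual send is left commented out. The `X-Requested-By` header is mandatory for the Ambari REST API:

```python
import base64
import json
import urllib.request

import yaml  # PyYAML


def yaml_blueprint_to_json(yaml_text):
    """Convert the version-controlled YAML blueprint back to the JSON Ambari expects."""
    return json.dumps(yaml.safe_load(yaml_text), indent=2)


def register_blueprint(ambari_url, name, blueprint_json, user="admin", password="admin"):
    """Build the POST request that registers a blueprint with Ambari."""
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{ambari_url}/api/v1/blueprints/{name}",
        data=blueprint_json.encode(),
        method="POST",
        headers={
            "X-Requested-By": "ambari",   # required by the Ambari REST API
            "Content-Type": "application/json",
            "Authorization": f"Basic {creds}",
        },
    )

# To actually kick off the install:
#   urllib.request.urlopen(register_blueprint("http://ambari01:8080", "prod", bp_json))
# then POST a cluster creation template to /api/v1/clusters/<cluster-name>.
```

The second API call (the cluster creation template mapping host groups to real hosts) follows the same pattern.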

Flexible Idea #2 – Jinja2 Templates

We chose Ansible because our client wanted us to use the framework they already had in place, but any scripting language or provisioning tool that supports variable replacement could be used to do the same thing. Ansible has been working for us for almost a year without any issues.

Once we YAML-ized our Ambari blueprint and did a dozen more installs using our first idea, we realized how many settings are environment-specific. Our blueprint didn’t support new environments because hostnames, IP addresses, and passwords were all hardcoded. When we were asked to spin up a second cluster in another VPC, Kerberized against a different IdM server with a different one-way trust to Active Directory, we were back to editing big files and changing the same values in multiple places. What do you do when the same value is used multiple times and you need to support different settings? You make variables!

We added one more step to our process: a YAML Jinja2 template. The template is rendered by passing in values for the variables it references, so ‘{{ keystore_password }}’ in the Jinja2 template is replaced with the value of that variable.
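For illustration – the variable names and values here are our own conventions, not anything Ambari mandates – a template excerpt and a matching per-environment variables file might look like:

```yaml
# blueprint.yaml.j2 (excerpt)
configurations:
  - ranger-admin-site:
      ranger.https.attrib.keystore.file: "{{ keystore_path }}"
      ranger.service.https.attrib.keystore.pass: "{{ keystore_password }}"
  - kerberos-env:
      realm: "{{ kerberos_realm }}"
      kdc_hosts: "{{ kdc_host }}"
```

```yaml
# vars/prod.yml (illustrative)
keystore_path: /etc/security/serverKeys/keystore.jks
keystore_password: "{{ vault_keystore_password }}"  # pulled from Ansible Vault
kerberos_realm: EXAMPLE.COM
kdc_host: idm01.example.com
```

Swap in `vars/dev.yml` and the same template renders a blueprint for a completely different environment.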

This feature alone gave us so much flexibility because all we had to do for a new environment was make a file that has environment-specific variables set! Here are just a handful of examples we were able to make into variables:

– All hostnames and IP addresses

– Passwords – SSL passwords, Ranger passwords, Grafana passwords, database passwords, etc.

– SmartSense account info – account name, notification email, SmartSense ID

– Kerberos implementation – KDC hosts, admin server, realm, domain

– Data directories

– LDAP settings for Ranger User Sync

By the end of this refactoring, almost every setting in the Ambari blueprint used variables. Nothing was hardcoded anymore. We could have different files for different scenarios. We could change the file of variable definitions and launch an entirely different cluster for a brand-new environment.

We couldn’t stop there, though. We got hooked on templating. In Jinja2, you can loop over items, include other files, and more. We organized the Ambari blueprint by separating each config section (core-site, hdfs-site, etc.) into its own Jinja2 file, grouped files for the same service into directories, and included them in the top-level blueprint. We even wrapped certain blocks of config-section includes with logic that says “include this configuration only when service XYZ is in the list of services to be installed.” This got our blueprint down to under 300 lines, and with very little Hadoop knowledge one can look at the template structure in smaller chunks, easily make changes, and push to version control.
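A sketch of the top-level template under this layout (the file names and the `cluster_services` variable are our conventions, not a standard):

```
{# blueprint.yaml.j2 - top level (structure is illustrative) #}
configurations:
{% include 'common/core-site.yaml.j2' %}
{% include 'common/hdfs-site.yaml.j2' %}
{% if 'RANGER' in cluster_services %}
{% include 'ranger/ranger-admin-site.yaml.j2' %}
{% include 'ranger/ranger-env.yaml.j2' %}
{% endif %}
```

Dropping Ranger from a cluster becomes a one-line change to `cluster_services` instead of surgery on a giant JSON file.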

This could be taken even further, as time and need allow. We could make a variable to Kerberize or not Kerberize a cluster, to include or not include SSL, to encrypt or not encrypt, to install or not install a service, and so on. The end goal: when you build out a cluster, you declare what features you want for this particular cluster using variables, and let the templating take over to render the JSON blueprint used for the cluster install. Are you excited yet?! This is awesome! We think so, anyway. But there’s one more missing piece…

Flexible Idea #3 – Leveraging the Stack Recommender

Everyone familiar with Ambari probably knows its stack recommender feature: Ambari notifies you when things look improperly set or could be optimized. A popup appears in the UI saying, for example, that a memory setting is too low or too high, your heap size is wrong, etc. These recommendations get better with every new stack release and every new service added to a stack. The problem is that you only get them after the initial install.

We thought about making every setting that depends on node resources a variable in the Ambari blueprint, but grew dismayed at the prospect of constantly tweaking those variables whenever we moved to a different instance type. Plus, even when we set them, Ambari usually told us after the install to increase or decrease certain values. Just as people generally can’t optimize as well as a compiler, a Hadoop admin can’t always optimize settings well enough to make Ambari happy on the first go-round.

Then a lightbulb went off for us – what if we ran the stack recommender on our blueprint BEFORE we actually installed the cluster? We could get all of the recommended settings from Ambari before the install, so that when we do install, everything comes up the recommended way! This proved to be a challenging venture, but it worked.

The not-so-basic first step was to install Ambari and register the Ambari agents on each host with the server, so that we could use the API to get information about each host: total memory, total CPUs, number of disks, and more. We wrote some fancy Python code that takes the Ambari blueprint, the cluster creation template, all our variables, and the API output, and feeds it into the stack recommender Python scripts that Ambari installs by default. The blueprint contains all of our settings minus anything host-size specific, and the recommender script outputs all the host-size-specific settings we should be using. We then merge the recommendations with our blueprint to create a dynamic blueprint and use that to install the cluster!
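The host-facts half of this is small. A sketch, assuming the standard Ambari hosts endpoint (`GET /api/v1/hosts?fields=Hosts/total_mem,Hosts/cpu_count,Hosts/disk_info`) and showing only the parsing of its JSON body; the helper names are ours:

```python
def parse_host_resources(api_response):
    """Index the JSON body of GET /api/v1/hosts?fields=Hosts/total_mem,
    Hosts/cpu_count,Hosts/disk_info by host name."""
    return {item["Hosts"]["host_name"]: item["Hosts"]
            for item in api_response["items"]}


def smallest_host(resources):
    """Return the (host, facts) pair with the least memory - a sanity check
    that the recommender is sizing for the weakest node."""
    return min(resources.items(), key=lambda kv: kv[1]["total_mem"])
```

These per-host facts are what get fed, along with the blueprint and cluster creation template, into the recommender scripts.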

Any setting output by the recommendation script that isn’t already in the blueprint gets merged into the dynamic blueprint. Anything already set in the blueprint is assumed to be an override (for cases where you want your setting to win over a conflicting recommendation). Because of this, our Apache Hadoop cluster comes up with no outstanding recommendations when we save!
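The merge itself is simple dictionary logic. A sketch, assuming both sides have been reduced to `{config-section: {property: value}}` maps (the function name is ours):

```python
def merge_with_overrides(recommended, blueprint):
    """Merge stack-recommender output into the blueprint; blueprint values win.

    Both arguments map a config-section name (e.g. 'yarn-site') to a dict of
    properties. Anything already in the blueprint is treated as a deliberate
    override of the recommendation.
    """
    merged = {}
    for section in recommended.keys() | blueprint.keys():
        # dict-unpacking order makes blueprint entries overwrite recommendations
        merged[section] = {**recommended.get(section, {}),
                           **blueprint.get(section, {})}
    return merged
```

The merged result is what finally gets converted to JSON and posted to Ambari.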

Now, the existing stack recommender doesn’t cover every possible service – Spark, for example, gets no recommendations. Recognizing this, we took the recommended formulas for setting Spark defaults and dynamically generate reasonable values, just like the stack recommender does.
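As a sketch of what that looks like – these are commonly cited rules of thumb for Spark-on-YARN sizing, not Ambari’s or Hortonworks’ exact formulas:

```python
def spark_executor_defaults(node_mem_mb, node_cores, cores_per_executor=5):
    """Rough Spark sizing from node resources (illustrative rules of thumb).

    Reserve one core and ~10% of memory for the OS and daemons, give each
    executor a fixed number of cores, split the remaining memory evenly, and
    hold back ~10% per executor as YARN memory overhead.
    """
    usable_cores = max(node_cores - 1, 1)
    executors_per_node = max(usable_cores // cores_per_executor, 1)
    usable_mem = int(node_mem_mb * 0.9)
    mem_per_executor = usable_mem // executors_per_node
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{int(mem_per_executor * 0.9)}m",
        "spark.yarn.executor.memoryOverhead": f"{int(mem_per_executor * 0.1)}m",
    }
```

Because the inputs are the same host facts the recommender already uses, these values rescale automatically when the instance type changes.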

Note – SmartSense sometimes still has recommendations, but only because SmartSense and the Ambari recommendations don’t always see eye to eye.

Tying it all Together – Ansible

With our three main tweaks, we can now install a Hadoop cluster dynamically on machines of any size, on any number of machines, with only minimal changes for environment-specific settings. We put all the steps into an Ansible playbook, so a single command runs the workflow above in a 100% automated and flexible way. All of this could be done without Ansible, but it acts as a nice glue. All sensitive info, such as passwords, is stored in Ansible Vault, and we have Ansible inventory files for each environment. To create a new environment, all you need is a new Ansible vault, variable files, and an inventory file. That’s it!
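The shape of that playbook, heavily abridged – task names, file paths, and variable names are illustrative, while `template` and `uri` are standard Ansible modules:

```yaml
# site.yml (abridged, illustrative)
- hosts: ambari_server
  vars_files:
    - vars/{{ env_name }}.yml
    - vault/{{ env_name }}.yml        # Ansible Vault: passwords, keystores
  tasks:
    - name: Render the Jinja2 blueprint template
      template:
        src: blueprint.yaml.j2
        dest: /tmp/blueprint.yaml

    - name: Register the JSON-converted, recommender-merged blueprint
      uri:
        url: "http://{{ inventory_hostname }}:8080/api/v1/blueprints/{{ cluster_name }}"
        method: POST
        user: admin
        password: "{{ vault_ambari_admin_password }}"
        force_basic_auth: yes
        headers:
          X-Requested-By: ambari
        body: "{{ lookup('file', '/tmp/blueprint.json') }}"
        body_format: json
        status_code: 201
```

The recommender-merge step sits between the two tasks as a script invocation; the inventory file supplies the hosts per environment.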

At the end of all of this, we have a one-click button in Ansible Tower that launches the instances, performs any necessary provisioning, launches the dynamic blueprint, and performs any post-install tasks. A fully production-ready cluster is online and ready to use – including Kerberos from the start, SSL encryption on all interfaces and for data in motion, and Vormetric transparent data encryption for data at rest. Post-install tasks such as changing the default Ambari admin password, setting up Ambari PAM authentication, Zeppelin PAM authentication, shiro.ini files, Zeppelin settings, and more are all handled too. The result: Hadoop users are ready to jump in and use the secure cluster as soon as our automation scripts finish. Even our HBAC rules and Ranger rules can be templated and put into the automation workflow!

At the end of all this, we have a build process that lets us sleep at night. We can rebuild entire clusters and single nodes consistently and reliably. We can do nearly seamless blue-green upgrades, and we can build a test or development environment that matches production – over and over. As long as we have our data backup plan in place, we don’t even have to worry about losing the entire cluster; we can rebuild in about an hour. We achieved this by choosing good tools and by leveraging and extending existing automation.