With the ever-increasing buzz around the Internet of Things (IoT), which is now slowly moving towards the Internet of Any Thing (IoAT), we need a solution to move data between various processing platforms. Apache NiFi provides the answer to this challenge.
The bright and shiny new data orchestration tool provided by Hortonworks – Hortonworks DataFlow (HDF) – is powered by Apache NiFi.
What is Apache NiFi
In a nutshell – Apache NiFi is a powerful, reliable, and easy-to-use tool to process and distribute data.
Originally designed by the NSA – yes, the NSA! – it was developed and used internally under the name Niagara Files. After the NSA had used it for around 8 years, the technology was open-sourced through the Apache Software Foundation via the NSA Technology Transfer Program. Note that this isn’t the first time the NSA has contributed to the open source community; previous contributions include Accumulo.
What is Hortonworks DataFlow (HDF)
Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real-time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.
Hortonworks DataFlow provides a perfect complement to Hortonworks Data Platform (HDP). Running HDF on top of HDP opens up multiple opportunities to design flows that grab and process many different forms of data.
Common Use Cases for Hortonworks DataFlow (HDF) or Apache NiFi
Here are some of the use cases HDF and Apache NiFi can be used for, though the possibilities aren’t limited to these. There is an ever-increasing number of “processors” (we will take a look at what this really means later in the article), which keeps extending their reach and range of use cases.
- Data Ingestion & Streaming
- IoAT Optimization
- Data Security
The purpose of this article is to provide a brief introduction to designing a DataFlow using HDF running on top of HDP.
For this example, I have used a standalone Hortonworks Data Platform installation, version 2.3, with Hortonworks DataFlow (NiFi) version 1.1.1.
Demo – Setup overview
Let’s dig into the demo.
Thankfully, Ambari provides the option of adding custom services, and I have installed the NiFi service through Ambari.
Here’s what my current cluster looks like.
NiFi is installed under the /opt directory and the service is listening on port 9090, so you can access the NiFi UI at a link like http://hdp-ambari-1.gagan.com:9090/nifi.
In my case, since it’s a standalone box, all services share the same hostname (hdp-ambari-1.gagan.com).
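Before opening the UI, you can sanity-check that the service is listening. Here is a minimal Python sketch; the hostname and port are the ones from my setup, so adjust both for your own cluster:

```python
import socket

NIFI_HOST = "hdp-ambari-1.gagan.com"  # hostname from my standalone box; change for your cluster
NIFI_PORT = 9090                      # the port the NiFi service listens on


def nifi_ui_url(host: str, port: int) -> str:
    """Build the NiFi UI address from host and port."""
    return f"http://{host}:{port}/nifi"


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


print(nifi_ui_url(NIFI_HOST, NIFI_PORT))
print("reachable:", port_open(NIFI_HOST, NIFI_PORT))
```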
Here’s how the interface looks to begin with.
Demo – Designing Simple DataFlow
The UI looks pretty neat with a blank canvas that you can start designing your data flow on.
We will design a DataFlow that pulls data from a local directory (/source_directory) and stores it in a directory inside HDFS (/destination_directory) on my HDP cluster.
The important concept to understand here is that each stage or process in a data flow is called a “Processor” in NiFi. This simple DataFlow has two processors:
1. First to get the file from the filesystem;
2. Second to put the file in HDFS.
At the time of writing this article there are 111 processors, covering a wide variety of processing, ingestion, and streaming tasks. This number is sure to increase in the future.
Let’s get started!
- Click the Processor icon in the top-left corner and drag it onto the “canvas”.
- A window will open to select the processor “Type”. Since we are going to get the file from the local filesystem, the easiest route is to click the “filesystem” tag on the left. Find the “GetFile” processor and double-click it.
- This will be the first Processor we configure. Right-click the newly created processor and click “Configure”.
- This is where you provide all the required details. Click the “Properties” tab. Note that any property shown in bold is a required field for the processor.
- We will keep things simple and only provide a value for “Input Directory”, which in our case is “/source_directory”. If you like exploring, feel free to play around with the other property values in this tab.
- You can pretty much ignore the other tabs, except “Settings”, where you can give the processor a custom name. By default the name is the processor “Type”; changing it to something more descriptive is good practice. Save the changes by clicking “Apply”.
- Now it’s time to add another processor. Click and drag another processor onto the canvas. Search for “HDFS” in the box on the top right, or select the “hdfs” or “hadoop” tag from the tags section. Double-click “PutHDFS”, then right-click the newly created processor and click “Configure”.
- Under the “Settings” tab you can again give the processor a custom name. Since this is the final processor in the DataFlow, we will terminate the flow on either success or failure, so check the boxes for “success” and “failure” under the “Auto terminate relationships” section.
- Move on to the “Properties” tab and, if you have a custom Hadoop configuration, provide a value for “Hadoop Configuration Resources”: a comma-separated list of absolute paths to core-site.xml and hdfs-site.xml, for instance “/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml”. Also provide the “Directory” inside HDFS where you want the files moved/copied. Save the configuration.
- Now hover your mouse over the source (the GetFile processor in our case). Click and drag the small “arrow” towards the destination (the PutHDFS processor), releasing the mouse when the destination turns green.
- Click “Add” in the Create Connection window. Since the GetFile processor has only one possible relationship, success, it is selected by default.
- Now hold down the “Shift” key and drag a selection box around the two processors (along with their connection). This ensures all the components are selected at once.
- Now click the green play icon to “start” the DataFlow.
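To make the behaviour of the finished flow concrete, here is a rough Python sketch of what one pass of the two processors does. It is only an illustration under simplified assumptions: a plain local directory stands in for HDFS, whereas the real PutHDFS processor writes through the Hadoop client using the configuration files set above.

```python
import shutil
from pathlib import Path


def run_flow_once(source_dir: str, dest_dir: str) -> list:
    """Mimic one pass of the GetFile -> PutHDFS flow: pick up every file
    in source_dir and move it into dest_dir, returning the moved names.
    (A local directory stands in for HDFS for illustration only.)"""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(Path(source_dir).iterdir()):
        if f.is_file():
            # GetFile removes the source file by default, hence a move, not a copy
            shutil.move(str(f), str(dest / f.name))
            moved.append(f.name)
    return moved
```

In the real flow, NiFi repeats this pickup continuously on the schedule set in the processor's “Scheduling” tab rather than as a single pass.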
Demo – Test DataFlow
Now it’s time to test the DataFlow.
I will copy some log files into /source_directory, and they should be transferred over to /destination_directory inside HDFS.
You can see in the screenshot above that the files copied to /source_directory were successfully transferred to /destination_directory inside HDFS.
At the same time, the NiFi interface shows real-time statistics about the transfer and the DataFlow.
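If you would rather script the check than eyeball screenshots, a small helper like the following can compare the two sides. The file names are illustrative; in practice the destination listing would be the text output of `hdfs dfs -ls /destination_directory`.

```python
def missing_files(source_names, dest_listing):
    """Return the source file names that do not show up in the destination
    listing (e.g. the output of an `hdfs dfs -ls` on the target directory)."""
    return [name for name in source_names if name not in dest_listing]


# Hypothetical example: one of the two files has not arrived yet.
print(missing_files(["a.log", "b.log"], "/destination_directory/a.log"))
```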
Once you are done with the DataFlow, simply select all the processors and connections and click the “stop” icon at the top to stop processing.
Using Hortonworks DataFlow (HDF), powered by Apache NiFi, we can design some really complex DataFlows. I have designed one such semi-complex DataFlow that transfers files from an AWS S3 bucket.
I will be writing an article around setting that up. Stay tuned for that!