Many organizations have started adopting Big Data solutions for their analytics. One of the hardest parts of any Big Data project is getting data from multiple sources such as ERP, CRM, files, HTTP links, and IoT devices into the Big Data platform, where the analysis, which is the “real work,” takes place. Ingesting data from these sources requires a well-rounded, scalable, fault-tolerant solution that handles the entire “data flow” logistics of an enterprise. Every enterprise is looking for a single platform that supports many different sources and handles both batch and streaming ingestion.
Enterprises are also looking for tools and technologies that enable rapid development and offer ease of use for developers, reliable data delivery, scalability to handle large data sets, and lineage tracking.
This is where Apache NiFi comes in: a tool for managing an enterprise’s “data flow” into the Big Data platform, with support for multiple sources and targets.
What is NiFi
NiFi (short for “Niagara Files”) is a powerful, enterprise-grade dataflow management tool that can collect, route, enrich, transform, and process data in a reliable and scalable manner. NiFi was originally developed by the National Security Agency (NSA) and is now a top-level Apache project under an open source license, strongly backed by Hortonworks. NiFi is based on the concepts of Flow-Based Programming.
Essentially, Apache NiFi is a comprehensive platform that provides:
- Data acquisition, transportation, and guaranteed data delivery
- Data-based event processing with buffering and prioritized queuing
- Support for highly complex and diverse data flows
- A user-friendly visual interface for development, configuration, and control
Using NiFi as the single platform for enterprise data flow provides an out-of-the-box tool for ingesting data from various sources in a secure and governed manner, which is a significant differentiator. This is particularly important for business cases where data must be ingested from various sources to produce KPIs.
NiFi accelerates data availability in the data lake and hence acts as a catalyst for Big Data project execution and business value extraction. It integrates with and complements various Big Data technologies such as Apache Spark, Apache Kafka, Hadoop, Hive, MongoDB, and Cassandra. This broad support for different Big Data components has greatly increased the adoption of NiFi in many organizations.
Some of the high-level features of NiFi include:
- Web based UI
- Seamless experience between design, control, feedback and monitoring
- Highly configurable
- Loss tolerant vs guaranteed delivery
- Low latency vs high throughput
- Dynamic prioritization
- Runtime flow modification
- Back pressure handling
- Data provenance
- Dataflow lineage tracking from start to end
- Designed for extension
- Build own custom processors
- Rapid development and effective testing
- SSL, SSH, HTTPS, encrypted content, etc.
- Multi-tenant authorization & internal authorization / policy management
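To make two of these features more concrete, here is a minimal Python sketch of back pressure and prioritization on a queue between two processing steps. It is a toy model and not NiFi’s API: the `Connection` class, the `back_pressure_threshold` parameter, and the file names are all hypothetical. It also simplifies one point: this toy queue refuses new items when it is full, whereas NiFi applies back pressure by no longer scheduling the upstream component.

```python
import heapq
import itertools


class Connection:
    """Toy queue between two processing steps (illustrative, not NiFi's API)."""

    def __init__(self, back_pressure_threshold):
        # Hypothetical setting: maximum number of queued items.
        self.back_pressure_threshold = back_pressure_threshold
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def is_full(self):
        return len(self._heap) >= self.back_pressure_threshold

    def offer(self, item, priority=0):
        """Enqueue an item; return False when back pressure applies."""
        if self.is_full():
            return False
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), item))
        return True

    def poll(self):
        """Return the highest-priority item, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


conn = Connection(back_pressure_threshold=2)
accepted = [
    conn.offer("routine.csv", priority=1),
    conn.offer("alert.json", priority=9),
    conn.offer("late.csv", priority=5),  # queue is full, so this is refused
]
first = conn.poll()  # the higher-priority item leaves the queue first
print(accepted, first)
```

The third item is refused because the queue has reached its threshold, and the item with the highest priority is handed downstream first regardless of arrival order.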
NiFi runs within a JVM on a host operating system. The primary components of NiFi within the JVM are:
- Web Server: Hosts NiFi’s HTTP-based command and control APIs.
- Flow Controller: The brain of the operation. It provides threads for extensions to run on and manages the schedule of when extensions receive resources to execute.
- Extensions: NiFi supports various types of extensions. The critical point is that they operate and run within the JVM.
- FlowFile Repository: Where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The repository implementation is pluggable. The default approach is a persistent write-ahead log located on a specified disk partition.
- Content Repository: Where the actual content bytes of a given FlowFile reside. The default approach is relatively simple: it stores blocks of data in the file system.
- Provenance Repository: Where all provenance event data is stored. The repository construct is pluggable, with the default implementation using one or more physical disk volumes.
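The write-ahead log idea behind the default FlowFile repository can be sketched in a few lines of Python. This is only an illustration of the general technique, not NiFi’s actual implementation or file format: the `WriteAheadLog` class, the JSON line format, and the state strings are assumptions made for the example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy append-only log of FlowFile state changes (illustrative only)."""

    def __init__(self, path):
        self.path = path

    def record(self, flowfile_id, state):
        # Persist the change to disk before treating it as done.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"id": flowfile_id, "state": state}) + "\n")
            log.flush()
            os.fsync(log.fileno())

    def recover(self):
        """Replay the log; the last entry per FlowFile wins."""
        states = {}
        if not os.path.exists(self.path):
            return states
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                if entry["state"] == "dropped":
                    states.pop(entry["id"], None)  # no longer active
                else:
                    states[entry["id"]] = entry["state"]
        return states


with tempfile.TemporaryDirectory() as tmp:
    wal = WriteAheadLog(os.path.join(tmp, "flowfile.wal"))
    wal.record("ff-1", "queued:fetch->route")
    wal.record("ff-2", "queued:fetch->route")
    wal.record("ff-1", "queued:route->store")
    wal.record("ff-2", "dropped")
    # A fresh object reading the same file simulates a restart.
    recovered = WriteAheadLog(wal.path).recover()

print(recovered)
```

Because every state change is written to disk before it is acknowledged, replaying the log after a restart rebuilds the set of active FlowFiles and where each one was in the flow.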
NiFi can also operate within a cluster. Below is the sample architecture for NiFi operating in a cluster:
Each node in a NiFi cluster performs the same operations on data, but each operates on a different set of data. Apache ZooKeeper elects a single node as the Cluster Coordinator and handles failover automatically. All nodes in the cluster report heartbeat and status information to the Cluster Coordinator. Each cluster also has a Primary Node, likewise elected by ZooKeeper. As a developer or data flow manager, you can interact with the NiFi cluster through the user interface (UI) of any node. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points.
Core Concepts of NiFi
Some of the core concepts of NiFi are listed below:
- FlowFiles: A unit of information in NiFi consists of two parts: attributes and payload. FlowFiles typically start with a default set of attributes, to which later operations add more. Attributes can be referenced via the NiFi Expression Language. The payload is typically the information itself and can also be referenced by specific processors.
- FlowFile Processors: These do all the actual work in NiFi. They’re self-contained segments of code that in most cases have inputs and outputs. One of the most common processors, GetFTP, retrieves files from an FTP server and creates a FlowFile. The FlowFile includes attributes describing the file, such as the directory it was retrieved from, its creation date, and its filename, along with a payload containing the file’s contents. This FlowFile can then be processed by another standard processor, RouteOnAttribute. This processor looks at an incoming FlowFile and applies user-defined logic based on the attributes before passing it down the chain.
- Connections: These define how FlowFiles travel between processors. Typical connections are success and failure, which provide simple error handling for processors. FlowFiles that are processed without fault are sent to the success queue, while those with problems are sent to the failure queue. Processors such as RouteOnAttribute have custom connections based on the rules created.
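The three concepts above can be put together in a small Python sketch. It is a conceptual model only and does not use NiFi’s processor API: the `FlowFile` dataclass, the `route_on_attribute`, `validate_json`, and `run` functions, and the sample file names are hypothetical stand-ins for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy FlowFile: a dict of attributes plus a byte payload."""
    attributes: dict = field(default_factory=dict)
    payload: bytes = b""


def route_on_attribute(flowfile, rules):
    """Return the first rule name whose predicate matches, else 'unmatched'."""
    for relationship, predicate in rules.items():
        if predicate(flowfile.attributes):
            return relationship
    return "unmatched"


def validate_json(flowfile):
    """Example processor: checks the payload and adds an attribute."""
    json.loads(flowfile.payload)  # raises ValueError on malformed JSON
    flowfile.attributes["validated"] = "true"
    return flowfile


def run(processor, flowfile, connections):
    """Send the result to 'success', or the original FlowFile to 'failure'."""
    try:
        connections["success"].append(processor(flowfile))
    except ValueError:
        connections["failure"].append(flowfile)


# Stand-ins for FlowFiles that a source processor such as GetFTP would create.
incoming = [
    FlowFile({"filename": "events.json"}, b'{"id": 1}'),
    FlowFile({"filename": "broken.json"}, b"{oops"),
    FlowFile({"filename": "notes.txt"}, b"hello"),
]

# User-defined routing rule based on an attribute.
rules = {"json": lambda attrs: attrs["filename"].endswith(".json")}
routed = {"json": [], "unmatched": []}
for ff in incoming:
    routed[route_on_attribute(ff, rules)].append(ff)

connections = {"success": [], "failure": []}
for ff in routed["json"]:
    run(validate_json, ff, connections)

print([f.attributes["filename"] for f in connections["success"]])
print([f.attributes["filename"] for f in connections["failure"]])
```

Routing decisions here look only at attributes, while the validation step reads the payload, which mirrors the split between the two parts of a FlowFile.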
In this post, we looked at the “data flow management” issues that an enterprise needs to address in a Big Data project, and at why Apache NiFi is needed as the “data flow manager” for an enterprise. We also learned what NiFi essentially is and what features it offers. We covered the NiFi architecture in both standalone and cluster mode and went through the core concepts of NiFi.
In the upcoming sections, we will learn how to install NiFi on a Windows system and walk through creating a simple data flow on the NiFi canvas.