Introduction to DataFlow
management using Apache NiFi
Presented by: Anshuman Ghosh
Topics we will cover
 DataFlow and problems.
 What is Apache NiFi – History, key features, core components
 Architecture to start with NiFi (single-server setup)
 Architecture to scale with NiFi (NiFi cluster setup)
 Fundamentals of NiFi Web UI
 Building a NiFi DataFlow Processor
 Live demo
 Testing
 Deployment and automation
 What next?
 Q&A
DataFlow
 The term “DataFlow” can be used in a variety of contexts.
 In our context it is the flow of information between systems.
 It is crucial to have a robust platform to create, manage and automate the
flow of enterprise data.
 There are many tools for data gathering and data flow, but more often
than not we lack an integrated platform for that.
 Ideally, we would have seamless integration ...
What enterprises look for
To be able to get data from any source
… To the systems that perform analytics
… And to those for user availability
Common DataFlow challenges
 System failure
 Mismatch between the rates of data production and consumption
 Dynamically changing data priorities
 Protocols and format changes; new systems, new protocols
 Need for bidirectional data flow
 Transparency and control
 Security and privacy
Brief history of Apache NiFi
 Developed at the NSA (National Security Agency, USA) over a period of more than 8 years.
 The engineers who later founded Onyara developed a project at the NSA called “Niagara
Files”, which later became NiFi.
 Through the NSA Technology Transfer Program it was made available as an open
source Apache project, “Apache NiFi”, in 2014.
 Hortonworks has a partnership with Onyara on their “Hortonworks DataFlow
powered by Apache NiFi”
What is Apache NiFi
 Taken as a whole, Apache NiFi is an integrated platform to collect, conduct and
curate real-time data (data in motion).
 Provides end-to-end DataFlow management from any source* to any
destination*.
 Provides data logistics – real-time operational visibility and control of
DataFlow.
 Supports powerful and scalable directed graphs of data routing and data
transformation.
 All of this in a reliable and secure manner.
*The complete list of sources and destinations is in the official documentation.
Key features
 Guaranteed data delivery – “at least once” semantics
 Data buffering and Back pressure
 Data prioritization in queue
 Flow-specific settings for “latency vs. throughput”
 Data provenance
 Visual control
 Flow templates
 Recovery/recording through the content repository
 Clustering to scale-out
 Security
 Classloader Isolation
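To make “data prioritization in queue” concrete, here is a minimal sketch in plain Java. It is a toy model, not NiFi's prioritizer implementation: the class names and the rule “lower number is served first” are assumptions made only for this illustration.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative only: a toy queue, not NiFi's prioritizer implementation.
class PrioritySketch {
    static final class Item {
        final String name;
        final int priority; // assumption for this sketch: lower number is served first

        Item(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Returns a queue whose poll() always yields the item with the lowest priority number.
    static PriorityQueue<Item> newQueue() {
        return new PriorityQueue<>(Comparator.comparingInt((Item i) -> i.priority));
    }

    public static void main(String[] args) {
        PriorityQueue<Item> queue = newQueue();
        queue.add(new Item("bulk-log", 5));
        queue.add(new Item("alert", 1));
        queue.add(new Item("metrics", 3));
        // Items leave the queue by priority, not by arrival order.
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().name);
        }
    }
}
```

In NiFi itself the ordering is configured per connection in the Web UI rather than written by hand.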
Core components of NiFi
 At its core, NiFi follows the concepts of flow-based programming (FBP).
 The core components of NiFi, with their FBP roles, are:
 FlowFile – the unit of data moving through the flow (the FBP “information packet”)
 FlowFile Processor – the processing engine (the FBP “black box”)
 Connection – the link between Processors, which acts as a queue (the FBP “bounded buffer”)
 Flow Controller – the scheduler that runs the Processors (the FBP “scheduler”)
 Process Group – a set of Processors and Connections packaged together (the FBP “subnet”)
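The relationship between these components can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not NiFi's API: `FlowFile`, `Processor`, `Connection` and `runOnce` below are hypothetical stand-ins written for this illustration. It also shows how a bounded buffer gives back pressure: a step is skipped while the downstream connection is full.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative model only; these are NOT NiFi's real classes.
class FlowSketch {
    // FlowFile: the information packet, i.e. attributes plus content.
    static final class FlowFile {
        final Map<String, String> attributes;
        final String content;

        FlowFile(Map<String, String> attributes, String content) {
            this.attributes = attributes;
            this.content = content;
        }
    }

    // Processor: a black box that maps one FlowFile to another.
    interface Processor {
        FlowFile process(FlowFile in);
    }

    // Connection: a bounded buffer linking two processors.
    static final class Connection {
        private final BlockingQueue<FlowFile> queue;

        Connection(int capacity) {
            this.queue = new ArrayBlockingQueue<>(capacity);
        }

        boolean offer(FlowFile f) { return queue.offer(f); } // false when full
        FlowFile poll() { return queue.poll(); }             // null when empty
        boolean isFull() { return queue.remainingCapacity() == 0; }
    }

    // One scheduling step, the job the Flow Controller does in this model.
    // Returns false without consuming input when there is nothing to do
    // or when the downstream connection is full (back pressure).
    static boolean runOnce(Connection in, Processor p, Connection out) {
        if (out.isFull()) {
            return false;
        }
        FlowFile f = in.poll();
        if (f == null) {
            return false;
        }
        return out.offer(p.process(f));
    }

    public static void main(String[] args) {
        Connection in = new Connection(2);
        Connection out = new Connection(1);
        in.offer(new FlowFile(Map.of("filename", "a.txt"), "hello"));
        in.offer(new FlowFile(Map.of("filename", "b.txt"), "world"));
        Processor upperCase = f -> new FlowFile(f.attributes, f.content.toUpperCase());

        System.out.println(runOnce(in, upperCase, out)); // first FlowFile moves downstream
        System.out.println(runOnce(in, upperCase, out)); // out is full, so nothing runs
    }
}
```

The model is single-threaded on purpose; the real Flow Controller schedules many processors concurrently.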
Core components diagram
 This is how a typical NiFi DataFlow might look
NiFi Architecture
 NiFi executes within a JVM on a host Operating System.
NiFi Architecture – Clustering
 Typical NiFi cluster
Core components of NiFi Cluster
 NiFi Cluster Manager
 Nodes
 Primary Node
 Isolated Processors
 Heartbeats
Fundamentals of the Web UI
Building a DataFlow Processor
 Drag the “Processor” icon from the “Component Toolbar” onto the canvas; this
opens the ‘Add Processor’ dialog
Building a DataFlow Processor
 General ‘SETTINGS’ for the processor
Building a DataFlow Processor
 ‘SCHEDULING’ information
Building a DataFlow Processor
 Setting up mandatory and optional ‘PROPERTIES’
Building a DataFlow Processor
 Auto alert mechanism
 If the configuration is invalid, NiFi will not allow the processor to be started
Building a DataFlow Processor
 If everything is set, we are ready to start the processor
Demo 1
 In this demo, we will go through a NiFi DataFlow that deals with the
following steps
 Connect to Kafka and consume from a topic.
 Store consumed data in a local storage (optional).
 Anonymize IP address.
 Merge content before writing to HDFS (to avoid the small-files problem).
 Finally store Kafka data onto HDFS
 Look into error handling.
 Look into use of expression language.
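The slides do not say how the demo anonymizes IP addresses. One common approach is to zero the last octet of each IPv4 address with a regular expression; the sketch below shows that approach in plain Java. The class and method names are hypothetical, and the pattern does not validate octet ranges or handle IPv6.

```java
import java.util.regex.Pattern;

// Illustrative only: one possible anonymization rule, not the one used in the demo.
class IpAnonymizer {
    // Captures the first three octets of a dotted-quad address.
    // Octet values are not range-checked in this sketch.
    private static final Pattern IPV4 =
            Pattern.compile("\\b(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\.\\d{1,3}\\b");

    // Replaces the last octet of every IPv4 address in the text with 0.
    static String anonymize(String text) {
        return IPV4.matcher(text).replaceAll("$1.0");
    }

    public static void main(String[] args) {
        System.out.println(anonymize("GET /index.html from 192.168.10.77"));
    }
}
```

Inside a NiFi flow, the same regular expression could probably be applied with a text-replacement processor instead of custom code.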
Demo 2
 In this demo, we will go through a NiFi DataFlow that deals with the
following steps
 Collect/ fetch data files from a local location.
 Update/ add attributes.
 Parse JSON strings to DB Insert statements.
 Connect to PostgreSQL and Insert.
 Error handling.
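The “JSON to Insert statement” step can be pictured as follows. This is a sketch under assumptions: the JSON record is taken to be flat and already parsed into a map, and the class name, table name and columns are hypothetical. It builds a parameterized statement so that the values are bound separately instead of being concatenated into the SQL text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the code of any NiFi processor.
class InsertBuilder {
    // Builds "INSERT INTO table (c1, c2) VALUES (?, ?)" from the keys of a flat record.
    // Table and column names are concatenated, so they must come from a trusted source.
    static String insertSql(String table, Map<String, Object> row) {
        String columns = String.join(", ", row.keySet());
        String placeholders = String.join(", ", Collections.nCopies(row.size(), "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    // Returns the values in column order, ready to bind to the placeholders.
    static List<Object> parameters(Map<String, Object> row) {
        return new ArrayList<>(row.values());
    }

    public static void main(String[] args) {
        // Stands in for the parsed JSON {"id": 1, "name": "alice"}.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "alice");
        System.out.println(insertSql("users", row));
        System.out.println(parameters(row));
    }
}
```

A `LinkedHashMap` is used so that the column order and the parameter order are guaranteed to match.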
Unit testing components
 For component testing, the nifi-mock module can be used with JUnit.
 The TestRunner interface allows us to test Processors and Controller Services.
 Instantiate a new TestRunner (package org.apache.nifi.util).
 Add Controller Services and configure them.
 Set Processor properties with setProperty(PropertyDescriptor, String).
 Enqueue FlowFiles using the enqueue methods of the TestRunner.
 Start the Processor by calling the run() method of the TestRunner.
 Validate the output using the TestRunner's assertAllFlowFilesTransferred and
assertTransferCount methods.
 More details can be found here – https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#testing
 Add the nifi-mock Maven dependency.
 Call the static newTestRunner method of the TestRunners class.
 Call the addControllerService method to add a Controller Service.
 Set its properties with setProperty(ControllerService, PropertyDescriptor, String).
 Enable the service with enableControllerService(ControllerService).
 Set Processor properties with setProperty(PropertyDescriptor, String).
 Use the overloaded enqueue methods for byte[], InputStream, or Path.
 Call run(int); this invokes the methods annotated with @OnScheduled, then the Processor’s
onTrigger method, and then the @OnUnscheduled and finally the @OnStopped
methods.
 Validate the result with the assertAllFlowFilesTransferred and assertTransferCount methods.
 Access the output FlowFiles by calling the getFlowFilesForRelationship() method.
Error handling
 The following can occur
 Unexpected data format
 Network connection or disk failure
 Bug in processor
 Two kinds of exception: ProcessException and all others (such as NullPointerException)
 ProcessException – Rollback and penalize the FlowFiles
 All others – Rollback, penalize the FlowFiles and Yield the Processor
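The two rules above can be written down as a small decision function. This is a plain-Java sketch of the rule only: the `ProcessException` class here is a local stand-in defined for the illustration (not the NiFi class), and `Outcome` and `trigger` are hypothetical names.

```java
// Illustrative only: models the rule from the slide, not NiFi's framework code.
class ErrorHandlingSketch {
    // Local stand-in for an expected, recoverable processing failure.
    static class ProcessException extends RuntimeException {
        ProcessException(String message) {
            super(message);
        }
    }

    enum Outcome {
        COMMITTED,                     // no exception: the session is committed
        ROLLED_BACK_PENALIZED,         // ProcessException: roll back and penalize the FlowFiles
        ROLLED_BACK_PENALIZED_YIELDED  // anything else: additionally yield the Processor
    }

    // Runs one trigger of a processor and reports how the failure, if any, is treated.
    static Outcome trigger(Runnable onTrigger) {
        try {
            onTrigger.run();
            return Outcome.COMMITTED;
        } catch (ProcessException e) {
            return Outcome.ROLLED_BACK_PENALIZED;
        } catch (RuntimeException e) {
            // Unexpected errors (for example a NullPointerException) suggest a bug,
            // so the processor itself is paused for a while as well.
            return Outcome.ROLLED_BACK_PENALIZED_YIELDED;
        }
    }

    public static void main(String[] args) {
        System.out.println(trigger(() -> { }));
        System.out.println(trigger(() -> { throw new ProcessException("bad record"); }));
        System.out.println(trigger(() -> { throw new NullPointerException(); }));
    }
}
```

The more specific `ProcessException` branch has to come first, because it is itself a `RuntimeException` in this sketch.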
Testing automation, Deployment
 NiFi provides a REST API for all components; the full documentation can
be found here https://nifi.apache.org/docs/nifi-docs/rest-api/index.html
 The Apache NiFi community is working to improve this area
 We can set up the deployment in the following way
 Create the application, i.e. the entire DataFlow, on your local machine and test it.
 Create a process group around it (optional).
 Create a template. (Can be done from Web UI/ REST API call)
 Download the template. (Can be done from Web UI/ REST API call)
 Use a REST API call to import the template into the new environment.
 Use REST API calls to update Processors (Properties, Schedule, Settings, etc.)
 Use a REST API call to instantiate the template
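As an illustration of the last step, the sketch below builds (but does not send) an HTTP request that instantiates a template. Everything specific in it is an assumption to be checked against the REST API documentation for your NiFi version: the `/nifi-api/process-groups/{id}/template-instance` path and the `templateId`, `originX`, `originY` fields are as recalled from the NiFi 1.x API, and the base URL and IDs are placeholders. Authentication is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: endpoint path and JSON fields are assumptions (NiFi 1.x style).
class TemplateDeploySketch {
    // Builds a POST request asking NiFi to place a template on the canvas of a process group.
    // The IDs are inserted as-is, so they must not contain characters that need escaping.
    static HttpRequest instantiateTemplate(String baseUrl, String groupId, String templateId) {
        String body = "{\"templateId\":\"" + templateId + "\",\"originX\":0.0,\"originY\":0.0}";
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/nifi-api/process-groups/" + groupId + "/template-instance"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values; a real call would use the IDs returned by the import step.
        HttpRequest request = instantiateTemplate("http://localhost:8080", "root", "my-template-id");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to `java.net.http.HttpClient` (Java 11 or later) and inspect the response status before moving on to the next deployment step.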
Deployment
 There is one more option:
 Copy the whole flow (flow.xml.gz) from one environment to another.
 This requires copying the entire canvas.
 Take care with the encryption of sensitive properties.
What is next
 We are planning to work further on the testing and deployment side and update this material.
 Please read more on NiFi development here –
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
 And for the user guide – https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
 We have carried out POCs on some of our real use cases; please find them
here
 Link HDFS data ingestion using Apache
 Link How to setup Apache NiFi
 Link Expression Language Guide
 For any questions or suggestions, please come by or write.
Q&A
 Questions?
Thank you!
Presented by: Anshuman Ghosh
