Building (Better) Data Pipelines
using Apache Airflow
Sid Anand (@r39132)
QCon.AI 2018
1
About Me
2
Work [ed | s] @
Maintainer of
Spare time
Co-Chair for
Apache Airflow
3
What is it?
4
Apache Airflow : What is it?
In a nutshell:
Airflow is a platform to
programmatically author, schedule
and monitor workflows (a.k.a. DAGs
or Directed Acyclic Graphs)
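To make the DAG idea concrete, here is a minimal plain-Python sketch using the standard library's `graphlib`. It is not Airflow's API: the task names and the `execution_order` helper are hypothetical, and the only point is that a workflow maps each task to its upstream dependencies and must contain no cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical ETL workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(dag):
    """Return one valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as err:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"not a DAG: {err.args[1]}") from err


order = execution_order(workflow)
print(order)
```

A scheduler needs such an order before it can run anything, which is why a graph with a cycle (say, `a` depends on `b` and `b` depends on `a`) is rejected up front.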
Apache Airflow
5
UI Walk-Through
6
Apache Airflow : UI Walk-through
Airflow - Authoring DAGs
• Airflow: Visualizing a DAG
• Airflow: Author DAGs in Python! No need to bundle many XML files!
• Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Performance Insights
• Airflow: Gantt charts reveal the slowest tasks for a run!
• Airflow: …And we can easily see performance trends over time
Apache Airflow
12
Why use it?
13
Apache Airflow : Why use it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
14
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
Apache Airflow : Why use it?
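Three of the requirements above (handling task failures, alerting on them, and enforcing time SLAs) can be sketched in a few lines of plain Python. This is an illustration only, not how Airflow implements them; `run_with_retries`, `TaskResult`, and the `flaky` task are hypothetical names, and alerts are collected in a list rather than sent anywhere.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    name: str
    succeeded: bool
    attempts: int
    duration_s: float
    alerts: list = field(default_factory=list)


def run_with_retries(name, fn, retries=2, sla_s=1.0):
    """Run fn, retrying on failure; collect alerts instead of sending them."""
    alerts = []
    succeeded = False
    attempts = 0
    start = time.monotonic()
    for attempts in range(1, retries + 2):  # first try plus `retries` retries
        try:
            fn()
            succeeded = True
            break
        except Exception as err:
            alerts.append(f"{name}: attempt {attempts} failed: {err}")
    duration = time.monotonic() - start
    if duration > sla_s:
        alerts.append(f"{name}: missed SLA ({duration:.2f}s > {sla_s:.2f}s)")
    return TaskResult(name, succeeded, attempts, duration, alerts)


# A stand-in task that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")


result = run_with_retries("score_messages", flaky, retries=2, sla_s=5.0)
print(result.succeeded, result.attempts, result.alerts)
```

Here the task succeeds on its third attempt, so the result records two failure alerts and no SLA alert.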
15
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
Apache Airflow : Why use it?
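Resource pooling means capping how many tasks may use a shared resource at once. The sketch below shows the idea with a standard-library semaphore; it is an assumption-laden illustration rather than Airflow's pool implementation, and `POOL_SLOTS`, `pooled_task`, and `stats` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # hypothetical limit, e.g. connections a database can spare
pool = threading.BoundedSemaphore(POOL_SLOTS)

stats = {"running": 0, "peak": 0}
stats_lock = threading.Lock()


def pooled_task(task_id):
    """Hold one pool slot for the duration of the task."""
    with pool:
        with stats_lock:
            stats["running"] += 1
            stats["peak"] = max(stats["peak"], stats["running"])
        time.sleep(0.01)  # stand-in for real work
        with stats_lock:
            stats["running"] -= 1
    return task_id


# Eight tasks are submitted at once, but at most POOL_SLOTS hold a slot together.
with ThreadPoolExecutor(max_workers=8) as executor:
    done = list(executor.map(pooled_task, range(8)))

print(f"ran {len(done)} tasks, peak concurrency {stats['peak']}")
```

The executor has eight threads available, yet the recorded peak never exceeds the pool size, which is the property a pool is meant to guarantee.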
Use-Case : Message Scoring
Batch Pipeline Architecture
1. Enterprises (A, B, C) upload to S3 every 15 minutes
2. Airflow kicks off a Spark message scoring job every hour
3. The Spark job writes scored messages and stats to another S3 bucket
4. This triggers SNS/SQS message events
5. An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
6. The importers rapidly ingest scored messages and aggregate statistics into the DB
7. Users receive alerts of untrusted emails & can review them in the web app
8. Airflow manages the entire process
Airflow DAG
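One detail of this use case is worth spelling out: uploads arrive every 15 minutes while scoring runs hourly, so each run would cover four upload batches if it processes exactly one hour of data. The sketch below works that out. The bucket name, the key layout, the example date, and the `upload_prefixes` helper are all hypothetical; the slides do not say how the S3 data is organized.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # from the slides
RUN_INTERVAL = timedelta(hours=1)        # from the slides


def upload_prefixes(run_start, bucket="example-uploads"):
    """List the hypothetical S3 prefixes one hourly scoring run would read."""
    prefixes = []
    t = run_start
    while t < run_start + RUN_INTERVAL:
        # Assumed layout: one prefix per upload, keyed by date and HHMM.
        prefixes.append(f"s3://{bucket}/{t:%Y/%m/%d/%H%M}/")
        t += UPLOAD_INTERVAL
    return prefixes


batches = upload_prefixes(datetime(2018, 4, 10, 9, 0))
for prefix in batches:
    print(prefix)
```

Deriving the inputs from the run's start time, rather than from "whatever is in the bucket now", keeps a rerun of the same hour reading the same four batches.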
Apache Airflow
26
Incubating
27
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator
Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers (we’re growing here)
Thank You!
28
Apache Airflow
29
Behind the Scenes
30
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
Apache Airflow : Behind the Scenes
Components: Webserver, Scheduler, Workers, Meta DB, Celery / RabbitMQ
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks from RabbitMQ (via Celery)
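The hand-off between scheduler and workers can be sketched with an in-process queue. This is a toy model under stated assumptions, not Airflow or Celery code: `queue.Queue` stands in for the Celery / RabbitMQ broker, a dict stands in for the metadata DB, and the task names and state strings are hypothetical.

```python
import queue
import threading

task_queue = queue.Queue()   # stands in for the Celery / RabbitMQ broker
meta_db = {}                 # stands in for the metadata DB (task -> state)
db_lock = threading.Lock()


def scheduler(task_ids):
    """Mark tasks as queued and hand them to the broker."""
    for task_id in task_ids:
        with db_lock:
            meta_db[task_id] = "queued"
        task_queue.put(task_id)


def worker():
    """Pull tasks from the broker until a None sentinel arrives."""
    while True:
        task_id = task_queue.get()
        if task_id is None:
            break
        with db_lock:
            meta_db[task_id] = "success"  # a real task would do work first


NUM_WORKERS = 3
workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

scheduler([f"task_{i}" for i in range(6)])
for _ in workers:
    task_queue.put(None)  # one sentinel per worker so each one exits
for w in workers:
    w.join()

print(meta_db)
```

Because the scheduler and workers only share the queue and the state store, workers can be added or removed without the scheduler knowing about them, which is the property the architecture relies on to scale.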
Thank You!
35
