Apache Airflow
Pavel Alexeev, Taskdata, 2019
Overview
● Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.
● When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
● Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
● Rich command line utilities make performing complex surgeries on DAGs a snap.
● The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
Beyond the horizon
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space; it is more comparable to Oozie or Azkaban.
Workflows are expected to be mostly static or slowly changing. You can think of the structure of the tasks in your workflow as only slightly more dynamic than a database structure. Airflow workflows are expected to look similar from one run to the next; this allows for clarity around the unit of work and continuity.
Principles
● Dynamic: Airflow pipelines are configuration as code (Python), allowing you to write code that instantiates pipelines dynamically.
● Extensible: Easily define your own operators, executors and extend the library so that it fits the level
of abstraction that suits your environment.
● Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of
Airflow using the powerful Jinja templating engine.
● Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary
number of workers.
Concepts
In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their
relationships and dependencies.
For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C
can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say
that the workflow will run every night at 10pm, but shouldn’t start until a certain date.
In this way, a DAG describes how you want to carry out your workflow; but notice that we haven’t said anything about what we actually
want to do! A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email. Or perhaps A monitors
your location so B can open your garage door while C turns on your house lights. The important thing is that the DAG isn’t concerned
with what its constituent tasks do; its job is to make sure that whatever they do happens at the right time, or in the right order, or with
the right handling of any unexpected issues.
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to
dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In
general, each one should correspond to a single logical workflow.
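As a rough sketch (the names, commands, and Airflow 1.10-style imports below are illustrative, not part of the original deck), the A/B/C example above could look like this:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Runs every night at 10pm, but not before the given start_date.
dag = DAG(
    dag_id='example_abc',
    schedule_interval='0 22 * * *',
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

a = BashOperator(task_id='a', bash_command='echo prepare data',
                 execution_timeout=timedelta(minutes=5), dag=dag)  # A times out after 5 minutes
b = BashOperator(task_id='b', bash_command='echo analyze data',
                 retries=5, dag=dag)                               # B can be retried up to 5 times
c = BashOperator(task_id='c', bash_command='echo send email', dag=dag)

a >> b  # A must succeed before B; C has no dependencies and can run anytime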
Important features
Default Arguments
If a dictionary of default_args is passed to a DAG, it will apply them to any of its operators. This makes it easy to apply a common parameter to many
operators without having to type it many times.
Context Manager
DAGs can be used as context managers to automatically assign new operators to that DAG.
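A minimal sketch combining both features (the argument values are illustrative):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2019, 1, 1),
}

# Every operator created inside the "with" block is assigned to this DAG and
# inherits default_args unless it overrides them explicitly.
with DAG('example_defaults', default_args=default_args, schedule_interval='@daily') as dag:
    t1 = BashOperator(task_id='t1', bash_command='echo hello')
    t2 = BashOperator(task_id='t2', bash_command='echo world', retries=0)  # overrides the default
    t1 >> t2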
Operators
While DAGs describe how to run a workflow, Operators determine what actually gets done.
Hooks
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig.
Pools
Some systems can get overwhelmed when too many processes hit them at the same time. Airflow pools can be used to limit the execution
parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools).
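Once a pool has been created in the UI, tasks opt into it via the pool argument; a sketch (the pool and connection names here are hypothetical, and a dag object as in the earlier sketches is assumed):

from airflow.operators.postgres_operator import PostgresOperator

heavy_query = PostgresOperator(
    task_id='heavy_query',
    postgres_conn_id='analytics_db',   # hypothetical connection
    sql='SELECT 1',
    pool='db_pool',                    # only as many of these run in parallel as the pool has slots
    dag=dag,
)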
Important features
Connections
The connection information to external systems is stored in the Airflow metadata database and managed in the UI (Menu -> Admin ->
Connections). A conn_id is defined there, with hostname / login / password / schema information attached to it. Airflow pipelines can simply refer to the centrally managed conn_id without having to hard code any of this information anywhere.
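For example, a hook only needs the conn_id; the connection ID and query below are hypothetical:

from airflow.hooks.postgres_hook import PostgresHook

def fetch_counts(**kwargs):
    # Hostname, login, password and schema are resolved from the metadata database.
    hook = PostgresHook(postgres_conn_id='analytics_db')
    return hook.get_records('SELECT count(*) FROM events')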
Queues
When using the CeleryExecutor, the Celery queues that tasks are sent to can be specified. queue is an attribute of BaseOperator, so
any task can be assigned to any queue.
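A short sketch (the queue name is illustrative); a Celery worker consumes it when started with "airflow worker -q gpu":

train = BashOperator(
    task_id='train_model',
    bash_command='python train.py',
    queue='gpu',   # only workers listening on the "gpu" queue will pick this task up
    dag=dag,
)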
XComs
XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of
“cross-communication”. XComs are principally defined by a key, value, and timestamp, but also track attributes like the task/DAG that
created the XCom and when it should become visible. Any object that can be pickled can be used as an XCom value, so users should
make sure to use objects of appropriate size. XComs can be “pushed” (sent) or “pulled” (received).
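A sketch of pushing and pulling an XCom between two PythonOperators (task and key names are illustrative; provide_context=True is the Airflow 1.x way to get the task instance into the callable):

from airflow.operators.python_operator import PythonOperator

def push(**kwargs):
    kwargs['ti'].xcom_push(key='row_count', value=42)

def pull(**kwargs):
    count = kwargs['ti'].xcom_pull(task_ids='push_task', key='row_count')
    print('row_count =', count)

push_task = PythonOperator(task_id='push_task', python_callable=push,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id='pull_task', python_callable=pull,
                           provide_context=True, dag=dag)
push_task >> pull_task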
Important features
Variables
Variables are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow. Variables can
be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI.
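In code, the Variable model provides get/set access (the keys and values below are illustrative):

from airflow.models import Variable

Variable.set('env', 'production')
bucket = Variable.get('s3_bucket', default_var='my-default-bucket')
config = Variable.get('pipeline_config', deserialize_json=True, default_var={})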
Branching
Sometimes you need a workflow to branch, or to go down a certain path only when an arbitrary condition is met, typically one related to something that happened in an upstream task.
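Branching is typically done with a BranchPythonOperator whose callable returns the task_id to follow; a sketch with illustrative task names:

from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

def choose_path(**kwargs):
    # Decide based on anything available in the context, e.g. the execution date.
    return 'weekend_task' if kwargs['execution_date'].weekday() >= 5 else 'weekday_task'

branch = BranchPythonOperator(task_id='branch', python_callable=choose_path,
                              provide_context=True, dag=dag)
branch >> [DummyOperator(task_id='weekend_task', dag=dag),
           DummyOperator(task_id='weekday_task', dag=dag)]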
SubDAGs
SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
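A sketch of that factory pattern together with a SubDagOperator (all names are illustrative):

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

def make_subdag(parent_dag_id, child_id, default_args):
    # The SubDAG's id must be "<parent_dag_id>.<child_id>" for Airflow to associate them.
    subdag = DAG('%s.%s' % (parent_dag_id, child_id),
                 default_args=default_args, schedule_interval='@daily')
    DummyOperator(task_id='step_1', dag=subdag)
    return subdag

section_1 = SubDagOperator(task_id='section_1',
                           subdag=make_subdag(dag.dag_id, 'section_1', default_args),
                           dag=dag)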
SLAs
Service Level Agreements, or time by which a task or DAG should have succeeded, can be set at a task level as a timedelta. If one or
many instances have not succeeded by that time, an alert email is sent detailing the list of tasks that missed their SLA.
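For example (the one-hour window is illustrative):

from datetime import timedelta

load = BashOperator(
    task_id='load_warehouse',
    bash_command='echo load',
    sla=timedelta(hours=1),   # alert if not finished within an hour of the scheduled time
    dag=dag,
)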
Important features
Trigger Rules
Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more
complex dependency settings. All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered.
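For example, a cleanup task can be made to run regardless of upstream failures (task names are illustrative; other rules include one_success, all_failed, none_failed and so on):

from airflow.utils.trigger_rule import TriggerRule

cleanup = BashOperator(
    task_id='cleanup',
    bash_command='echo cleanup',
    trigger_rule=TriggerRule.ALL_DONE,  # run once upstream tasks finish, whether they succeeded or not
    dag=dag,
)
task_a >> cleanup  # assuming task_a and task_b are defined elsewhere
task_b >> cleanup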
Latest Run Only
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just waste CPU cycles. The LatestOnlyOperator can be used to skip such tasks for every run except the most recent one, as sketched below.
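A sketch using the LatestOnlyOperator (task names are illustrative):

from airflow.operators.latest_only_operator import LatestOnlyOperator

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
deploy = BashOperator(task_id='deploy', bash_command='echo deploy', dag=dag)

# "deploy" is skipped for backfills and catch-up runs; it only executes for the
# most recent scheduled interval.
latest_only >> deploy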
Zombies & Undeads
Task instances die all the time, usually as part of their normal life cycle, but sometimes unexpectedly.
Zombie tasks are characterized by the absence of a heartbeat (emitted by the job periodically) combined with a running status in the database. They can occur when a worker node can't reach the database, when Airflow processes are killed externally, or when a node gets rebooted, for instance. Zombie killing is performed periodically by the scheduler's process.
Important features
Cluster Policy
Your local Airflow settings file (airflow_local_settings.py) can define a policy function that can mutate task attributes based on other task or DAG attributes. It receives a single argument, a reference to a task object, and is expected to alter that task's attributes in place.
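A minimal sketch of such a policy in airflow_local_settings.py (the enforced values and DAG id are illustrative; in Airflow 2.x this hook is named task_policy):

# airflow_local_settings.py -- must be importable on the scheduler's PYTHONPATH
from datetime import timedelta

def policy(task):
    # Cap runtime globally and route one (hypothetical) DAG to a dedicated queue.
    if task.execution_timeout is None:
        task.execution_timeout = timedelta(hours=2)
    if task.dag_id == 'nightly_reports':
        task.queue = 'reports'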
Jinja Templating
Airflow leverages the power of Jinja Templating and this can be a powerful tool to use in combination with macros (see the Macros
reference section).
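For example, templated fields such as bash_command are rendered against the execution context and macros (the params value is illustrative):

templated = BashOperator(
    task_id='templated',
    # {{ ds }} is the execution date as YYYY-MM-DD; params are user-supplied.
    bash_command='echo "run date: {{ ds }}, target: {{ params.target }}"',
    params={'target': 's3://my-bucket/raw'},
    dag=dag,
)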
Packaged DAGs
While you will often define DAGs in a single .py file, it is sometimes necessary to package a DAG together with its dependencies. For example, you might want to version or manage several DAGs together, or you might need an extra module that is not available by default on the system running Airflow. To allow this, you can create a zip file that contains the DAG(s) at its root and the extra modules unpacked in directories.
Operators
While DAGs describe how to run a workflow, Operators determine what actually gets done.
Airflow provides operators for many common tasks, including:
● BashOperator - executes a bash command
● PythonOperator - calls an arbitrary Python function
● EmailOperator - sends an email
● SimpleHttpOperator - sends an HTTP request
● MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - execute a SQL command
● Sensor - waits for a certain time, file, database row, S3 key, etc…
In addition to these basic building blocks, there are many more specific operators: DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMySqlTransfer, SlackAPIOperator… you get the idea!
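As a quick illustration, a PythonOperator simply wraps a callable (the function below is illustrative):

from airflow.operators.python_operator import PythonOperator

def print_context(ds, **kwargs):
    print('processing data for', ds)

run_this = PythonOperator(task_id='print_the_context',
                          python_callable=print_context,
                          provide_context=True,
                          dag=dag)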
Alternatives
● https://azkaban.github.io - from LinkedIn
● https://oozie.apache.org - native to the Hadoop stack
Editor's Notes
● #3: https://hub.docker.com/r/apache/airflow
● #4: https://hub.docker.com/r/apache/airflow
● #6: https://airflow.apache.org/concepts.html
● #12: https://airflow.apache.org/concepts.html#operators