Powering
machine learning workflows with
Apache Airflow and Python
@tati_alchueyr
OctopusCon Python Edition
Kharkiv, 16 November 2019
It is a pleasure to be here with you
Thank you very much
Image credit: Turkish Airlines
tati_alchueyr.__doc__
● Brazilian living in London since 2014
● Senior Data Engineer at the BBC Datalab team
● Graduated in Computer Engineering at Unicamp
● Passionate software developer for 16 years
● Experience in the private and public sectors
● Developed software for Medicine, Media and Education
● Loves Open Source
● Loves Brazilian Jiu Jitsu
● Proud mother of Amanda
Software InVesalius 3
Distance from
Rio de Janeiro to London:
9,272 km
Distance from
London to Kharkiv: 2,899 km
help(bbc)
● British Broadcasting Corporation
● Values
○ Independent, impartial and honest
○ Audiences are at the heart of everything we do
○ We take pride in delivering quality and value for money
○ Creativity is the lifeblood of our organisation
○ We respect each other and celebrate our diversity so
that everyone can give their best
● Purpose
○ Inform
○ Educate
○ Entertain
New Broadcasting House
London, UK
bbc.stats()
➢ BBC TV reaches 91% UK adult population
➢ BBC News reaches 426 million global audience weekly
Reference 1: BBC
Reference 2: BBC
Image Credit: BBC
bbc.datalab.mission
“Bring the BBC’s data together
accessible through a common platform,
along with flexible and scalable tools to
support machine learning to enable
content enrichment and deeper
personalisation”
Some of the Datalab team members (15 August 2019)
bbc.datalab.mission
Have you ever
built or used a machine learning
model?
no yes ?
machine learning application personalisation
Image credit: BBC
machine learning application content creation
Image credit: BBC ("Made by the Machine: when AI met the archive", BBC 4)
machine learning training & prediction supervised
machine learning tools jupyter notebooks
Image credit: Jupyter.org
BBC Datalab
ML course
machine learning tools python scripts
machine learning tools (remote) servers
machine learning tools containers
Image credit: RedHat
have you ever
scheduled processes?
no yes ?
scheduling humans
scheduling processes cron jobs
0 2 * * * /bin/sh backup.sh
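As a reading aid, the five schedule fields of the crontab entry above can be decoded with a short Python sketch. The helper name `describe_cron` is hypothetical; it only splits the standard five-field layout and does not validate ranges or expand steps.

```python
# Field names follow the standard five-field crontab layout.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(entry):
    """Split a crontab entry into its schedule fields and its command."""
    parts = entry.split(None, 5)  # five schedule fields, then the command
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


schedule, command = describe_cron("0 2 * * * /bin/sh backup.sh")
print(schedule)  # minute 0 of hour 2, any day: the backup runs daily at 02:00
print(command)
```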
scheduling processes k8s jobs
apiVersion: batch/v1
kind: Job
metadata:
  generateName: backup-
  namespace: sample
spec:
  template:
    spec:
      containers:
      - name: backup
        image: alpine:3.9
        command: ["sh", "-c", "backup.sh"]
      restartPolicy: Never  # a Job's pods must use Never or OnFailure
  backoffLimit: 3
scheduling workflows cron jobs
Several cron jobs running...
It seems a critical job didn’t run
last night...
Didn’t it run? Did it fail?
Why could it have failed?
Original image credit: XKCD
scheduling workflows tools
Azkaban
how much do you know about
Apache Airflow?
none basic mid high
airflow
Image credit: House Plans
air(bnb)flow
Airflow release blog post by Airbnb
https://github.com/apache/airflow
Image credit: Airbnb
airflow why
● Handle complex relationships between jobs
● Handle all the jobs centrally with a well defined user
interface
● Error reporting and alerting
● Viewing and analyzing job run times
● Security (protecting credentials of databases)
airflow why not
● In many cases, cron jobs are the simplest and most
effective tool
● Airflow is a complex tool made of several components
○ Learning curve
○ Infrastructure management cost
airflow concepts (i) DAG
● All workflows are considered to be DAGs
○ DAG: Directed Acyclic Graph
Figures: example graphs made of nodes (Job A, Job B, Job C, Job D) joined by directed edges, each labelled either "DAG" or "not DAG"
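The "DAG or not DAG" question can be answered mechanically. The sketch below is one way to do it, using Kahn's algorithm in plain Python; it is not how Airflow itself validates DAGs, and the two example edge lists are assumptions, since the layout of the original figures is not recoverable from the slides.

```python
from collections import deque


def is_dag(edges):
    """Return True if the directed graph of (upstream, downstream) pairs has no cycle."""
    nodes = {node for edge in edges for node in edge}
    downstream = {node: [] for node in nodes}
    indegree = {node: 0 for node in nodes}
    for up, down in edges:
        downstream[up].append(down)
        indegree[down] += 1

    # Kahn's algorithm: repeatedly remove nodes with no remaining upstream dependency.
    ready = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)


chain = [("Job A", "Job B"), ("Job B", "Job C")]
cycle = [("Job A", "Job B"), ("Job B", "Job C"), ("Job C", "Job A")]
print(is_dag(chain))  # True
print(is_dag(cycle))  # False
```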
airflow concepts (ii) DAG properties
● DAGs (usually) have:
○ schedule
○ start time
○ unique name (ID)
○ nodes: jobs (instances of Operators)
○ edges: dependencies between the nodes
airflow concepts (iii) operators
● Operators define the task or job
○ BashOperator: execute shell commands/scripts
○ PythonOperator: execute Python code
○ BranchPythonOperator: choose which downstream job(s) to run based on a condition
○ SlackOperator
○ (...)
○ Custom operators
demo
Apache Airflow pipeline
(example with a Python operator)
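The demo itself is not captured in these slides. As a stand-in, here is a minimal sketch of a DAG with the properties listed earlier (ID, start time, schedule, two operator instances and one edge). It assumes Airflow 2.x import paths; the 1.10 releases current at the time of the talk used `airflow.operators.bash_operator` and `airflow.operators.python_operator`, and newer releases rename `schedule_interval` to `schedule`. The names `hello_pipeline`, `prepare`, `say_hello` and `build_dag` are hypothetical.

```python
from datetime import datetime


def say_hello():
    """Task body: plain Python, so it can be tested without Airflow."""
    message = "hello from Airflow"
    print(message)
    return message


def build_dag():
    """Wire the callable into a DAG (assumes Airflow 2.x import paths)."""
    # Imported here so the task function above stays importable without Airflow.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="hello_pipeline",            # unique name (ID)
        start_date=datetime(2019, 11, 16),  # start time
        schedule_interval="0 2 * * *",      # schedule: daily at 02:00
        catchup=False,                      # do not backfill past runs
    )
    prepare = BashOperator(task_id="prepare", bash_command="echo preparing", dag=dag)
    hello = PythonOperator(task_id="say_hello", python_callable=say_hello, dag=dag)
    prepare >> hello  # edge: say_hello runs only after prepare succeeds
    return dag
```

In a DAG file, a module-level `dag = build_dag()` makes the DAG visible to the scheduler, which looks for DAG objects among the module's globals.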
airflow concepts (iv) relationships
● Edges define dependencies
○ When some tasks need to execute one after another
Image credit: Airbnb
airflow concepts (v) connections
● Connections encrypt credentials
○ The jobs do not need to worry about securing
credentials
Image credit: Airbnb
airflow concepts (vi) visualisation
Image credit: Airbnb
demo
Apache Airflow pipeline
(example of data ingestion at the BBC)
inside airflow
airflow architecture
airflow managed service GCP Cloud Composer
Image credit: Google
scars of experience
Image credit: XKCD
scars of experience installing python packages
● When using a Python Operator, the job is run within the
worker
● Therefore, by default, Python dependencies are
installed globally to the workers
● In other words, application deployments can break your
Airflow environment
scars of experience installing python packages
● Isolate the execution from the scheduling, when
reasonable
● Debugging native operators means debugging Airflow itself
● Alternatives to isolate them:
○ PythonVirtualenvOperator
○ DockerOperator
○ KubernetesPodOperator
○ GceInstanceStartOperator
Interesting reading: Medium
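The operators listed above share one idea: the job runs somewhere other than the worker's own Python environment. The sketch below illustrates that idea with nothing but the standard library, by running a job in a separate interpreter process. It is not how any of those operators is implemented; `JOB_SOURCE` and `run_isolated` are hypothetical names, and the `python` argument could point at the interpreter of a dedicated virtualenv.

```python
import json
import subprocess
import sys

# Hypothetical job: it runs in its own interpreter and reports a result as JSON.
JOB_SOURCE = """
import json, sys
print(json.dumps({"python": sys.version_info[0], "status": "done"}))
"""


def run_isolated(source, python=sys.executable):
    """Run `source` in a separate interpreter process and return its JSON output."""
    completed = subprocess.run(
        [python, "-c", source],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the job exits non-zero
    )
    return json.loads(completed.stdout)


result = run_isolated(JOB_SOURCE)
print(result["status"])
```

Because the job only communicates through its exit code and output, a broken dependency in the job's environment fails that job rather than the process that launched it.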
scars of experience debugging
There was a breaking change in an
Airflow plugin, the scheduler couldn’t
process the DAG
scars of experience debugging
The DAG in the worker instances was
deleted but its metadata was no longer
available in the scheduler
scars of experience debugging
● Error messages are not always obvious
○ Understand what is happening in the system
○ The webserver and scheduler are independent
processes
scars of experience versioning can be tricky
● Log the versions of the DAG, operators and plugins whenever
they run
● When catchup is enabled, new jobs will be added to
previous executions
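One possible way to log versions at run time is sketched below, assuming the code in question is installed as Python distributions and that Python 3.8+ is available for `importlib.metadata`. The logger name, the `dag_version` value and the package names are hypothetical.

```python
import logging
from importlib import metadata

log = logging.getLogger("pipeline.versions")


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def log_versions(dag_version, packages):
    """Log the DAG's own version next to the versions of the packages it uses."""
    versions = collect_versions(packages)
    log.info("dag_version=%s packages=%s", dag_version, versions)
    return versions


# Called at the start of a job, so every run records what it ran with.
versions = log_versions("2019.11.16", ["pip", "package-that-is-not-installed"])
```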
scars of experience using xcom between jobs
● Alternatives
○ By default, the return value of the operator execute
method is stored in XCom
○ XCom values are stored in the Airflow metadata DB
○ Avoid using XCom
○ Store the state in data stores (databases, object
stores, etc)
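A minimal sketch of the last point, assuming a file-based store: both jobs derive the same location from the run identifier, so nothing has to travel through XCom. The temporary directory stands in for a real database or object store, and the function names and record fields are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def records_path(store_dir, run_id):
    """Both jobs derive the same location from the run identifier."""
    return Path(store_dir) / f"{run_id}-records.json"


def extract(store_dir, run_id):
    """Upstream job: write its output to the store instead of returning it."""
    records = [{"user": "a", "plays": 3}, {"user": "b", "plays": 5}]
    records_path(store_dir, run_id).write_text(json.dumps(records))


def summarise(store_dir, run_id):
    """Downstream job: read the upstream output back from the store."""
    records = json.loads(records_path(store_dir, run_id).read_text())
    return sum(record["plays"] for record in records)


with tempfile.TemporaryDirectory() as store_dir:  # stand-in for a real data store
    extract(store_dir, run_id="2019-11-16")
    total = summarise(store_dir, run_id="2019-11-16")

print(total)  # 8
```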
scars of experience scheduling duration
scars of experience breaking changes
● Minor versions of Airflow can introduce breaking changes
○ Example: named parameter in S3Hook (1.8 -> 1.9)
■ aws_conn_id
■ s3_conn_id
Reference: Airflow development mailing list
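One defensive pattern against a renamed keyword argument is to inspect the constructor and pass whichever name it accepts. The sketch below uses two hypothetical stand-in classes rather than the real `S3Hook`, and makes no claim about which Airflow release used which name.

```python
import inspect


def conn_id_kwarg(hook_cls, candidates=("aws_conn_id", "s3_conn_id")):
    """Return the first candidate name that hook_cls's constructor accepts."""
    parameters = inspect.signature(hook_cls.__init__).parameters
    for name in candidates:
        if name in parameters:
            return name
    raise TypeError(f"{hook_cls.__name__} accepts none of {candidates}")


# Hypothetical stand-ins for two releases of the same hook class.
class HookWithS3Name:
    def __init__(self, s3_conn_id="s3_default"):
        self.conn_id = s3_conn_id


class HookWithAwsName:
    def __init__(self, aws_conn_id="aws_default"):
        self.conn_id = aws_conn_id


def make_hook(hook_cls, conn_id):
    """Build the hook with whichever keyword this version expects."""
    return hook_cls(**{conn_id_kwarg(hook_cls): conn_id})


print(make_hook(HookWithS3Name, "my_bucket_conn").conn_id)
print(make_hook(HookWithAwsName, "my_bucket_conn").conn_id)
```

Pinning the Airflow version and testing upgrades in a staging environment remains the simpler safeguard; a shim like this mainly helps shared code that must run on two versions during a migration.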
where did all the magic of
machine learning workflows go?
Image credit: XKCD
have you ever
built machine learning pipelines?
no yes ?
Reference: TFX
Interesting reading: Medium
airflow machine learning specifics
● Machine learning jobs are similar to usual jobs
● Factors which can affect the operator choice:
○ is the model built using the same Python version?
○ how much CPU and memory does your model need?
○ how can you make Airflow use your existing
infrastructure?
○ how many concurrent workers do you need?
■ Limitations on scaling the Celery executor
■ Kubernetes executor still at an early stage
demo
Apache Airflow pipeline
(example of model building pipeline)
sample DAG model building pipeline
sample DAG model building pipeline content
sample DAG model building pipeline user data
sample DAG model building pipeline model
demo
Apache Airflow pipeline
(example of hyperparameter tuning)
sample DAG hyperparameter tuning
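The DAG shown on this slide is not part of the transcript. One common shape for such a pipeline is one training job per hyperparameter combination; the sketch below generates those combinations and a job identifier for each. The search space, its values and the `train__` naming scheme are all assumptions for illustration.

```python
from itertools import product

# Hypothetical search space; the names and values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
}


def expand_grid(space):
    """Yield one dict of hyperparameters per combination in the grid."""
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def task_id_for(params):
    """Build a stable job identifier from a parameter combination."""
    return "train__" + "__".join(f"{k}_{v}" for k, v in sorted(params.items()))


trials = list(expand_grid(SEARCH_SPACE))
task_ids = [task_id_for(params) for params in trials]
print(len(trials))   # 6
print(task_ids[0])   # train__learning_rate_0.01__max_depth_3
```

In a DAG file, a loop over `trials` could create one operator instance per entry, all feeding a final job that compares the results.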
getting involved
Image credit: XKCD
airflow install
$ pip install apache-airflow
airflow source-code github
https://github.com/apache/airflow
airflow docs
https://airflow.apache.org/
airflow issue tracker jira
https://issues.apache.org/jira/browse?jql=project=AIRFLOW
airflow community slack
https://apache-airflow.slack.com
is Airflow the right tool for you?
Image credit: XKCD
http://datalab.rocks
find out more
Thank you very much (in Ukrainian)
Thank you (in Russian)
Image credit: Wikipedia Commons
@tati_alchueyr