Building an Analytics
Workflow using Apache
Airflow
Yohei Onishi
PyCon APAC 2019, Feb. 23-24 2019
Presenter Profile
● Yohei Onishi
● Twitter: legoboku, GitHub:
yohei1126
● Data Engineer at a Japanese
retail company
● Based in Singapore since Oct.
2018
● Apache Airflow Contributor
2
Session overview
● Expected audience: data engineers
○ who are working on building a pipeline
○ who are looking for a better workflow solution
● Goal: provide the following so they can start using Airflow
○ Airflow overview and how to author workflows
○ Server configuration and CI/CD in my use case
○ Recommendations for new users (GCP Cloud Composer)
3
Data pipeline
[Diagram: data sources (microservices, enterprise systems, IoT devices) → collect (object storage, message queue) → ETL → analytics → data consumers (microservices, enterprise systems, BI tool)]
4
Our requirements for an ETL workflow
● Already built a data lake on AWS S3 to store structured /
unstructured data
● Want to build a batch-based analytics platform
● Requirements
○ Workflow generation by code (Python) rather than GUI
○ OSS: avoid vendor lock-in
○ Scalable: batch data processing and workflow
○ Simple and easily extensible
○ Workflow visualization 5
Another workflow engine: Apache NiFi
6
Airflow overview
● Brief history
○ Open sourced by Airbnb, now an Apache top-level project
○ Cloud Composer: managed Airflow on GCP
● Characteristics
○ Dynamic workflow generation by Python code
○ Easily extensible so you can fit it to your use case
○ Scalable by using a message queue to orchestrate an
arbitrary number of workers
7
Example: Copy a file from one S3 bucket to another
[Diagram: export records as CSV in the local region (Singapore), then transfer the file to a regional bucket (US, EU, or local)]
8
DEMO: UI and source code
sample code: https://github.com/yohei1126/pycon-apac-2019-airflow-sample 9
Concepts: directed acyclic graph (DAG), operator, task, etc.
custom_param_per_dag = {'sg': { ... }, 'eu': { ... }, 'us': { ... }}
for region, v in custom_param_per_dag.items():
    dag = DAG('shipment_{}'.format(region), ...)
    t1 = PostgresToS3Operator(task_id='db_to_s3', ...)
    t2 = S3CopyObjectOperator(task_id='s3_to_s3', ...)
    t1 >> t2
    globals()[dag.dag_id] = dag
10
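The loop above can be sketched with plain Python to show why the last line matters: the Airflow scheduler only discovers DAG objects bound to module-level names, so each DAG built in the loop must be written into globals(). The dummy DAG class below is a stand-in so the sketch runs without Airflow installed.

```python
# Stand-alone sketch of the dynamic-DAG pattern; DAG is a stub,
# not airflow.DAG.
class DAG:
    def __init__(self, dag_id):
        self.dag_id = dag_id

custom_param_per_dag = {'sg': {}, 'eu': {}, 'us': {}}
for region, params in custom_param_per_dag.items():
    dag = DAG('shipment_{}'.format(region))
    globals()[dag.dag_id] = dag  # expose the DAG at module level

print(sorted(name for name in globals() if name.startswith('shipment_')))
# → ['shipment_eu', 'shipment_sg', 'shipment_us']
```

Without the globals() assignment, every iteration would rebind the same local name `dag` and only the last DAG would remain visible.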
Template
t1 = PostgresToS3Operator(
    task_id='db_to_s3',
    sql="SELECT * FROM shipment WHERE region = '{{ params.region }}' "
        "AND ship_date = '{{ execution_date.strftime('%Y-%m-%d') }}'",
    bucket=default_args['source_bucket'],
    object_key='{{ params.region }}/'
               '{{ execution_date.strftime("%Y%m%d%H%M%S") }}.csv',
    params={'region': region},
    dag=dag)
11
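For intuition, here is roughly what those templated fields render to at runtime. The execution date is an arbitrary example, and plain str.format stands in for Airflow's Jinja rendering:

```python
from datetime import datetime

# Arbitrary example values standing in for Airflow's template context.
execution_date = datetime(2019, 2, 23, 9, 30, 0)
region = 'sg'

sql = ("SELECT * FROM shipment WHERE region = '{}' "
       "AND ship_date = '{}'").format(region,
                                      execution_date.strftime('%Y-%m-%d'))
object_key = '{}/{}.csv'.format(region,
                                execution_date.strftime('%Y%m%d%H%M%S'))

print(sql)         # ... AND ship_date = '2019-02-23'
print(object_key)  # sg/20190223093000.csv
```

Because the object key embeds the execution timestamp, each scheduled run writes a distinct CSV per region.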
Operator
class PostgresToS3Operator(BaseOperator):
    template_fields = ('sql', 'bucket', 'object_key')

    def __init__(self, ..., *args, **kwargs):
        super(PostgresToS3Operator, self).__init__(*args, **kwargs)
        ...

    def execute(self, context):
        ...
12
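The mechanism behind template_fields can be sketched without Airflow: before execute() runs, the engine renders every attribute named in template_fields against the runtime context. Below, BaseOperator is a stub and str.format stands in for the Jinja2 rendering Airflow actually performs:

```python
# Minimal stand-in for Airflow's templated-field rendering;
# this BaseOperator is a stub, not Airflow's.
class BaseOperator:
    template_fields = ()

    def render_template_fields(self, context):
        for name in self.template_fields:
            rendered = getattr(self, name).format(**context)
            setattr(self, name, rendered)

class PostgresToS3Operator(BaseOperator):
    template_fields = ('sql', 'object_key')

    def __init__(self, sql, object_key):
        self.sql = sql
        self.object_key = object_key

    def execute(self, context):
        # Airflow renders templated fields before execute() runs;
        # done inline here for brevity.
        self.render_template_fields(context)
        return self.sql, self.object_key

op = PostgresToS3Operator(
    sql="SELECT * FROM shipment WHERE region = '{region}'",
    object_key='{region}/export.csv')
print(op.execute({'region': 'sg'}))
# → ("SELECT * FROM shipment WHERE region = 'sg'", 'sg/export.csv')
```

Any string attribute listed in template_fields gets this treatment, which is why the real operator exposes sql, bucket, and object_key there.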
HA Airflow cluster
[Diagram: worker nodes (1, 2, ..., N), each running 1..N executors; master node 1 runs the scheduler and a web server, master node 2 runs a second web server, with the admin reaching both through a load balancer; shared Airflow metadata DB, Celery result backend, and message broker]
13
http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
CI/CD pipeline
[Diagram: raising or merging a PR on the GitHub repo publishes to AWS SNS and AWS SQS; the Airflow worker polls the queue and runs an Ansible script that performs git pull, test, and deployment]
14
Monitoring
[Diagram: the Airflow worker (EC2) notifies Slack via the Airflow Slack webhook if a DAG fails; AWS CloudWatch notifies Slack via webhook if a CloudWatch alarm is triggered]
15
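A DAG-failure notifier along these lines can be passed as on_failure_callback when defining the DAG. The webhook URL is a placeholder, and the context keys below are a simplified subset of what Airflow actually passes to the callback (the real context carries the task instance, from which dag_id, task_id, and the timestamp can be read):

```python
import json
from urllib import request

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/XXX'  # placeholder

def build_failure_message(context):
    # Simplified context: {'dag_id': ..., 'task_id': ..., 'ts': ...}
    return ':red_circle: DAG {dag_id} task {task_id} failed at {ts}'.format(
        **context)

def notify_slack_on_failure(context):
    # One HTTP POST to the Slack incoming webhook per failed task.
    payload = json.dumps({'text': build_failure_message(context)}).encode()
    req = request.Request(SLACK_WEBHOOK_URL, data=payload,
                          headers={'Content-Type': 'application/json'})
    request.urlopen(req)

# e.g. DAG('shipment_sg', ..., on_failure_callback=notify_slack_on_failure)
```

Keeping the message-building separate from the HTTP call makes the formatting easy to unit test without hitting Slack.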
GCP Cloud Composer
● Fully managed Airflow cluster provided by GCP
○ Fully managed
○ Built-in integration with other GCP services
● To focus on business logic, you should build your Airflow
cluster with Cloud Composer
16
Create a cluster using CLI
$ gcloud composer environments create ENVIRONMENT_NAME \
    --location LOCATION \
    OTHER_ARGUMENTS
● The new Airflow cluster will be deployed as a Kubernetes cluster on GKE
● We usually specify the following options as OTHER_ARGUMENTS
○ infra: instance type, disk size, VPC network, etc.
○ software configuration: Python version, Airflow version, etc.
17
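A concrete invocation might look like the following; all values are illustrative, and flag availability depends on your gcloud and Composer versions.

```shell
$ gcloud composer environments create my-airflow-env \
    --location asia-southeast1 \
    --node-count 3 \
    --machine-type n1-standard-2 \
    --disk-size 50GB \
    --python-version 3
```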
Deploy your source code to the cluster
$ gcloud composer environments storage dags import \
    --environment my-environment --location us-central1 \
    --source test-dags/quickstart.py
● This will upload your source code to a cluster-specific GCS bucket.
○ You can also upload the file to the bucket directly
● The file will then be deployed automatically
18
Monitoring the cluster using Stackdriver
19
Demo: GCP Cloud Composer
● Create an environment
● Stackdriver logging
● GKE as backend
20
Summary
● Data engineers have to build reliable and scalable data
pipelines to accelerate data analytics activities
● Airflow is a great tool to author and monitor workflows
● An HA Airflow cluster is required for high availability
● GCP Cloud Composer enables us to build a cluster easily
and focus on business logic
21
References
● Apache Airflow
● GCP Cloud Composer
● Airflow: a workflow management platform
● ETL best practices in Airflow 1.8
● Data Science for Startups: Data Pipelines
● Airflow: Tips, Tricks, and Pitfalls
22