Scale and Optimize
Data Engineering Pipelines
with Software Engineering Best Practices:
Modularity and Automated Testing
© 2020 / Levi Strauss & Co.
Qiang MENG
Sr. Data Engineer, Levi Strauss & Co.
Qiang MENG
• Levi Strauss & Co., Belgium
• Databricks – AWS – Airflow – Spark - Dataiku
• H&M Group, Sweden
• Databricks – Azure – Airflow – Spark - Hive
• Softbank Robotics, France
• AWS – Luigi – Spark
Levi’s Data, Analytics & AI
Demand Forecast
Loyalty
Price & Promo
Recommendation
[Chart: Q3 Digital Sales Increase by region (USA, EU, AMA); vertical axis 0% to 70%]
NLP
Allocation
Computer Vision
Agenda
I. Hypothesis
• A Data Engineering project in Apache Airflow
II. Optimization
• Software Engineering best practices
III. Auto-testing
• Unit test, Functional test, End-to-End test, and Smoke test
I. Hypothesis
Hypothesis
Data Engineering – Cloud - Airflow - TB-level - Batch
Hypothesis
Dev Environments
▪ Local – Dev – Preprod – Prod
Sample Data
▪ ETL in local / Dev
Code Versioning – GIT
• feature - dev - release - hotfix – master
Conventions
• DB, Schema, Table, Column, Pipeline
Container-orchestration
• Docker, K8S, …
Spark clusters
• Databricks, EMR, …
Data Versioning
• Delta Lake, …
CI/CD
Airflow
▪ airflow
• dags
• operators
• …
▪ utils
▪ modules
▪ docs
▪ tests
▪ others
• airflow
• dags
• operators
• …
• others
II. Optimization
A Big Data ETL
Optimization - Airflow DAG
A simple template of tasks – Data Feeds Based
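One way to read a feed-based template of tasks is that every data feed gets the same ordered chain of steps, generated from configuration rather than written by hand. The sketch below is plain Python with no Airflow import. The step names, the FeedConfig fields, the example paths, and build_task_chain are all hypothetical; in a real DAG file each generated entry would become an operator instance.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical step names; the slide does not list the actual template steps.
TEMPLATE_STEPS = ["ingest", "deduplicate", "validate", "load"]


@dataclass(frozen=True)
class FeedConfig:
    """Minimal description of one data feed (illustrative fields only)."""
    name: str
    source_path: str
    target_table: str


def build_task_chain(feed: FeedConfig) -> List[str]:
    """Return the ordered task ids for one feed, following the shared template."""
    return [f"{feed.name}__{step}" for step in TEMPLATE_STEPS]


# Example feeds with made-up locations and table names.
feeds = [
    FeedConfig("orders", "s3://example-bucket/orders/", "sales.orders"),
    FeedConfig("stores", "s3://example-bucket/stores/", "ref.stores"),
]
chains = {feed.name: build_task_chain(feed) for feed in feeds}
```

Adding a feed then means adding one configuration entry, not copying a DAG.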
Optimization - Airflow Operator
• Default Operators
• Keep as they are
• Customized Operators
• Hive
• Spark
• Deduplication
• Data Validation
• Data Profiling
• …
Do the real work
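As an illustration of a customized operator that does the real work, here is a minimal deduplication sketch. It assumes rows arrive as dictionaries and copies only the execute(context) convention of Airflow operators; DeduplicationStep is a hypothetical class that does not inherit from Airflow's BaseOperator, so it runs without Airflow installed.

```python
from typing import Any, Dict, Iterable, List, Sequence


class DeduplicationStep:
    """Hypothetical stand-in for a customized operator.

    It follows the execute(context) convention of Airflow operators but is a
    plain class, so the example has no Airflow dependency.
    """

    def __init__(self, key_columns: Sequence[str]) -> None:
        self.key_columns = list(key_columns)

    def execute(self, context: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Keep the first row seen for each key and drop later repeats."""
        rows: Iterable[Dict[str, Any]] = context["rows"]
        seen = set()
        unique_rows = []
        for row in rows:
            key = tuple(row[col] for col in self.key_columns)
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
        return unique_rows


sample = [
    {"order_id": 1, "amount": 10},
    {"order_id": 1, "amount": 10},
    {"order_id": 2, "amount": 25},
]
deduplicated = DeduplicationStep(["order_id"]).execute({"rows": sample})
```

At TB scale the same logic would more likely be pushed down to Spark or SQL; the in-memory loop is only there to keep the example self-contained.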
Optimization - Modules and Utils
▪ Modules
▪ Utils
Optimization - OOP
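The slide content is not preserved in this transcript, so the following is only a guess at the kind of structure meant: a base class holds the shared flow and each pipeline step overrides one method. PipelineStep, DropNullIds, and UppercaseCountry are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, object]


class PipelineStep(ABC):
    """Hypothetical base class: shared run() flow, step-specific transform()."""

    def run(self, rows: List[Row]) -> List[Row]:
        result = self.transform(rows)
        # Shared post-condition enforced for every subclass.
        if not isinstance(result, list):
            raise TypeError(f"{type(self).__name__}.transform must return a list")
        return result

    @abstractmethod
    def transform(self, rows: List[Row]) -> List[Row]:
        """Step-specific logic, implemented by each subclass."""


class DropNullIds(PipelineStep):
    def transform(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if row.get("id") is not None]


class UppercaseCountry(PipelineStep):
    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**row, "country": str(row["country"]).upper()} for row in rows]


rows = [{"id": 1, "country": "be"}, {"id": None, "country": "se"}]
for step in (DropNullIds(), UppercaseCountry()):
    rows = step.run(rows)
```

Because every step shares the run() entry point, steps can be chained, logged, and tested in the same way.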
Optimization - Design Pattern
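The transcript does not say which pattern the slide showed. As one commonly used option in pipeline code, a factory with a registry picks an implementation from a configuration value. CsvReader, ParquetReader, READERS, and make_reader are hypothetical names chosen for this sketch.

```python
from typing import Callable, Dict


class CsvReader:
    def describe(self) -> str:
        return "reads CSV files"


class ParquetReader:
    def describe(self) -> str:
        return "reads Parquet files"


# Registry mapping a configuration value to the class that handles it.
READERS: Dict[str, Callable[[], object]] = {
    "csv": CsvReader,
    "parquet": ParquetReader,
}


def make_reader(file_format: str):
    """Factory: return a reader instance for the configured format."""
    try:
        return READERS[file_format]()
    except KeyError:
        raise ValueError(f"Unsupported format: {file_format!r}") from None


reader = make_reader("parquet")
```

Supporting a new format then only requires registering one more class; callers of make_reader stay unchanged.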
Optimization - Formatting
Optimization - Docs
• Docstring - Sphinx
• Data Catalog – Markdown
Docstring
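A docstring written with Sphinx-style field lists might look like the sketch below, which Sphinx's autodoc extension can then render into HTML documentation. The function count_nulls is a hypothetical helper, not taken from the talk.

```python
from typing import Dict, List


def count_nulls(rows: List[Dict[str, object]], column: str) -> int:
    """Count rows whose value in ``column`` is missing.

    :param rows: Records to inspect, one dict per row.
    :param column: Name of the column to check.
    :returns: Number of rows where the column is absent or ``None``.
    :raises ValueError: If ``column`` is an empty string.
    """
    if not column:
        raise ValueError("column must be a non-empty string")
    return sum(1 for row in rows if row.get(column) is None)


null_count = count_nulls([{"sku": "A1"}, {"sku": None}, {}], "sku")
```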
Auto Data Catalog with Markdown
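Generating a data catalog automatically can be as simple as rendering table metadata into Markdown. In this sketch the metadata is a hard-coded dictionary; the table name, columns, and render_catalog_entry are assumptions for illustration, and in practice the metadata would likely be read from the metastore or from table definitions.

```python
from typing import Dict, List

# Hypothetical table metadata; in practice it might come from the metastore.
TABLE_SCHEMA: Dict[str, object] = {
    "name": "sales.orders",
    "description": "One row per order line.",
    "columns": [
        {"name": "order_id", "type": "bigint", "description": "Order identifier"},
        {"name": "amount", "type": "decimal(10,2)", "description": "Line amount"},
    ],
}


def render_catalog_entry(table: Dict[str, object]) -> str:
    """Render one table's metadata as a Markdown section with a column table."""
    lines: List[str] = [
        f"## {table['name']}",
        "",
        str(table["description"]),
        "",
        "| Column | Type | Description |",
        "| --- | --- | --- |",
    ]
    for column in table["columns"]:
        lines.append(
            f"| {column['name']} | {column['type']} | {column['description']} |"
        )
    return "\n".join(lines)


catalog_markdown = render_catalog_entry(TABLE_SCHEMA)
```

Running this as part of CI/CD keeps the catalog in step with the code that defines the tables.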
III. Auto-testing
Auto-Testing
• Unit tests
• Integration tests
• End-to-End tests
• Smoke tests
Unit Tests
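A unit test checks one small function in isolation, with no cluster or database involved. The sketch below uses the standard library's unittest; normalize_sku is a hypothetical helper standing in for any function from the modules or utils folders.

```python
import unittest


def normalize_sku(raw: str) -> str:
    """Trim whitespace and uppercase a product code (hypothetical helper)."""
    return raw.strip().upper()


class NormalizeSkuTest(unittest.TestCase):
    def test_strips_and_uppercases(self):
        self.assertEqual(normalize_sku("  ab-501 "), "AB-501")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalize_sku(""), "")


# Run the suite programmatically so the example works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeSkuTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```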
Integration Tests
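An integration test exercises the code together with a real dependency such as a database. As a minimal sketch, an in-memory SQLite database stands in for the warehouse here; that substitution, the orders table, and load_orders are assumptions made so the example runs anywhere.

```python
import sqlite3


def load_orders(conn: sqlite3.Connection, rows) -> int:
    """Insert rows into the orders table and return the resulting row count."""
    conn.executemany("INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


# An in-memory SQLite database stands in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

loaded = load_orders(conn, [(1, 10.0), (2, 25.5)])
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
conn.close()
```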
End-to-End Tests
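An end-to-end test runs the whole pipeline on sample data and checks only the final output. The toy pipeline below (read a CSV, drop duplicate order ids, write the result) is hypothetical and uses a temporary directory in place of real storage.

```python
import csv
import tempfile
from pathlib import Path


def run_pipeline(source: Path, target: Path) -> None:
    """Toy pipeline: read CSV, drop duplicate order ids, write the result."""
    with source.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    seen, unique = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            unique.append(row)
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(unique)


# Feed a small sample file in, then read back what the pipeline produced.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "orders_sample.csv"
    target = Path(workdir) / "orders_clean.csv"
    source.write_text("order_id,amount\n1,10\n1,10\n2,25\n")
    run_pipeline(source, target)
    with target.open(newline="") as handle:
        output_rows = list(csv.DictReader(handle))
```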
Smoke Tests
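A smoke test is a fast, shallow check that nothing is obviously broken before deeper tests run. One minimal form, sketched here with the standard library only, is to confirm that every Python file in a folder at least parses. The function smoke_check and the file names are hypothetical; with Airflow installed, loading the folder into a DagBag and inspecting its import errors serves a similar purpose.

```python
import ast
import tempfile
from pathlib import Path
from typing import Dict


def smoke_check(folder: Path) -> Dict[str, str]:
    """Return {file name: error message} for files that fail to parse."""
    failures = {}
    for path in sorted(folder.glob("*.py")):
        try:
            ast.parse(path.read_text(), filename=str(path))
        except SyntaxError as error:
            failures[path.name] = error.msg
    return failures


# Build a throwaway folder with one valid and one broken file.
with tempfile.TemporaryDirectory() as workdir:
    dags = Path(workdir)
    (dags / "good_dag.py").write_text("schedule = '@daily'\n")
    (dags / "broken_dag.py").write_text("def broken(:\n    pass\n")
    failures = smoke_check(dags)
```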
READY FOR THE NEXT 167 YEARS
Sign off/Thank You/Questions?
READY FOR THE NEXT 166 YEARS
THANK YOU
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
