Vicky Avison Cox Automotive UK
Alex Bush KPMG Lighthouse New Zealand
Best Practices for Building and Deploying Data
Pipelines in Apache Spark
#UnifiedDataAnalytics #SparkAISummit
Cox Automotive
Data Platform
KPMG Lighthouse
Centre of Excellence for Information and Analytics
We provide services across the data value chain including:
● Data strategy and analytics maturity assessment
● Information management
● Data engineering
● Data warehousing, business intelligence (BI) and data visualisation
● Data science, advanced analytics and artificial intelligence (AI)
● Cloud-based analytics services
What is this talk about?
- What are data pipelines and who builds them?
- Why is data pipeline development difficult to get right?
- How have we changed the way we develop and deploy our data
pipelines?
What are data pipelines and who builds them?
What do we mean by ‘Data Pipeline’?
Data Sources (e.g. files, relational databases, REST APIs) feed a Data Platform (storage + compute). The pipeline has two jobs:
1. Ingest the raw data (raw data: table_a, table_b, table_c, table_d, table_e, table_f, ...)
2. Prepare the data for use in further analysis and dashboards, meaning:
   a. Deduplication
   b. Cleansing
   c. Enrichment
   d. Creation of data models, i.e. joins, aggregations etc.
   (prepared data: data model_a, data model_b, data model_c, data model_d)
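To make step 2 concrete, here is a minimal sketch in plain Spark of deduplicating a raw table, cleansing it and building a simple data model. The table names, column names and paths are hypothetical and purely for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object PrepareDataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("prepare-data-sketch").getOrCreate()

    // a. Deduplication: keep only the latest version of each primary key
    val rawA = spark.read.parquet("/data/raw/table_a")
    val latest = Window.partitionBy("id").orderBy(col("last_updated").desc)
    val dedupedA = rawA.withColumn("rn", row_number().over(latest))
      .filter(col("rn") === 1)
      .drop("rn")

    // b./c. Cleansing and enrichment: fix types, user-friendly names, derived columns
    val cleanA = dedupedA
      .withColumn("amount", col("amount").cast("decimal(18,2)"))
      .withColumnRenamed("cust_id", "customer_id")
      .withColumn("ingest_date", current_date())

    // d. Creation of a data model: join with another prepared table and aggregate
    val cleanB = spark.read.parquet("/data/prepared/table_b")
    val dataModelA = cleanA.join(cleanB, Seq("customer_id"))
      .groupBy("customer_id")
      .agg(sum("amount").as("total_amount"))

    dataModelA.write.mode("overwrite").parquet("/data/prepared/data_model_a")
  }
}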
Who is in a data team?
Data Engineering: deep understanding of the technology; know how to build robust, performant data pipelines.
Business Intelligence and Data Science: deep understanding of the data; know how to extract business value from the data.
Why is data pipeline development difficult to get
right?
What do we need to think about when building a pipeline?
1. How do we handle late-arriving or duplicate data?
2. How can we ensure that if the pipeline fails part-way through, we can run it again without
any problems?
3. How do we avoid the small-file problem?
4. How do we monitor data quality?
5. How do we configure our application? (credentials, input paths, output paths etc.)
6. How do we maximise performance?
7. How do we extract only what we need from the source (e.g. only extract new records from an
RDBMS)? (See the sketch below.)
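As an illustration of questions 2 and 7, here is a rough sketch of a watermark-based incremental extract from an RDBMS with an idempotent write. The connection details, table and column names are made up for the example; a real pipeline would read the watermark from the storage layer rather than hard-code it.

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object IncrementalExtractSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-extract-sketch").getOrCreate()

    // Highest last_updated value already ingested (hard-coded here for brevity)
    val lastWatermark = Timestamp.valueOf("2019-10-01 00:00:00")

    // 7. Only pull rows that changed since the last extract
    val newRows = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost;databaseName=sales") // hypothetical
      .option("dbtable", s"(SELECT * FROM dbo.orders WHERE last_updated > '$lastWatermark') src")
      .option("user", "etl_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // 2. Write to a path keyed on the watermark so a failed run can simply be
    //    re-executed: the same extract overwrites the same location
    newRows.write
      .mode("overwrite")
      .parquet(s"/data/raw/orders/extract_watermark=${lastWatermark.getTime}")
  }
}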
What about the business logic?
Starting from the deduplicated raw data (table_a, table_b, table_c, table_d):
- cleanup, convert data types, create user-friendly column names, add derived columns etc.
- a_b_model: join a and b together
- d_counts: group by and perform counts
- b_c_d_model: join b, c and the aggregated d together
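Written by hand against the DataFrame API, that logic might look roughly like the sketch below (column names and paths are invented for illustration); even simple-looking business logic quickly picks up concerns such as reading, writing and join keys.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object BusinessLogicSketch {
  // cleanup, convert data types, create user-friendly column names, add derived columns etc.
  def cleanup(df: DataFrame): DataFrame =
    df.withColumnRenamed("cust_id", "customer_id")
      .withColumn("created_date", to_date(col("created_ts")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("business-logic-sketch").getOrCreate()

    val Seq(a, b, c, d) = Seq("table_a", "table_b", "table_c", "table_d")
      .map(t => cleanup(spark.read.parquet(s"/data/deduplicated/$t")))

    val aBModel = a.join(b, Seq("pk1"))                              // a_b_model
    val dCounts = d.groupBy("pk2").agg(count("*").as("cnt"))         // d_counts
    val bCDModel = b.join(c, Seq("pk2")).join(dCounts, Seq("pk2"))   // b_c_d_model

    aBModel.write.mode("overwrite").parquet("/data/models/a_b_model")
    bCDModel.write.mode("overwrite").parquet("/data/models/b_c_d_model")
  }
}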
What about deployments?
Traditional software development (e.g. web development): software is deployed to an environment (a server); it receives interactions and returns responses.
Data development: software is deployed to a deployment location, and it reads input data from an environment (a data location) made up of paths and Hive databases.
What are the main challenges?
- A lot of overhead for every new data pipeline, even when the problems are
very similar each time
- Production-grade business logic is hard to write without specialist Data
Engineering skills
- No tools or best practice around deploying and managing environments for
data pipelines
How have we changed the way we develop and
deploy our data pipelines?
A long(ish) time ago, in an office quite far away….
How were we dealing with the main challenges?
A lot of overhead for every new data pipeline, even when the problems are very
similar each time
We were… shoehorning new pipeline requirements into a single application in an
attempt to avoid the overhead
How were we dealing with the main challenges?
Production-grade business logic is hard to write without specialist Data
Engineering skills
We were… taking business logic defined by our BI and Data Science colleagues
and reimplementing it
How were we dealing with the main challenges?
No tools or best practice around deploying and managing environments for data
pipelines
We were… manually deploying jars, passing environment-specific configuration
to our applications each time we ran them
Could we make better use of the skills in the team?
Data Engineering (deep understanding of the technology):
- Data Platform
- Data Ingestion Applications
- Tools and Frameworks
Business Intelligence and Data Science (deep understanding of the data):
- Business Engagement
- Modelling
- Consulting
- Business Logic Definition
- Data Exploration
Shared between the two: Business Logic Applications
What tools and frameworks would we need to provide?
- Third-party services and libraries: Spark and Hadoop, KMS, Delta Lake, Deequ, etc.
- Data Engineering frameworks (built on top of those): Configuration Management, Idempotency and Atomicity, Deduplication, Compaction, Table Metadata Management, Action Coordination, Boilerplate and Structuring
- Data Engineering tools: Environment Management, Application Deployment
- Data Engineering Applications: Data Ingestion
- Data Science and Business Intelligence Applications: Business Logic
How would we design a Data Engineering framework?
Business logic is expressed as { Inputs }, { Transformations } and { Outputs } through high-level APIs sitting on top of Spark and Hadoop. Performance Optimisations, Data Quality Monitoring, Deduplication, Compaction, Configuration Management etc. sit beneath those APIs, giving us:
- Complexity hidden behind high-level APIs
- Intuitive structuring of business logic code
- Injection of optimisations and monitoring
- Efficient scheduling of transformations and actions
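As a toy illustration of that design (this is not Waimak's actual implementation), a flow can be modelled as a set of labelled inputs plus a list of registered transformations, none of which run until execute() is called; that gap is exactly where a framework can inject optimisations, monitoring and scheduling.

import org.apache.spark.sql.{DataFrame, SparkSession}

case class ToyFlow(spark: SparkSession,
                   inputs: Map[String, String] = Map.empty,
                   steps: Seq[Map[String, DataFrame] => Map[String, DataFrame]] = Seq.empty) {

  // Register a labelled input; nothing is read yet
  def input(label: String, path: String): ToyFlow =
    copy(inputs = inputs + (label -> path))

  // Register a transformation from one label to a new label; nothing runs yet
  def transform(in: String)(out: String)(f: DataFrame => DataFrame): ToyFlow =
    copy(steps = steps :+ ((dfs: Map[String, DataFrame]) => dfs + (out -> f(dfs(in)))))

  // Only here does anything touch Spark, so the framework is free to reorder,
  // cache or monitor the registered steps before running them
  def execute(): Map[String, DataFrame] = {
    val loaded = inputs.map { case (label, path) => label -> spark.read.parquet(path) }
    steps.foldLeft(loaded)((dfs, step) => step(dfs))
  }
}

A pipeline would then read as ToyFlow(spark).input("table_a", "/data/raw/table_a").transform("table_a")("model_a")(_.dropDuplicates()).execute(), mirroring the { Inputs } / { Transformations } / { Outputs } shape above.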
How would we like to manage deployments?
- master (v0.1, then v0.2): deployed jar my_project-0.1.jar, then my_project-0.2.jar; path /data/prod/my_project; Hive database prod_my_project
- feature/one (v0.2-SNAPSHOT): deployed jar my_project_feature_one-0.2-SNAPSHOT.jar; path /data/dev/my_project/feature_one; Hive database dev_my_project_feature_one
- feature/two (v0.2-SNAPSHOT): deployed jar my_project_feature_two-0.2-SNAPSHOT.jar; path /data/dev/my_project/feature_two; Hive database dev_my_project_feature_two
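The mapping from project, environment and branch to paths and Hive databases boils down to a small naming convention. The sketch below illustrates that convention; it is not the actual Waimak tooling.

object EnvironmentNaming {
  // feature/one -> feature_one
  def normalise(branch: String): String =
    branch.toLowerCase.replaceAll("[^a-z0-9]+", "_")

  def basePath(project: String, environment: String, branch: String): String =
    if (environment == "prod") s"/data/prod/$project"
    else s"/data/$environment/$project/${normalise(branch)}"

  def hiveDatabase(project: String, environment: String, branch: String): String =
    if (environment == "prod") s"${environment}_$project"
    else s"${environment}_${project}_${normalise(branch)}"

  def main(args: Array[String]): Unit = {
    println(basePath("my_project", "dev", "feature/one"))     // /data/dev/my_project/feature_one
    println(hiveDatabase("my_project", "dev", "feature/one")) // dev_my_project_feature_one
  }
}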
What does this look like in practice?

Simpler data ingestion
- Retrieve server configuration from a combination of Spark conf and Databricks Secrets
- Pull deltas from SQLServer temporal tables and store them in the storage layer
- The storage layer will capture last updated values and primary keys
- Small files will be compacted once between 11pm and 4am
- The flow is lazy: nothing happens until execute is called

case class SQLServerConnectionDetails(server: String, user: String, password: String)

val dbConf = CaseClassConfigParser[SQLServerConnectionDetails](
  SparkFlowContext(spark), "app1.dbconf"
)

val flow = Waimak.sparkFlow(spark)
  .extractToStorageFromRDBM(
    rdbmExtractor = new SQLServerTemporalExtractor(spark, dbConf),
    dbSchema = ...,
    storageBasePath = ...,
    tableConfigs = ...,
    extractDateTime = ZonedDateTime.now(),
    doCompaction = runSingleCompactionDuringWindow(23, 4)
  )("table1", "table2", "table3")

flow.execute()
Simpler business logic development
- Read from storage layer and deduplicate
- Perform two transformations on labels using the DataFrame API, generating two more labels
- Perform a Spark SQL transformation on a table and label generated during a transform, generating one more label
- Write labels to two different databases
- Add explicit caching and data quality monitoring actions
- Execute with flow.execute()

val flow = Waimak.sparkFlow(spark)
  .snapshotFromStorage(basePath, tsNow)("table1", "table2", "table3")
  .transform("table1", "table2")("model1")(
    (t1, t2) => t1.join(t2, "pk1")
  )
  .transform("table3", "model1")("model2")(
    (t3, m1) => t3.join(m1, "pk2")
  )
  .sql("table1", "model2")("reporting1",
    """select m2.pk1, count(t1.col1) from model2 m2
       left join table1 t1 on m2.pk1 = t1.fpk1
       group by m2.pk1"""
  )
  .writeHiveManagedTable("model_db")("model1", "model2")
  .writeHiveManagedTable("reporting_db")("reporting1")
  .sparkCache("model2")
  .addDeequCheck(
    "reporting1",
    Check(Error, "Not valid PK").isPrimaryKey("pk1")
  )(ExceptionQualityAlert())
Simpler environment management

case class MySparkAppEnv(project: String,     //e.g. my_spark_app
                         environment: String, //e.g. dev
                         branch: String       //e.g. feature/one
                        ) extends HiveEnv

The environment consists of a base path (/data/dev/my_spark_app/feature_one/) and a Hive database (dev_my_spark_app_feature_one).

Define application logic given a SparkSession and an environment:

object MySparkApp extends SparkApp[MySparkAppEnv] {
  override def run(sparkSession: SparkSession, env: MySparkAppEnv): Unit =
    //We have a base path and Hive database available in env via env.basePath and env.baseDBName
    ???
}

Use MultiAppRunner to run apps individually or together with dependencies:

spark.waimak.apprunner.apps = my_spark_app
spark.waimak.apprunner.my_spark_app.appClassName = com.example.MySparkApp
spark.waimak.environment.my_spark_app.environment = dev
spark.waimak.environment.my_spark_app.branch = feature/one
Simpler deployments
Questions?
github.com/CoxAutomotiveDataSolutions/waimak