WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Simplify and Scale Data Engineering Pipelines with Delta Lake
Amanda Moran, Databricks
#UnifiedDataAnalytics #SparkAISummit
Today’s Speaker
● Solutions Architect @ Databricks
● MS Computer Science, BS Biology
● Previously: HP, Teradata, DataStax, Esgyn
● PMC member and Apache Committer on Apache Trafodion
● 5 different distributed systems
● Course with Udacity on Data Engineering
Agenda
● Data Engineers’ Nightmares and Dreams
● The Data Lifecycle vs. the Delta Lifecycle
● Transitioning a Data Pipeline to Delta
● How Dreams Become True
● DEMO!
● How to Use Delta
The Data Engineer’s Journey…
[Diagram: Events → Stream → Table (data gets written continuously) → Stream / Batch → AI & Reporting]
The Data Engineer’s Journey…
[Diagram: Events → Stream → Table (data gets written continuously) → Batch → Table (data gets compacted every hour) → Batch / Stream → AI & Reporting, with Reprocessing and Update & Merge paths added]
The Data Engineer’s Journey… into a Nightmare
[Diagram: same pipeline with Validation and a Unified View added; callout: “Updates & Merge get complex with a data lake”]
The Data Engineer’s Journey… into a Nightmare
[Diagram: same as above]
Can this be simplified?
A Data Engineer’s Dream...
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake → AI & Reporting]
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming.
What’s missing?
1. Ability to read consistent data while data is being written
2. Ability to read incrementally from a large table with good throughput
3. Ability to roll back in case of bad writes
4. Ability to replay historical data along with newly arrived data
5. Ability to handle late-arriving data without having to delay downstream processing
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake (?) → AI & Reporting]
So… What is the answer?
Structured Streaming + Delta Lake = The Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
The Delta Architecture
[Diagram: CSV, JSON, TXT… and Kinesis → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics and AI & Reporting; data quality increases from Bronze to Gold]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
What does this remind you of?
The Data Lifecycle of the Past
[Diagram: the same Bronze / Silver / Gold flow built on a data lake: CSV, JSON, TXT… and Kinesis → Raw Ingestion → Filtered, Cleaned, Augmented → Business-level Aggregates → Streaming Analytics and AI & Reporting]
The Data Lifecycle of the Past
[Diagram: same as above, with the intermediate tables stored in a data lake]
The Data Lifecycle
[Diagram: same pipeline, with Apache Spark doing the processing between the data lake stages]
The Data Lifecycle
[Diagram: same pipeline, with Apache Spark for processing and a DW/OLAP system serving the reporting layer]
Transitioning from the Data Lifecycle
to the Delta Lake Lifecycle
The Delta Lake Lifecycle
[Diagram: CSV, JSON, TXT… and Kinesis → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics and AI & Reporting]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
The Delta Lake Lifecycle
[Diagram: same Bronze / Silver / Gold flow; this slide describes the Bronze layer]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
The Delta Lake Lifecycle
[Diagram: same flow; this slide describes the Silver layer]
Intermediate data with some cleanup applied.
Queryable for easy debugging!
The Delta Lake Lifecycle
[Diagram: same flow; this slide describes the Gold layer]
Clean data, ready for consumption.
Read with Spark or Presto*
The Delta Lake Lifecycle
[Diagram: same flow; the arrows between layers are streams]
Streams move data through the Delta Lake:
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
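For concreteness, a minimal sketch of one such stream in PySpark, assuming hypothetical Bronze/Silver paths, a hypothetical checkpoint location, and an illustrative cleanup filter; trigger(once=True) gives the manually triggered mode, and dropping the trigger runs it continuously at low latency:

  from pyspark.sql import functions as F

  # Stream data from the Bronze table into the Silver table.
  # Paths, checkpoint location, and the filter column are illustrative.
  (spark.readStream
      .format("delta")
      .load("/delta/bronze/events")                         # hypothetical Bronze path
      .where(F.col("event_type").isNotNull())               # hypothetical cleanup rule
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/_checkpoints/silver_events")
      .trigger(once=True)                                   # manual trigger; omit for low latency
      .start("/delta/silver/events"))                       # hypothetical Silver path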
The Delta Lake Lifecycle
[Diagram: same flow]
Delta Lake also supports batch jobs and standard DML:
• INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Upserts
• GDPR, CCPA
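A rough sketch of what an upsert and a GDPR/CCPA-style delete can look like with the DeltaTable Python API (available in later Delta Lake releases than the 0.1.0 package shown at the end of this deck); the table path, join key, incoming DataFrame, and predicate are illustrative:

  from delta.tables import DeltaTable

  events = DeltaTable.forPath(spark, "/delta/silver/events")   # hypothetical table path

  # Upsert: merge a DataFrame of new/changed rows on a hypothetical key column.
  # updates_df is an existing DataFrame of incoming rows (assumed).
  (events.alias("t")
      .merge(updates_df.alias("u"), "t.event_id = u.event_id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())

  # GDPR / CCPA: delete a single user's rows on request (predicate is illustrative).
  events.delete("user_id = '1234'")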
The Delta Lake Lifecycle
[Diagram: same flow, with DELETE applied to the downstream tables]
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
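Sketched with illustrative paths, that recomputation can be as simple as clearing the derived table and restarting its stream with a fresh checkpoint, so the upstream table is replayed under the new logic:

  from delta.tables import DeltaTable

  # 1. Clear the downstream (Gold) table that was built with the old logic.
  DeltaTable.forPath(spark, "/delta/gold/daily_aggregates").delete()   # hypothetical path

  # 2. Restart the Silver -> Gold stream with a fresh checkpoint location so it
  #    replays the Silver table from the beginning under the new business logic.
  (spark.readStream
      .format("delta")
      .load("/delta/silver/events")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/_checkpoints/gold_v2")     # new checkpoint
      .start("/delta/gold/daily_aggregates"))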
How the Dream Becomes True
Demo Time
Connecting the dots...
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake (?) → AI & Reporting]

1. Ability to read consistent data while data is being written
   → Snapshot isolation between writers and readers
2. Ability to read incrementally from a large table with good throughput
   → Optimized file source with scalable metadata handling
3. Ability to roll back in case of bad writes
   → Time travel
4. Ability to replay historical data along with newly arrived data
   → Stream the backfilled historical data through the same pipeline
5. Ability to handle late-arriving data without having to delay downstream processing
   → Stream any late-arriving data added to the table as it gets added
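As a minimal sketch of requirement 3, assuming an illustrative path and version number: time travel lets you read an older snapshot of the table and, if needed, overwrite the table with it to undo a bad write.

  # Read the table as of an earlier version (timestampAsOf works similarly).
  good_snapshot = (spark.read
      .format("delta")
      .option("versionAsOf", 10)            # illustrative version number
      .load("/delta/silver/events"))        # illustrative path

  # Roll back the bad write by overwriting the table with the earlier snapshot.
  (good_snapshot.write
      .format("delta")
      .mode("overwrite")
      .save("/delta/silver/events"))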
How do I use Delta Lake?
Get Started with Delta using Spark APIs

Instead of parquet...

  dataframe
    .write
    .format("parquet")
    .save("/data")

… simply say delta:

  dataframe
    .write
    .format("delta")
    .save("/data")

Add Spark Package:

  pyspark --packages io.delta:delta-core_2.12:0.1.0
  bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0

or add the same io.delta:delta-core coordinates as a Maven dependency.
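Reading back is symmetric; for completeness, a small sketch using the same illustrative /data path:

  # Batch read of the Delta table.
  df = spark.read.format("delta").load("/data")

  # The same table can also be consumed as a stream (this is what moves data
  # through the Bronze -> Silver -> Gold tables above).
  stream_df = spark.readStream.format("delta").load("/data")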
Build your own Delta Lake at https://delta.io
Join the Community
Notebook from Today
Try the notebook from
Databricks Community
Edition!
Download the notebook at
https://dbricks.co/dlw-01
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT