WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Simplify and Scale Data Engineering Pipelines with Delta Lake
Amanda Moran, Databricks
#UnifiedDataAnalytics #SparkAISummit
Today’s Speaker
● Solutions Architect @ Databricks
● MS Computer Science, BS Biology
● Previously: HP, Teradata, DataStax, Esgyn
● PMC member and Apache Committer on Apache Trafodion
● 5 different distributed systems
● Course with Udacity on Data Engineering
Agenda
● Data Engineers’ Nightmares and Dreams
● The Data Lifecycle vs. the Delta Lifecycle
● Transitioning a Data Pipeline to Delta
● How Dreams Become True
● DEMO!
● How to Use Delta
The Data Engineer’s Journey…
[Diagram: Events → Stream → Table (data gets written continuously) → Stream / Batch → AI & Reporting]
The Data Engineer’s Journey…
[Diagram: Events → Stream → Table (data gets written continuously) → Batch → Table (data gets compacted every hour) → Batch / Stream → AI & Reporting, with Reprocessing and Update & Merge paths added]
The Data Engineer’s Journey… into a Nightmare
[Diagram: same pipeline with Validation and a Unified View added; callout: “Updates & Merge get complex with a data lake”]
The Data Engineer’s Journey… into a Nightmare
[Diagram: same as above]
Can this be simplified?
A Data Engineer’s Dream...
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake → AI & Reporting]
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming.
What’s missing?
1. Ability to read consistent data while data is being written
2. Ability to read incrementally from a large table with good throughput
3. Ability to roll back in case of bad writes
4. Ability to replay historical data along with newly arrived data
5. Ability to handle late-arriving data without having to delay downstream processing
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake (?) → AI & Reporting]
So… What is the answer?
Structured Streaming + Delta Lake = The Delta Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
Let’s try it instead with Delta Lake
The Delta Architecture
[Diagram: CSV, JSON, TXT… and Kinesis → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics and AI & Reporting; data quality increases from Bronze to Gold]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
What does this remind you of?
The Data Lifecycle of the Past
[Diagram: the same Bronze / Silver / Gold flow built on a data lake: CSV, JSON, TXT… and Kinesis → Raw Ingestion → Filtered, Cleaned, Augmented → Business-level Aggregates → Streaming Analytics and AI & Reporting]
The Data Lifecycle of the Past
[Diagram: same as above, with the intermediate tables stored in a data lake]
The Data Lifecycle
[Diagram: same pipeline, with Apache Spark doing the processing between the data lake stages]
The Data Lifecycle
[Diagram: same pipeline, with Apache Spark for processing and a DW/OLAP system serving the reporting layer]
Transitioning from the Data Lifecycle
to the Delta Lake Lifecycle
The Delta Lake Lifecycle
[Diagram: CSV, JSON, TXT… and Kinesis → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics and AI & Reporting]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
The Delta Lake Lifecycle
[Diagram: same Bronze / Silver / Gold flow; this slide describes the Bronze layer]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
The Delta Lake Lifecycle
[Diagram: same flow; this slide describes the Silver layer]
Intermediate data with some cleanup applied.
Queryable for easy debugging!
The Delta Lake Lifecycle
[Diagram: same flow; this slide describes the Gold layer]
Clean data, ready for consumption.
Read with Spark or Presto*
The Delta Lake Lifecycle
[Diagram: same flow; the arrows between layers are streams]
Streams move data through the Delta Lake:
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
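For concreteness, a minimal sketch of one such stream in PySpark, assuming hypothetical Bronze/Silver paths, a hypothetical checkpoint location, and an illustrative cleanup filter; trigger(once=True) gives the manually triggered mode, and dropping the trigger runs it continuously at low latency:

  from pyspark.sql import functions as F

  # Stream data from the Bronze table into the Silver table.
  # Paths, checkpoint location, and the filter column are illustrative.
  (spark.readStream
      .format("delta")
      .load("/delta/bronze/events")                         # hypothetical Bronze path
      .where(F.col("event_type").isNotNull())               # hypothetical cleanup rule
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/_checkpoints/silver_events")
      .trigger(once=True)                                   # manual trigger; omit for low latency
      .start("/delta/silver/events"))                       # hypothetical Silver path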
The Delta Lake Lifecycle
[Diagram: same flow]
Delta Lake also supports batch jobs and standard DML:
• INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Upserts
• GDPR, CCPA
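A rough sketch of what an upsert and a GDPR/CCPA-style delete can look like with the DeltaTable Python API (available in later Delta Lake releases than the 0.1.0 package shown at the end of this deck); the table path, join key, incoming DataFrame, and predicate are illustrative:

  from delta.tables import DeltaTable

  events = DeltaTable.forPath(spark, "/delta/silver/events")   # hypothetical table path

  # Upsert: merge a DataFrame of new/changed rows on a hypothetical key column.
  # updates_df is an existing DataFrame of incoming rows (assumed).
  (events.alias("t")
      .merge(updates_df.alias("u"), "t.event_id = u.event_id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute())

  # GDPR / CCPA: delete a single user's rows on request (predicate is illustrative).
  events.delete("user_id = '1234'")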
The Delta Lake Lifecycle
[Diagram: same flow, with DELETE applied to the downstream tables]
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
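Sketched with illustrative paths, that recomputation can be as simple as clearing the derived table and restarting its stream with a fresh checkpoint, so the upstream table is replayed under the new logic:

  from delta.tables import DeltaTable

  # 1. Clear the downstream (Gold) table that was built with the old logic.
  DeltaTable.forPath(spark, "/delta/gold/daily_aggregates").delete()   # hypothetical path

  # 2. Restart the Silver -> Gold stream with a fresh checkpoint location so it
  #    replays the Silver table from the beginning under the new business logic.
  (spark.readStream
      .format("delta")
      .load("/delta/silver/events")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/_checkpoints/gold_v2")     # new checkpoint
      .start("/delta/gold/daily_aggregates"))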
How the Dream Becomes True
Demo Time
Connecting the dots...
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake (?) → AI & Reporting]

1. Ability to read consistent data while data is being written
   → Snapshot isolation between writers and readers
2. Ability to read incrementally from a large table with good throughput
   → Optimized file source with scalable metadata handling
3. Ability to roll back in case of bad writes
   → Time travel
4. Ability to replay historical data along with newly arrived data
   → Stream the backfilled historical data through the same pipeline
5. Ability to handle late-arriving data without having to delay downstream processing
   → Stream any late-arriving data added to the table as it gets added
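As a minimal sketch of requirement 3, assuming an illustrative path and version number: time travel lets you read an older snapshot of the table and, if needed, overwrite the table with it to undo a bad write.

  # Read the table as of an earlier version (timestampAsOf works similarly).
  good_snapshot = (spark.read
      .format("delta")
      .option("versionAsOf", 10)            # illustrative version number
      .load("/delta/silver/events"))        # illustrative path

  # Roll back the bad write by overwriting the table with the earlier snapshot.
  (good_snapshot.write
      .format("delta")
      .mode("overwrite")
      .save("/delta/silver/events"))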
How do I use Delta Lake?
Get Started with Delta using Spark APIs

Instead of parquet...

  dataframe
    .write
    .format("parquet")
    .save("/data")

… simply say delta:

  dataframe
    .write
    .format("delta")
    .save("/data")

Add Spark Package:

  pyspark --packages io.delta:delta-core_2.12:0.1.0
  bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0

or add the same io.delta:delta-core coordinates as a Maven dependency.
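Reading back is symmetric; for completeness, a small sketch using the same illustrative /data path:

  # Batch read of the Delta table.
  df = spark.read.format("delta").load("/data")

  # The same table can also be consumed as a stream (this is what moves data
  # through the Bronze -> Silver -> Gold tables above).
  stream_df = spark.readStream.format("delta").load("/data")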
Build your own Delta Lake at https://delta.io
Join the Community
Notebook from Today
Try the notebook from
Databricks Community
Edition!
Download the notebook at
https://dbricks.co/dlw-01
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT