Large Scale Lakehouse
Implementation Using
Structured Streaming
Tomasz Magdanski
Sr Director – Data Platforms
Agenda
§ About Asurion
§ How did we get here
§ Scalable and cost-effective job execution
§ Lessons Learned
Asurion helps people protect, connect
and enjoy the latest tech – to make life a
little easier. Every day our team of
10,000 Experts helps nearly 300 million
people around the world solve the most
common and uncommon tech issues.
We’re just a call, tap, click or visit away
for everything from getting a same-day
replacement of your smartphone, to
helping you stream or connect with no
buffering, bumps or bewilderment.
We think you should stay connected and
get the most from the tech you love… no
matter the type of tech or where you
purchased it.
Scope of work
Ingestion
▪ 4000+ source tables
▪ 4000+ L1 tables
▪ 3500+ L2 tables
▪ Streams: Kafka, Kinesis, SNS, SQS
▪ APIs
▪ Flat files
▪ AWS, Azure and on-prem
Data Warehouse
• 300+ Data Warehouse tables
• 600+ Data Marts
Consumption
• 10,000+ Views
• 2,000+ Reports
Why Lakehouse?
Previous architecture (Lambda)
§ D-1 latency
§ Limited throughput
§ Hard to scale
§ Wide technology stack
Lakehouse
§ Single pipeline
§ Near-real-time latency
§ Scalable with Apache Spark
§ Integrated ecosystem
§ Narrow technology stack
Enhanced Data Flow
[Architecture diagram: Production Data in the AWS Prod Acct is read by Production Compute and by Pre-Prod Compute in the AWS Pre-Prod Acct]
Job Execution
[Diagram: Ingestion Job (Spark) instances covering the 1st table through the 4000th table]
• Spark Structured Streaming
• Unify the entry points (see sketch below)
• S3 -> read with Auto Loader
• Kafka -> read with Spark
• Use Databricks Jobs and Job Clusters
• Single code base in Scala
• CI/CD pipeline
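A minimal sketch of the unified entry point, assuming Databricks Auto Loader for S3 sources and the built-in Kafka source; readSource, its parameters and all option values are illustrative, not the production code.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: each table's config resolves to one of two streaming readers.
def readSource(spark: SparkSession, sourceType: String, location: String): DataFrame =
  sourceType match {
    case "s3" =>
      spark.readStream
        .format("cloudFiles")                     // Databricks Auto Loader
        .option("cloudFiles.format", "parquet")   // illustrative file format
        .load(location)                           // S3 path for this table
    case "kafka" =>
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092") // illustrative brokers
        .option("subscribe", location)                    // topic for this table
        .load()
  }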
Job Execution
Ingestion Job (Spark):
streamingDF.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    batchDF.write.format("delta").mode("append").save(...) // append to L1
    deltaTable.merge(batchDF, ...)...execute()              // merge to L2 (condition elided)
    batchDF.unpersist()
}
• Spark Structured Streaming
• All target tables are Delta
• Append table (L1) – SCD2
• Merge table (L2) – SCD1 (see merge sketch below)
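A minimal sketch of the L2 SCD1 merge, assuming the Delta Lake Scala API; the table path and the pk join column are hypothetical placeholders.

import io.delta.tables.DeltaTable

// SCD1 upsert into L2 from inside foreachBatch; path and key are illustrative.
val l2Table = DeltaTable.forPath(spark, "s3://bucket/l2/my_table")
l2Table.as("t")
  .merge(batchDF.as("s"), "t.pk = s.pk")
  .whenMatched().updateAll()     // overwrite the current version of the row
  .whenNotMatched().insertAll()  // insert rows seen for the first time
  .execute()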
Trigger choice
Many streaming jobs per cluster
• Up to 40 streams on a cluster
• Large clusters
• Huge compute waste for infrequently updated tables
One streaming job per cluster
▪ Databricks only allows 1000 jobs, and we have 4000 tables
▪ Best case scenario: 4000 jobs * 3 nodes = 12,000 nodes
Many trigger-once jobs per cluster (see sketch below)
• No continuous execution
• Hundreds of jobs per cluster
• Jobs can migrate to a new cluster between executions
• Configs are refreshed at each run
• ML can be used to balance jobs
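A minimal sketch of a trigger-once run, assuming a Delta sink; paths are illustrative. Each run drains whatever input has accumulated and then stops, which is what lets hundreds of these jobs share a job cluster.

import org.apache.spark.sql.streaming.Trigger

// Process all available input once, then stop until the next scheduled run.
streamingDF.writeStream
  .trigger(Trigger.Once())
  .option("checkpointLocation", "s3://bucket/checkpoints/my_table") // illustrative
  .format("delta")
  .start("s3://bucket/l1/my_table")                                 // illustrative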
[Diagram: many Ingestion Jobs packed onto a single job cluster]
Lessons Learned – Cloud Files
Cloud Files: S3 notification -> SNS -> SQS (see Auto Loader sketch below)
• S3 notification limit: 100 per bucket
• SQS and SNS resources are not tagged by default
• SNS hard limits:
• ListSubscriptions: 30 per second
• ListSubscriptionsByTopic: 30 per second
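A minimal sketch of Auto Loader in file-notification mode, the setup these limits apply to; the file format, region and path are illustrative.

// Auto Loader provisions the S3 notification -> SNS -> SQS chain itself.
val rawDF = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")            // illustrative
  .option("cloudFiles.useNotifications", "true")  // notification mode instead of directory listing
  .option("cloudFiles.region", "us-east-1")       // illustrative
  .load("s3://bucket/raw/my_table")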
Lessons Learned – Cloud Files
[Diagram: the Production Data bucket's S3 notification feeds SNS, which fans out to separate SQS queues consumed by Production Compute and Pre-Prod Compute]
Lessons Learned – CDC and DMS – timestamps
CDC: Change Data Capture
• When load and CDC overlap, an earlier version of a row may carry the latest timestamp
• Reset the DMS timestamp to 0
[Diagram: load files take hours while CDC files arrive within minutes]
Lessons Learned – CDC and DMS – transformations
• DMS data type conversions:
• SQL Server: TINYINT is converted to UINT
• Oracle: NUMERIC is converted to DECIMAL(38,10); set numberDataTypeScale=-2
Lessons Learned – CDC and DMS – other
• Load files can be large and cause skew in the DataFrame when read
• DMS files are NOT partitioned
• DMS files should be removed when a task is restarted:
• Set TargetTablePrepMode = DROP_AND_CREATE
• Some sources can have large transactions with many updates to the same row – bring the LSN into the DMS job for deterministic merging
• If a database table has no PKs but has unique constraints with nullable columns – replace null with the string "null" for deterministic merging
Lessons Learned – Kafka
• Spark reads from Kafka can be slow:
• If the topic doesn't have a large number of partitions, and
• the topic has a lot of data
• Set minPartitions and maxOffsetsPerTrigger to high values to speed up reading (see sketch below)
• Have L2 read from L1 instead of from the source:
• Actions take time in the scenario above; optimize L1 and use it as the source for the merge
• BatchID: add it to the data
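A minimal sketch of the Kafka read tuning, assuming the standard Spark Kafka source; brokers, topic and the numbers are illustrative.

// minPartitions fans the read out beyond the topic's partition count;
// maxOffsetsPerTrigger caps how much one micro-batch consumes.
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // illustrative
  .option("subscribe", "my-topic")                  // illustrative
  .option("minPartitions", "64")
  .option("maxOffsetsPerTrigger", "5000000")
  .load()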
Lessons Learned – Kafka
• Stream all the data to Kafka first:
• Bring data from SNS, SQS and Kinesis into Kafka using Kafka Connect
• The Spark reader for Kafka supports Trigger Once
Lessons Learned – Delta
• Optimize the table after the initial load (see sketch below)
• Use Optimized Writes after the initial load:
• delta.autoOptimize.optimizeWrite = true
• Move the merge and batch id columns to the front of the DataFrame
• If merge columns are incremental, use Z-Ordering
• Use partitions
• Use i3 instance types with IO caching
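A minimal sketch of the post-load maintenance on Databricks; the table name and Z-Order column are illustrative.

// Compact small files and co-locate rows on the merge key, then keep
// writes optimized going forward.
spark.sql("OPTIMIZE l2.my_table ZORDER BY (merge_key)")
spark.sql(
  "ALTER TABLE l2.my_table SET TBLPROPERTIES " +
  "('delta.autoOptimize.optimizeWrite' = 'true')")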
Lessons Learned – Delta
• Use S3 paths to register Delta tables in Hive
• Generate manifest files and enable auto updates (see sketch below):
• delta.compatibility.symlinkFormatManifest.enabled = true
• Spark and Presto views are not compatible at this time
• Extract Delta stats:
• Row count, last modified, table size
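A minimal sketch of the manifest setup for Presto/Athena readers; the S3 path is illustrative.

import io.delta.tables.DeltaTable

// Generate symlink manifests once, then let Delta keep them updated.
val table = DeltaTable.forPath(spark, "s3://bucket/l2/my_table")
table.generate("symlink_format_manifest")
spark.sql(
  "ALTER TABLE delta.`s3://bucket/l2/my_table` SET TBLPROPERTIES " +
  "('delta.compatibility.symlinkFormatManifest.enabled' = 'true')")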
Lessons Learned – Delta
• Streaming from a Delta table in append mode works well
• Streaming from a Delta table that is merged into is harder:
• Merging rewrites a lot of data
• Delta will stream out every rewritten file in full
• Use foreachBatch to filter the data down based on the batchID (see sketch below)
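A minimal sketch of that filter, assuming a batch_id column was added to the data upstream; the paths, column name and high-water-mark lookup are illustrative.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// ignoreChanges replays whole rewritten files, so keep only rows whose
// batch id is newer than the last one this consumer processed.
spark.readStream
  .format("delta")
  .option("ignoreChanges", "true")
  .load("s3://bucket/l2/my_table")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val lastProcessed = 41L // illustrative: read the high-water mark from persisted state
    batchDF.filter(col("batch_id") > lastProcessed)
      .write.format("delta").mode("append").save("s3://bucket/l3/my_table")
  }
  .option("checkpointLocation", "s3://bucket/checkpoints/l3/my_table")
  .start()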
Lessons Learned – SQL Analytics
• How are we using it:
• Collect metrics from APIs into a Delta table
• Only one metastore is allowed at this time
• No UDF support
• Learn to troubleshoot DAGs and Spark jobs
Q&A
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.