Large Scale Lakehouse
Implementation Using
Structured Streaming
Tomasz Magdanski
Sr Director – Data Platforms
Agenda
§ About Asurion
§ How did we get here
§ Scalable and cost-effective job execution
§ Lessons Learned
Asurion helps people protect, connect
and enjoy the latest tech – to make life a
little easier. Every day our team of
10,000 Experts helps nearly 300 million
people around the world solve the most
common and uncommon tech issues.
We’re just a call, tap, click or visit away
for everything from getting a same-day
replacement of your smartphone, to
helping you stream or connect with no
buffering, bumps or bewilderment.
We think you should stay connected and
get the most from the tech you love… no
matter the type of tech or where you
purchased it.
Scope of work
Ingestion
▪ 4000+ source tables
▪ 4000+ L1 tables
▪ 3500+ L2 tables
▪ Streams: Kafka, Kinesis, SNS, SQS
▪ APIs
▪ Flat files
▪ AWS, Azure and on-prem
Data Warehouse
• 300+ Data Warehouse tables
• 600+ Data Marts
Consumption
• 10,000+ Views
• 2,000+ Reports
Why Lakehouse?
Previous architecture (Lambda)
§ D-1 latency
§ Limited throughput
§ Hard to scale
§ Wide technology stack
Lakehouse
§ Single pipeline
§ Near-real-time latency
§ Scalable with Apache Spark
§ Integrated ecosystem
§ Narrow technology stack
Enhanced Data Flow
[Architecture diagram: Production Data in the AWS Prod Acct is read by Production Compute and by Pre-Prod Compute in the AWS Pre-Prod Acct]
Job Execution
[Diagram: Ingestion Job (Spark) instances covering the 1st table through the 4000th table]
• Spark Structured Streaming
• Unify the entry points (see sketch below)
• S3 -> read with Auto Loader
• Kafka -> read with Spark
• Use Databricks Jobs and Job Clusters
• Single code base in Scala
• CI/CD pipeline
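A minimal sketch of the unified entry point, assuming Databricks Auto Loader for S3 sources and the built-in Kafka source; readSource, its parameters and all option values are illustrative, not the production code.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: each table's config resolves to one of two streaming readers.
def readSource(spark: SparkSession, sourceType: String, location: String): DataFrame =
  sourceType match {
    case "s3" =>
      spark.readStream
        .format("cloudFiles")                     // Databricks Auto Loader
        .option("cloudFiles.format", "parquet")   // illustrative file format
        .load(location)                           // S3 path for this table
    case "kafka" =>
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092") // illustrative brokers
        .option("subscribe", location)                    // topic for this table
        .load()
  }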
Job Execution
Ingestion Job (Spark):
streamingDF.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    batchDF.write.format("delta").mode("append").save(...) // append to L1
    deltaTable.merge(batchDF, ...)...execute()              // merge to L2 (condition elided)
    batchDF.unpersist()
}
• Spark Structured Streaming
• All target tables are Delta
• Append table (L1) – SCD2
• Merge table (L2) – SCD1 (see merge sketch below)
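A minimal sketch of the L2 SCD1 merge, assuming the Delta Lake Scala API; the table path and the pk join column are hypothetical placeholders.

import io.delta.tables.DeltaTable

// SCD1 upsert into L2 from inside foreachBatch; path and key are illustrative.
val l2Table = DeltaTable.forPath(spark, "s3://bucket/l2/my_table")
l2Table.as("t")
  .merge(batchDF.as("s"), "t.pk = s.pk")
  .whenMatched().updateAll()     // overwrite the current version of the row
  .whenNotMatched().insertAll()  // insert rows seen for the first time
  .execute()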
Trigger choice
Many streaming jobs per cluster
• Up to 40 streams on a cluster
• Large clusters
• Huge compute waste for infrequently updated tables
One streaming job per cluster
▪ Databricks only allows 1000 jobs, and we have 4000 tables
▪ Best case scenario: 4000 jobs * 3 nodes = 12,000 nodes
Many trigger-once jobs per cluster (see sketch below)
• No continuous execution
• Hundreds of jobs per cluster
• Jobs can migrate to a new cluster between executions
• Configs are refreshed at each run
• ML can be used to balance jobs
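A minimal sketch of a trigger-once run, assuming a Delta sink; paths are illustrative. Each run drains whatever input has accumulated and then stops, which is what lets hundreds of these jobs share a job cluster.

import org.apache.spark.sql.streaming.Trigger

// Process all available input once, then stop until the next scheduled run.
streamingDF.writeStream
  .trigger(Trigger.Once())
  .option("checkpointLocation", "s3://bucket/checkpoints/my_table") // illustrative
  .format("delta")
  .start("s3://bucket/l1/my_table")                                 // illustrative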
[Diagram: many Ingestion Jobs packed onto a single job cluster]
Lessons Learned – Cloud Files
Cloud Files: S3 notification -> SNS -> SQS (see Auto Loader sketch below)
• S3 notification limit: 100 per bucket
• SQS and SNS resources are not tagged by default
• SNS hard limits:
• ListSubscriptions: 30 per second
• ListSubscriptionsByTopic: 30 per second
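A minimal sketch of Auto Loader in file-notification mode, the setup these limits apply to; the file format, region and path are illustrative.

// Auto Loader provisions the S3 notification -> SNS -> SQS chain itself.
val rawDF = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")            // illustrative
  .option("cloudFiles.useNotifications", "true")  // notification mode instead of directory listing
  .option("cloudFiles.region", "us-east-1")       // illustrative
  .load("s3://bucket/raw/my_table")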
Lessons Learned – Cloud Files
[Diagram: the Production Data bucket's S3 notification feeds SNS, which fans out to separate SQS queues consumed by Production Compute and Pre-Prod Compute]
Lessons Learned – CDC and DMS – timestamps
CDC: Change Data Capture
• When load and CDC overlap, an earlier version of a row may carry the latest timestamp
• Reset the DMS timestamp to 0
[Diagram: load files take hours while CDC files arrive within minutes]
Lessons Learned – CDC and DMS – transformations
• DMS data type conversions:
• SQL Server: TINYINT is converted to UINT
• Oracle: NUMERIC is converted to DECIMAL(38,10); set numberDataTypeScale=-2
Lessons Learned – CDC and DMS – other
• Load files can be large and cause skew in the DataFrame when read
• DMS files are NOT partitioned
• DMS files should be removed when a task is restarted:
• Set TargetTablePrepMode = DROP_AND_CREATE
• Some sources can have large transactions with many updates to the same row – bring the LSN into the DMS job for deterministic merging
• If a database table has no PKs but has unique constraints with nullable columns – replace null with the string "null" for deterministic merging
Lessons Learned – Kafka
• Spark reads from Kafka can be slow:
• If the topic doesn't have a large number of partitions, and
• the topic has a lot of data
• Set minPartitions and maxOffsetsPerTrigger to high values to speed up reading (see sketch below)
• Have L2 read from L1 instead of from the source:
• Actions take time in the scenario above; optimize L1 and use it as the source for the merge
• BatchID: add it to the data
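A minimal sketch of the Kafka read tuning, assuming the standard Spark Kafka source; brokers, topic and the numbers are illustrative.

// minPartitions fans the read out beyond the topic's partition count;
// maxOffsetsPerTrigger caps how much one micro-batch consumes.
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // illustrative
  .option("subscribe", "my-topic")                  // illustrative
  .option("minPartitions", "64")
  .option("maxOffsetsPerTrigger", "5000000")
  .load()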
Lessons Learned – Kafka
• Stream all the data to Kafka first:
• Bring data from SNS, SQS and Kinesis into Kafka using Kafka Connect
• The Spark reader for Kafka supports Trigger Once
Lessons Learned – Delta
• Optimize the table after the initial load (see sketch below)
• Use Optimized Writes after the initial load:
• delta.autoOptimize.optimizeWrite = true
• Move the merge and batch id columns to the front of the DataFrame
• If merge columns are incremental, use Z-Ordering
• Use partitions
• Use i3 instance types with IO caching
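A minimal sketch of the post-load maintenance on Databricks; the table name and Z-Order column are illustrative.

// Compact small files and co-locate rows on the merge key, then keep
// writes optimized going forward.
spark.sql("OPTIMIZE l2.my_table ZORDER BY (merge_key)")
spark.sql(
  "ALTER TABLE l2.my_table SET TBLPROPERTIES " +
  "('delta.autoOptimize.optimizeWrite' = 'true')")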
Lessons Learned – Delta
• Use S3 paths to register Delta tables in Hive
• Generate manifest files and enable auto updates (see sketch below):
• delta.compatibility.symlinkFormatManifest.enabled = true
• Spark and Presto views are not compatible at this time
• Extract Delta stats:
• Row count, last modified, table size
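A minimal sketch of the manifest setup for Presto/Athena readers; the S3 path is illustrative.

import io.delta.tables.DeltaTable

// Generate symlink manifests once, then let Delta keep them updated.
val table = DeltaTable.forPath(spark, "s3://bucket/l2/my_table")
table.generate("symlink_format_manifest")
spark.sql(
  "ALTER TABLE delta.`s3://bucket/l2/my_table` SET TBLPROPERTIES " +
  "('delta.compatibility.symlinkFormatManifest.enabled' = 'true')")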
Lessons Learned – Delta
• Streaming from a Delta table in append mode works well
• Streaming from a Delta table that is merged into is harder:
• Merging rewrites a lot of data
• Delta will stream out every rewritten file in full
• Use foreachBatch to filter the data down based on the batchID (see sketch below)
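A minimal sketch of that filter, assuming a batch_id column was added to the data upstream; the paths, column name and high-water-mark lookup are illustrative.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// ignoreChanges replays whole rewritten files, so keep only rows whose
// batch id is newer than the last one this consumer processed.
spark.readStream
  .format("delta")
  .option("ignoreChanges", "true")
  .load("s3://bucket/l2/my_table")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val lastProcessed = 41L // illustrative: read the high-water mark from persisted state
    batchDF.filter(col("batch_id") > lastProcessed)
      .write.format("delta").mode("append").save("s3://bucket/l3/my_table")
  }
  .option("checkpointLocation", "s3://bucket/checkpoints/l3/my_table")
  .start()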
Lessons Learned – SQL Analytics
• How are we using it:
• Collect metrics from APIs into a Delta table
• Only one metastore is allowed at this time
• No UDF support
• Learn to troubleshoot DAGs and Spark jobs
Q&A
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.