Massive Data Processing in Adobe Using Delta Lake
Yeshwanth Vijayakumar
Sr. Engineering Manager/Architect @ Adobe
Agenda
§ Introduction
§ What are we storing?
§ Data Representation and Nested Schema Evolution
§ Writer Worries and How to Wipe Them Away
§ Staging Tables FTW
§ Datalake Replication Lag Tracking
§ Performance Time!
Unified Profile Data Ingestion
Sources such as Adobe Campaign, AEM, Adobe Analytics, and Adobe AdCloud feed Experience Data Model (XDM) data into the Unified Profile. The profile store spans single-tenant and multi-tenant deployments and drives Change Feed Streaming, Stats Generation, and Linking Identities.
Data Layout At a Glance
An idea of how the graph linkages are stored:

| primaryId | relatedIds | field1 | field2 | field1000 |
|-----------|------------|--------|--------|-----------|
| 123       | 123        | a      | b      | c         |
| 456       | 456        | d      | e      | f         |
| 123       | 123        | d      | e      | l         |
| 789       | 789,101    | x      | y      | z         |
| 101       | 789,101    | x      | u      | p         |

Conditions
• primaryId does not change
• relatedIds can change
New Record Comes In
A new record linking 103 with 789 and 101 indicates a new linkage, causing a change in graph membership:

| primaryId | relatedIds  | field1 | field2 | field1000 |
|-----------|-------------|--------|--------|-----------|
| 103       | 103,789,101 | q      | w      | r         |

This causes a cascading change in the rows of 789 and 101:

| primaryId | relatedIds  | field1 | field2 | field1000 |
|-----------|-------------|--------|--------|-----------|
| 103       | 103,789,101 | q      | w      | r         |
| 789       | 103,789,101 | x      | y      | z         |
| 101       | 103,789,101 | x      | y      | z         |
Main Access Pattern
Query 1, Query 2, Query 3, …, Query 1000: multiple queries over 1 consolidated row.
Complexities?
• Nested Fields
  • a.b.c.d[*].e — nested hairiness! (see the access sketch below)
  • Arrays!
  • MapType
• Every Tenant has a different Schema!
  • Schema evolves constantly
  • Fields can get deleted or updated
• Multiple Sources
  • Streaming
  • Batch
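To make the a.b.c.d[*].e hairiness concrete, here is a minimal PySpark sketch (the field names and record are hypothetical) that pulls e out of every element of the nested array with Spark SQL's transform:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical record shaped like a.b.c.d[*].e
df = spark.createDataFrame(
    [({"b": {"c": {"d": [{"e": 1}, {"e": 2}]}}},)],
    "a struct<b:struct<c:struct<d:array<struct<e:int>>>>>",
)

# transform() maps over the nested array without exploding rows;
# any drift in the a/b/c/d schema breaks this select at analysis time.
df.select(F.expr("transform(a.b.c.d, x -> x.e)").alias("es")).show()  # -> [1, 2]
```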
Scale?
• Tenants can have 10+ billion rows
• PBs of data
• Million RPS at peak across the system
• Triggers multiple downstream applications
  • Segmentation
  • Activation
What is Delta Lake?
From delta.io: Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.

Key Features
• ACID Transactions
• Time Travel (data versioning)
• Uses Parquet underneath
• Schema Enforcement and Schema Evolution
• Audit History
• Updates and Deletes Support
Delta Lake in Practice
• UPSERT (MERGE) — sketched below
• SQL compatible
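As a minimal sketch of the UPSERT path (assuming a SparkSession `spark` with Delta enabled, an incoming DataFrame `updates_df`, and an illustrative table path — not the production setup), the Delta Lake Python API exposes MERGE like this:

```python
from delta.tables import DeltaTable  # delta-spark package

raw = DeltaTable.forPath(spark, "/delta/raw_records")  # assumed path

(raw.alias("t")
    .merge(updates_df.alias("s"), "t.primaryId = s.primaryId")
    .whenMatchedUpdateAll()      # overwrite matched rows with incoming values
    .whenNotMatchedInsertAll()   # insert brand-new primaryIds
    .execute())
```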
Writer Worries and How to Wipe Them Away
• Concurrency conflicts
• Column size
  • When individual column data exceeds 2 GB, we see degradation in writes or OOMs
• Update frequency
  • Too-frequent updates cause underlying filestore metadata issues
  • This is because every transaction on an individual Parquet file causes copy-on-write (CoW)
  • More updates => more rewrites on HDFS
• Too many small files!!! (see the compaction sketch below)
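One common mitigation for the small-files problem is to compact the table by rewriting it with fewer, larger files; a hedged sketch under assumed paths and partition counts, not the deck's exact job:

```python
# Compaction sketch: rewrite the Delta table with fewer, larger files.
df = spark.read.format("delta").load("/delta/raw_records")  # illustrative path

(df.repartition(200)                  # pick a count targeting ~128 MB-1 GB files
   .write.format("delta")
   .mode("overwrite")
   .option("dataChange", "false")     # flags this commit as compaction-only
   .save("/delta/raw_records"))
```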
CDC (Existing)
Batch ingestion, streaming ingestion, and API-based ingest all go through mutation apps into CosmosDB:
1. Send request to Cosmos
2. Ack
3. Emit CDC
The CDC stream is consumed by Stats, Edge, etc.
Dataflow with DeltaLake
primary
Id
relatedId
field
1
field2 field1000
103 103,789,101 q w r
789 103,789,101 x y z
101 103,789,101 x y z
Cosmos
DB
primaryId relatedId field1 field1000
103 103,789,101 q r
primaryId relatedId jsonString
103 103,789,101 <jsonStr>
789 103,789,101
<jsonStr>
101 103,789,101 <jsonStr>
Staging Table
Change Feed CDC
Raw Table (per tenant)
Check for Work every
X minutes
UPSERT/DELETE into
Raw Table
Fetch
Records
to process
APPEND only!
CDC
Dumper
Backfill
Long Running
Streaming
Application
Processor
Partitioned by tenant and 15 min time intervals
TenantLock in Redis
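The deck does not show the CDC Dumper's code; as a hedged sketch of its append-only leg (the broker, topic, and column names are assumptions):

```python
from pyspark.sql import functions as F

cdc = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
       .option("subscribe", "profile-cdc")                # assumed topic
       .load())

staged = (cdc.select(F.col("value").cast("string").alias("jsonString"),
                     "timestamp")
          .withColumn("tenantId", F.get_json_object("jsonString", "$.tenantId"))
          .withColumn("tsBucket", F.window("timestamp", "15 minutes").start))

(staged.writeStream
       .format("delta")
       .partitionBy("tenantId", "tsBucket")   # tenant + 15-minute intervals
       .option("checkpointLocation", "/chk/cdc_dumper")
       .outputMode("append")                  # APPEND only!
       .start("/delta/staging"))
```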
Staging Tables FTW
Fan-in pattern vs fan-out
• Multiple-source-writers issue solved
  • By centralizing all reads from the CDC, since ALL writes generate a CDC event
• Staging Table in APPEND ONLY mode
  • No conflicts while writing to it
• Filter out bad data above thresholds before it makes it to the Raw Table
• Batch writes by reading larger blocks of data from the Staging Table
  • Since it acts as a time-aware message buffer (see the processor sketch below)
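A minimal sketch of the Processor's read-then-merge step (the tenant, window literal, and paths are illustrative assumptions):

```python
from delta.tables import DeltaTable

# Fetch one tenant's staged block for a single 15-minute window...
block = (spark.read.format("delta").load("/delta/staging")
         .where("tenantId = 'tenantA' AND "
                "tsBucket = timestamp'2021-05-26 10:00:00'"))

# ...then fan it into the per-tenant Raw Table with one batched MERGE.
(DeltaTable.forPath(spark, "/delta/raw_records/tenantA").alias("t")
    .merge(block.alias("s"), "t.primaryId = s.primaryId")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```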
Staging Table Logical View
(Diagram of the staging table's logical view and its ProgressMap.)
Why choose JSON String format?
§ We are taking a lazy schema-on-read approach.
  ▪ Yes, this is an anti-pattern.
§ Nested schema evolution was not supported on update in Delta in 2020.
  ▪ It is supported in the latest version.
§ We want to apply conflict resolution before upsert-ing.
  ▪ E.g. resolveAndMerge(newData, oldData) — see the sketch after this list.
  ▪ UDFs are strict on types; with the plethora of different schemas, it is impractical to manage a UDF per org in a multi-tenant fashion.
  ▪ Now we just have simple JSON merge UDFs.
  ▪ We use json-iter, which is very efficient at loading partial bits of JSON and manipulating them.
§ Don't you lose predicate pushdown?
  ▪ We have pulled out all the main push-down filters into individual columns.
  ▪ E.g. timestamp, recordType, id, etc.
  ▪ Profile workloads are mainly scan-based, since we can run 1000s of queries at a single time.
  ▪ Reading the whole JSON string from the datalake is much faster and cheaper than reading even 20% of all fields from Cosmos.
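To make the resolveAndMerge idea concrete, here is a hedged Python sketch of a JSON merge UDF (the real pipeline uses json-iter on the JVM; the merge policy below, newest value wins per top-level key, is an assumption):

```python
import json
from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.StringType())
def resolve_and_merge(new_json, old_json):
    """Merge two JSON payloads; assumed policy: newest value wins per key."""
    old = json.loads(old_json) if old_json else {}
    new = json.loads(new_json) if new_json else {}
    old.update(new)
    return json.dumps(old)

# Applied to hypothetical columns holding the incoming and stored payloads.
merged = df.withColumn("jsonString", resolve_and_merge("newJson", "oldJson"))
```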
Schema on Read is a More Future-Safe Approach for Raw Data
§ Wrangling Spark structs is not user friendly.
§ JSON schema is messy.
  ▪ Crazy nesting
  ▪ Add maps to the equation, and just the schema will be in MBs
§ Schema on read using json-iter means we can read what we need on a row-by-row basis (see the scan sketch below).
§ Materialized views WILL have structs!
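Because the push-down filters live in plain columns, a scan only pays the JSON-parsing cost for rows that survive pruning; a hedged sketch with assumed column names and paths:

```python
from pyspark.sql import functions as F

# timestamp/recordType/id are real columns, so file pruning happens
# before the JSON string is ever parsed.
hits = (spark.read.format("delta").load("/delta/raw_records/tenantA")
        .where("recordType = 'profile' AND timestamp >= '2021-05-01'")
        .select("primaryId",
                F.get_json_object("jsonString", "$.a.b.c").alias("c")))
```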
Partition Scheme of Raw Records
• The RawRecords Delta Table is partitioned by:
  • recordType
  • sourceId
  • timestamp (key-value records use a DEFAULT value)
• Z-order on primaryId (sketched below)
Z-ordering colocates column information in the same set of files using locality-preserving space-filling curves.
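A hedged sketch of applying that layout (OPTIMIZE ... ZORDER BY is available on Databricks and in newer open-source Delta releases; the path is illustrative):

```python
spark.sql("""
    OPTIMIZE delta.`/delta/raw_records/tenantA`
    ZORDER BY (primaryId)
""")
```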
Replication Lag – 2 Types
• CDC lag from Kafka
  • Tells us how much more work we need to do to catch up on writes to the Staging Table
• How we track lag on a per-tenant basis (sketched below)
  • We track Max(TimeStamp) in the CDC per org
  • We track Max(TSKEY) processed in the Processor
  • The difference gives us a rough replication lag
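A hedged sketch of that per-tenant lag computation (the progress table and its column names are assumptions):

```python
from pyspark.sql import functions as F

cdc_max = (spark.read.format("delta").load("/delta/staging")
           .groupBy("tenantId").agg(F.max("timestamp").alias("cdcMax")))
proc_max = (spark.read.format("delta").load("/delta/progress")  # assumed table
            .groupBy("tenantId").agg(F.max("tskey").alias("procMax")))

lag = (cdc_max.join(proc_max, "tenantId")
       .withColumn("lagSeconds",
                   F.col("cdcMax").cast("long") - F.col("procMax").cast("long")))
```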
Merge/UPSERT Performance
Live traffic use case: how long does it take X CDC messages to get upserted into the Raw Table?

| Action: UPSERT CDC stage into fragment                        | Time Taken |
|----------------------------------------------------------------|------------|
| 170K CDC records – maps to 100K rows in Raw Table              | 15 seconds |
| 1.7 million CDC records – maps to 1 million rows in Raw Table  | 61 seconds |
Job Performance Time!

|                      | Hot Store (NoSQL Store) | Delta Lake |
|----------------------|-------------------------|------------|
| Size of Data         | 1 TB                    | 64 GB      |
| Number of Partitions | 80                      | 189        |
| Job Cores Used       | 112                     | 112        |
| Job Runtime          | 3 hours                 | 25 mins    |
Takeaways
• Scan IO speed from the datalake >>> reads from the Hot Store
• Reasonably fast, eventually consistent replication within minutes
• More partitions mean better Spark executor core utilization
• Potential to aggressively TTL data in the hot store
• More downstream materialization!!!
• Incremental computation framework, thanks to staging tables!
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.