Not Your Father’s Database:
How to Use Apache® Spark™ Properly
in Your Big Data Architecture
About Me
• 2005 Mobile Web & Voice Search
• 2012 Reporting & Analytics
• 2014 Solutions Engineering
Is this your Spark infrastructure?
This system talks like a SQL database…
But the performance is very different…
(Diagram label: HDFS)
Just in Time Data Warehouse w/ Spark
(Diagram labels: HDFS, and more…)
Separate Compute vs. Storage
Benefits:
• No need to import your data into Spark to begin processing.
• Dynamically scale Spark clusters to match compute vs. storage needs.
• Choose the best data storage with different performance characteristics for your use case.
Today’s Goal
Know when to use other data stores besides file systems.
Use Case: Data Warehousing
Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
Types of workloads:
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads
Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
Bad: Random Access
sqlContext.sql("select * from my_large_table where id=2134823")
Will this command run in Spark?
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
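To make the indexing advice concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for "a traditional SQL database". It is not Spark code. The table name comes from the slide; the column layout, row count, index name, and the `lookup_plan` helper are assumptions made for illustration. It asks SQLite for its query plan for a single-key lookup before and after an index is created on the key column.

```python
import sqlite3


def lookup_plan(conn, row_id):
    """Return SQLite's plan description for a single-key lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM my_large_table WHERE id = ?",
        (row_id,),
    ).fetchall()
    # The human-readable plan text is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical schema: a key column plus one payload column.
conn.execute("CREATE TABLE my_large_table (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO my_large_table VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(10_000)),
)

plan_without_index = lookup_plan(conn, 2134)
conn.execute("CREATE INDEX idx_my_large_table_id ON my_large_table (id)")
plan_with_index = lookup_plan(conn, 2134)

print("without index:", plan_without_index)  # should describe a full table scan
print("with index:   ", plan_with_index)     # should name the index
```

The exact plan wording varies between SQLite versions, so the sketch prints it rather than relying on a fixed string.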
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
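The idea behind compaction (Option 2) can be sketched in plain Python on local files. This is only an illustration of the small-files problem, not Spark's API: the `part-*.csv` naming, the `compact` helper, and the temporary directory are all assumptions. In Spark itself, one common approach is to read the table and write it back with fewer partitions, for example with `coalesce`.

```python
import tempfile
from pathlib import Path


def compact(table_dir: Path, output_name: str = "part-compacted.csv") -> Path:
    """Merge every small part file in table_dir into a single file."""
    parts = sorted(table_dir.glob("part-*.csv"))
    tmp = table_dir / "_compacting.tmp"  # name chosen so the glob above skips it
    with tmp.open("w") as out:
        for part in parts:
            out.write(part.read_text())
    # Remove the small files only after the merged copy is fully written.
    for part in parts:
        part.unlink()
    compacted = table_dir / output_name
    tmp.rename(compacted)
    return compacted


with tempfile.TemporaryDirectory() as d:
    table_dir = Path(d)
    # Simulate 100 inserts, each of which created its own tiny file.
    for i in range(100):
        (table_dir / f"part-{i:05d}.csv").write_text(f"{i},value-{i}\n")
    files_before = len(list(table_dir.glob("part-*.csv")))
    compacted = compact(table_dir)
    files_after = len(list(table_dir.glob("part-*.csv")))
    rows_after = len(compacted.read_text().splitlines())

print(files_before, files_after, rows_after)
```

A query now opens one file instead of a hundred, and the row count is unchanged.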
Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you like.
File storage is cheap: it is not an “anti-pattern” to store duplicate copies of your data.
Bad: Frequent/Incremental Updates
Update statements are not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
Use Case: Up-to-date, live views of your SQL tables.
(Diagram: Database Snapshot + Incremental SQL Query)
Tip: Use ClusterBy for fast joins or Bucketing with Spark 2.0.
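One reading of the "snapshot + incremental query" diagram is that a live view is built by overlaying recently changed rows on an older full snapshot. The sketch below shows that merge in plain Python, not Spark. The row layout, the `updated_at` field, and the `live_view` helper are assumptions made for illustration.

```python
def live_view(snapshot, incremental):
    """Overlay newer rows on a snapshot, keyed by id; the newest updated_at wins."""
    view = {row["id"]: row for row in snapshot}
    for row in incremental:
        current = view.get(row["id"])
        if current is None or row["updated_at"] >= current["updated_at"]:
            view[row["id"]] = row
    return sorted(view.values(), key=lambda r: r["id"])


# Full snapshot taken from the database at time 100 (hypothetical data).
snapshot = [
    {"id": 1, "name": "alice", "updated_at": 100},
    {"id": 2, "name": "bob", "updated_at": 100},
]
# Rows returned by an incremental query for changes since the snapshot.
incremental = [
    {"id": 2, "name": "robert", "updated_at": 150},
    {"id": 3, "name": "carol", "updated_at": 160},
]

view = live_view(snapshot, incremental)
print([r["name"] for r in view])  # ['alice', 'robert', 'carol']
```

The snapshot files are never modified. Only the small incremental set changes between refreshes, which sidesteps the lack of in-place updates.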
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
Bad: External Reporting w/ load
Too many concurrent requests will start to queue up.
Solution: Write out to a DB as a cache to handle load.
(Diagram labels: HDFS, DB)
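The "write out to a DB as a cache" pattern can be sketched as follows, again with sqlite3 standing in for the serving database and a plain Python function standing in for the Spark batch job. The table name `report_totals`, the sample events, and all function names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw data that the batch job aggregates.
raw_events = [("us", 3), ("de", 5), ("us", 7), ("fr", 1), ("de", 2)]


def batch_job(events):
    """Stand-in for the expensive job: total amount per country."""
    totals = Counter()
    for country, amount in events:
        totals[country] += amount
    return totals


def publish(conn, totals):
    """Replace the cached report table with freshly computed results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS report_totals "
        "(country TEXT PRIMARY KEY, total INTEGER)"
    )
    with conn:  # commit the delete and the inserts together
        conn.execute("DELETE FROM report_totals")
        conn.executemany("INSERT INTO report_totals VALUES (?, ?)", totals.items())


def serve_report(conn, country):
    """Cheap keyed lookup that external report requests can hit directly."""
    row = conn.execute(
        "SELECT total FROM report_totals WHERE country = ?", (country,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
publish(conn, batch_job(raw_events))
print(serve_report(conn, "us"))  # 10
```

The heavy computation runs once per refresh, while each of the many concurrent report requests becomes an indexed lookup in the database.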
Use Case: Advanced Analytics and Data Science
Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• All-in-one solution: data cleansing, featurization, training, testing, serving, etc.
Bad: Searching Content w/ load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results.
Use Case: Streaming and Realtime Analytics
Good: Periodic Scheduled Jobs
Schedule your workloads to run on a regular basis:
• Launch a dedicated cluster for important workloads.
• Output your results as reports or store them to files or a database.
• Poor Man’s Streaming: Spark is fast, so you can shorten the interval between runs.
Bad: Low Latency Stream Processing
Spark Streaming can detect new files dropped into a folder to process, but there is a delay to build up a whole file’s worth of data.
Solution: Send data to message queues, not files.
Thank you
Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
