Not Your Father’s Database: How to Use Apache® Spark™ Properly in Your Big Data Architecture

About Me
2005 Mobile Web & Voice Search
2012 Reporting & Analytics
2014 Solutions Engineering

Is this your Spark infrastructure?
This system talks like a SQL database…
But the performance is very different…
HDFS

Just in Time Data Warehouse w/ Spark
and more…
HDFS

Separate Compute vs. Storage
Benefits:
• No need to import your data into Spark to begin processing.
• Dynamically scale Spark clusters to match compute vs. storage needs.
• Choose the data store whose performance characteristics best fit your use case.

Today’s Goal
Know when to use other data stores besides file systems

Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores

Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads

Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale

Bad: Random Access
sqlContext.sql(
"select * from my_large_table where id=2134823")
Will this command run in Spark?
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.

Bad: Random Access
Solution: If you frequently access your data at random, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.

Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…

Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
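A minimal sketch of what Option 2 means, in plain Python rather than Spark so it runs anywhere. The `part-*.txt` and `compacted-*.txt` file names, the `compact` helper and the chunk size are all hypothetical. In Spark itself, compaction would typically mean reading the table and rewriting it with fewer partitions, for example with the DataFrame `coalesce` or `repartition` methods.

```python
import tempfile
from pathlib import Path

def compact(table_dir: Path, lines_per_file: int) -> None:
    """Rewrite many small part files into fewer, larger ones."""
    small_files = sorted(table_dir.glob("part-*.txt"))
    rows = []
    for f in small_files:
        rows.extend(f.read_text().splitlines())
    # Write the compacted files first, then delete the originals.
    for n, start in enumerate(range(0, len(rows), lines_per_file)):
        chunk = rows[start:start + lines_per_file]
        out = table_dir / f"compacted-{n:05d}.txt"
        out.write_text("\n".join(chunk) + "\n")
    for f in small_files:
        f.unlink()

table_dir = Path(tempfile.mkdtemp())

# Simulate 100 single-row inserts, each creating its own file.
for i in range(100):
    (table_dir / f"part-{i:05d}.txt").write_text(f"row-{i}\n")

files_before = len(list(table_dir.iterdir()))
compact(table_dir, lines_per_file=50)
files_after = len(list(table_dir.iterdir()))
rows_after = sum(len(f.read_text().splitlines()) for f in table_dir.iterdir())

print(files_before, files_after, rows_after)
```

The sketch loads every row into memory, which is fine for a toy table only. The point is the shape of the job: same rows, far fewer files for a query to open.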

Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you like.
File storage is cheap, so storing duplicate copies of your data is not an “anti-pattern”.

Bad: Frequent/Incremental Updates
Update statements are not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.

Bad: Frequent/Incremental Updates
Use Case: Up-to-date, live views of your SQL tables.
Tip: Use ClusterBy for fast joins, or bucketing with Spark 2.0.
Diagram: Database Snapshot + Incremental SQL Query
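One way to read the snapshot-plus-incremental idea: keep a periodic snapshot of the table, fetch only the rows changed since then with an incremental query, and overlay them on the snapshot. The slide does not show how the two are combined, so the sketch below is an assumption, as are the `live_view` helper and the `id`, `status` and `updated_at` fields.

```python
def live_view(snapshot, increments):
    """Overlay incremental rows on a snapshot, keeping the newest row per id."""
    latest = {row["id"]: row for row in snapshot}
    for row in increments:
        current = latest.get(row["id"])
        # A changed row replaces the snapshot row; an unseen id is a new row.
        if current is None or row["updated_at"] >= current["updated_at"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])

# Hypothetical snapshot taken at time 100.
snapshot = [
    {"id": 1, "status": "new", "updated_at": 100},
    {"id": 2, "status": "new", "updated_at": 100},
]

# Hypothetical result of an incremental query for rows changed after time 100.
increments = [
    {"id": 2, "status": "shipped", "updated_at": 130},
    {"id": 3, "status": "new", "updated_at": 140},
]

view = live_view(snapshot, increments)
for row in view:
    print(row)
```

In Spark the same overlay would be a join or union on the key column, which is where the tip about clustering or bucketing by that key applies.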

Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
HDFS

Bad: External Reporting w/ load
Too many concurrent requests will start to queue up.
HDFS

Bad: External Reporting w/ load
Solution: Write out to a DB as a cache to handle load.
HDFS
DB
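A small sketch of the "DB as a cache" pattern, again with sqlite3 standing in for the serving database. The aggregation here is plain Python; in practice the totals would come out of a Spark job. The `report_cache` table, the `clicks_for` helper and the sample events are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical batch output: (region, clicks) events aggregated by region.
events = [("us", 3), ("eu", 5), ("us", 4), ("apac", 1)]
totals = Counter()
for region, clicks in events:
    totals[region] += clicks

# Write the finished report to a database table that serves the readers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE report_cache (region TEXT PRIMARY KEY, clicks INTEGER)")
db.executemany("INSERT OR REPLACE INTO report_cache VALUES (?, ?)", totals.items())
db.commit()

def clicks_for(region):
    """Answer a report request from the cache with a keyed lookup."""
    row = db.execute(
        "SELECT clicks FROM report_cache WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else None

print(clicks_for("us"), clicks_for("eu"), clicks_for("unknown"))
```

Each report request becomes a cheap keyed read against the database, so concurrent users no longer queue up behind Spark jobs.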

Use Case: Advanced Analytics and Data Science

Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• All-in-one solution: data cleansing, featurization, training, testing, serving, etc.

Bad: Searching Content w/ load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results.

Use Case: Streaming and Realtime Analytics

Good: Periodic Scheduled Jobs
Schedule your workloads to run on a regular basis:
• Launch a dedicated cluster for important workloads.
• Output your results as reports, or store them in files or a database.
• Poor Man’s Streaming: Spark is fast, so you can shorten the interval and run the job frequently.

Bad: Low Latency Stream Processing
Spark Streaming can detect new files dropped into a folder and process them, but there is a delay while a whole file’s worth of data builds up.
Solution: Send data to message queues, not files.

Spark Summit East 2016
