Using Apache Spark and MySQL
for Data Analysis
Alexander Rubin, Sveta Smirnova
Percona
February 4, 2017
www.percona.com
Agenda
• Why Spark?
• Spark Examples
– Wikistats analysis with Spark
What is Spark anyway?
[Diagram labels: App; SQL / Protocol; Data; Nodes: parallel compute only; Local FS; ?]
Why Spark?
• In-memory processing with caching
• Massively parallel
• Direct access to data sources (e.g. MySQL)
>>> df = sqlContext.load(source="jdbc",
        url="jdbc:mysql://localhost?user=root",
        dbtable="ontime.ontime_sm")
• Can store data in Hadoop HDFS / S3 / local filesystem
• Native Python and R integration
Spark vs MySQL
Spark vs. MySQL for Big Data
MySQL:
• Indexes
• Partitioning
• “Sharding”
Spark:
• Full table scan
• Partitioning
• Map/Reduce
Spark (vs. MySQL)
• No indexes
• All processing is full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)
MySQL:
1 query = 1 CPU core
Indexes (BTree) for Big Data challenge
• Creating an index for Petabytes of data?
• Updating an index for Petabytes of data?
• Reading a terabyte index?
• Random read of Petabyte?
Full scan in parallel is better for big data
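
The "full scan in parallel" idea can be sketched without a cluster. The snippet below is a minimal illustration in plain Python, not Spark: the sample rows, the partition count, and the helper names (split_into_partitions, scan_partition, parallel_scan) are assumptions made for this example. Each partition is scanned in full by its own worker (the map step) and the partial counts are then summed (the reduce step). Threads are used only to keep the sketch portable; real CPU parallelism would need separate processes or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def scan_partition(partition, predicate):
    """Full scan of one partition: no index, every row is checked."""
    return sum(1 for row in partition if predicate(row))


def parallel_scan(rows, predicate, num_partitions=4):
    """Map: scan each partition in its own worker. Reduce: sum the counts."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(
            pool.map(lambda part: scan_partition(part, predicate), partitions))
    return sum(partial_counts)


# Hypothetical sample data: (url, num_requests) tuples
rows = [("page_%d" % i, i % 7) for i in range(1000)]
matches = parallel_scan(rows, lambda row: row[1] == 0)
print(matches)
```

The result does not depend on the number of partitions; only the amount of work per worker changes.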
ETL / Pipeline
MySQL:
1. Extract data from external source
2. Transform before loading
3. Load data into MySQL
Spark:
1. Extract data from external source
2. Load data or rsync to all Spark nodes
3. Transform data / Analyze data / Visualize data; parallelism
Schema on Read
Schema on Write (MySQL):
• LOAD DATA INFILE will verify the input (validate)
• … indirect data conversion
• ... or fail if the number of columns is wrong
Schema on Read (Spark):
• No “load data” per se, nothing to validate here
• … Create external table or read csv
• ... will validate on “read” / select
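
Schema on read can be illustrated with a short sketch: the raw file is stored untouched, and types are applied only when a reader asks for rows. This is plain Python with the standard csv module, not Spark; the sample text, the SCHEMA tuple, and the function name read_with_schema are assumptions made for this example.

```python
import csv
import io

# Hypothetical raw file contents: space-separated, one malformed line
RAW = "en Main_Page 42 1000\nen Broken_Line\nde Hauptseite 7 500\n"

# Column names and types are applied at read time, not at load time
SCHEMA = (("project", str), ("title", str),
          ("num_requests", int), ("content_size", int))


def read_with_schema(text):
    """Apply the schema while reading; bad rows surface only now."""
    good, bad = [], []
    for fields in csv.reader(io.StringIO(text), delimiter=" "):
        if len(fields) != len(SCHEMA):
            bad.append(fields)          # wrong number of columns
            continue
        try:
            good.append({name: cast(value)
                         for (name, cast), value in zip(SCHEMA, fields)})
        except ValueError:
            bad.append(fields)          # value does not fit the declared type
    return good, bad


good, bad = read_with_schema(RAW)
print(len(good), len(bad))
```

Nothing rejected the malformed line when the data was stored; it is only discovered when the data is read.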
Example: Loading wikistat into MySQL
1. Extract data from external source and uncompress!
2. Load data into MySQL and transform
Wikipedia page counts – download, >10TB
http://dumps.wikimedia.org/other/pagecounts-raw/
load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
title_md5=unhex(md5(title));
Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ indexes)
• 1 hour of wikistats = about 1 minute to load
• 1 year will load in 6 days
– (8765.81 hours in 1 year)
• 6 years => more than 1 month to load
Not even counting the insert time degradation…
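
The estimate above is simple arithmetic and can be checked in a few lines. The constants come from the slide; the function name load_days is an assumption made for this example, and the calculation ignores the insert-time degradation mentioned above.

```python
HOURS_PER_YEAR = 8765.81         # from the slide
SECONDS_PER_HOURLY_FILE = 60.0   # the slide rounds one hourly file to 1 minute
SECONDS_PER_DAY = 86400.0


def load_days(years, seconds_per_file=SECONDS_PER_HOURLY_FILE):
    """Days needed to load `years` of hourly files, one after another."""
    total_seconds = years * HOURS_PER_YEAR * seconds_per_file
    return total_seconds / SECONDS_PER_DAY


one_year = load_days(1)
six_years = load_days(6)
innodb_one_year = load_days(1, seconds_per_file=52.34)  # measured InnoDB time

print(round(one_year, 2), round(six_years, 2), round(innodb_one_year, 2))
```

One year comes out at roughly six days and six years at more than 36 days, which matches the "more than 1 month" figure.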
Loading wikistat as is into Spark
• Just copy files to storage (AWS S3 / local / etc)…
– And create SQL structure
• Or read csv, aggregate/filter in Spark and
– load the aggregated data into MySQL
Loading wikistat as is into Spark
• How fast to search?
– Depends upon the number of nodes
• 1000-node Spark cluster
– 4.5 TB, 104 billion records
– Exec time: 45 sec
– Scanning 4.5 TB of data
• http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
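
The figures quoted for the 1000-node cluster imply a scan rate that is easy to work out. The sketch below assumes decimal units (1 TB = 1000 GB) and an even spread of data across nodes; both are assumptions of this example, not statements from the cited presentation.

```python
NODES = 1000
DATA_TB = 4.5
RECORDS = 104e9
EXEC_SECONDS = 45.0

# Aggregate scan rate over the whole cluster
aggregate_gb_per_sec = DATA_TB * 1000 / EXEC_SECONDS
# Per-node rate, assuming the data is spread evenly
per_node_mb_per_sec = aggregate_gb_per_sec * 1000 / NODES
records_per_sec = RECORDS / EXEC_SECONDS

print(aggregate_gb_per_sec, per_node_mb_per_sec, records_per_sec)
```

Under these assumptions each node only has to scan about 100 MB per second, a rate that is modest for a single machine; the speed comes from the number of nodes.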
Pipelines: MySQL vs Spark
Spark and WikiStats: load pipeline
Row(project=p[0],
url=urllib.unquote(p[1]).lower(),
num_requests=int(p[2]),
content_size=int(p[3])))
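
The Row(...) fragment above maps the four space-separated fields of a pagecounts line to typed columns. A minimal sketch of that parsing step in Python 3, runnable without Spark, follows. It makes several assumptions: a namedtuple stands in for pyspark.sql.Row, the function name parse_line and the sample line are hypothetical, and urllib.parse.unquote is used because that is where Python 3 keeps the function the slide calls urllib.unquote.

```python
from collections import namedtuple
from urllib.parse import unquote

# Stand-in for pyspark.sql.Row so the sketch runs without Spark
Row = namedtuple("Row", ["project", "url", "num_requests", "content_size"])


def parse_line(line):
    """Split one space-separated pagecounts line into typed fields."""
    p = line.rstrip("\n").split(" ")
    if len(p) != 4:
        return None  # skip malformed lines instead of raising
    return Row(project=p[0],
               url=unquote(p[1]).lower(),   # decode %XX escapes, normalise case
               num_requests=int(p[2]),
               content_size=int(p[3]))


row = parse_line("en Heath%20Ledger 12 34567")
print(row)
```

In a Spark job a function like this would typically be passed to a map over the lines of the input files, with the None results filtered out afterwards.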
Save results to MySQL
# Aggregate one day of data per URL
group_res = sqlContext.sql(
    "SELECT '" + mydate + "' AS mydate, "
    "url, "
    "count(*) AS cnt, "
    "sum(num_requests) AS tot_visits "
    "FROM wikistats "
    "GROUP BY url")
# Save to MySQL
mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
group_res.write.jdbc(url=mysql_url,
    table="wikistats.wikistats_by_day_spark",
    mode="append")
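
The aggregate-then-append pattern above can be tried locally without Spark or MySQL. The sketch below uses Python's built-in sqlite3 module as a stand-in for both; the sample rows, the date format, and the shortened table name wikistats_by_day are assumptions made for this example. It also binds the date as a query parameter rather than concatenating it into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats (url TEXT, num_requests INTEGER)")
conn.execute("CREATE TABLE wikistats_by_day "
             "(mydate TEXT, url TEXT, cnt INTEGER, tot_visits INTEGER)")

# Hypothetical hourly rows for one day
conn.executemany("INSERT INTO wikistats VALUES (?, ?)",
                 [("heath_ledger", 10), ("heath_ledger", 5), ("cloverfield", 3)])

mydate = "2008-01-01"  # assumed date format
# Aggregate per URL and append the result to the daily table
conn.execute(
    "INSERT INTO wikistats_by_day "
    "SELECT ?, url, count(*), sum(num_requests) FROM wikistats GROUP BY url",
    (mydate,))

result = conn.execute(
    "SELECT mydate, url, cnt, tot_visits "
    "FROM wikistats_by_day ORDER BY url").fetchall()
print(result)
```

Only the aggregated rows, one per URL per day, end up in the target table, which is what keeps the MySQL side small.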
Multi-Threaded Inserts
PySpark: CPU
Cpu0 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu17 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
Monitoring your jobs
Search the WikiStats in MySQL
10 most frequently queried wiki pages in January 2008
mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM
wikistats_by_day_spark where lower(url) not like '%special%' and lower(url) not like
'%page%' and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by
lower(url) order by max_visits desc limit 10;
+--------------------------------------------------------+------------+----------+
| lurl                                                   | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger                                           |    4247338 |      131 |
| cloverfield                                            |    3846404 |      131 |
| barack_obama                                           |    2238406 |      153 |
| 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
| the_dark_knight_(film)                                 |    1417186 |       64 |
| martin_luther_king,_jr.                                |    1394934 |      136 |
| deaths_in_2008                                         |    1372510 |       67 |
| united_states                                          |    1357253 |      167 |
| scientology                                            |    1349654 |      108 |
| portal:current_events                                  |    1261538 |      125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)
Search the WikiStats in SparkSQL
spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
path "/ssd/wikistats_parquet_bydate"
);
Time taken: 3.466 seconds
spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM
wikistats_parquet where lower(url) not like '%special%' and lower(url) not like '%page%'
and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by lower(url)
order by max_visits desc limit 10;
heath_ledger 4247335 42
cloverfield 3846400 42
barack_obama 2238402 53
1925_in_baseball#negro_league_baseball_final_standings 1791341 11
the_dark_knight_(film) 1417183 36
martin_luther_king,_jr. 1394934 46
deaths_in_2008 1372510 38
united_states 1357251 55
scientology 1349650 44
portal:current_events 1261305 44
Time taken: 1239.014 seconds, Fetched 10 row(s)
10 most frequently queried wiki pages in January 2008 (about 20 min)
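
The two reported run times can be put side by side with a little arithmetic. The timings are copied from the MySQL and spark-sql output; the helper name to_seconds is an assumption made for this example. The ratio is only a rough indication, since the two queries ran against differently prepared datasets.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an hours/minutes/seconds reading into seconds."""
    return hours * 3600 + minutes * 60 + seconds


mysql_seconds = to_seconds(hours=1, minutes=22, seconds=10.02)  # MySQL client output
spark_seconds = 1239.014                                        # spark-sql "Time taken"

spark_minutes = spark_seconds / 60
speedup = mysql_seconds / spark_seconds

print(round(mysql_seconds, 2), round(spark_minutes, 1), round(speedup, 1))
```

The Spark query finishes in a little over 20 minutes, roughly four times faster than the MySQL run.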
Apache Drill
Treat any data source as a table (even if it is not)
Querying MongoDB with SQL
Magic?
!=
Recap…
MySQL:
1. Search full dataset
• May be pre-filtered
• Not aggregated
2. No parallelism
3. Based on index?
4. InnoDB <> Columnar
5. Partitioning?
Spark:
1. Dataset is already
– Filtered (only site=“en”)
– Aggregated (group by url)
2. Parallelism (+)
3. Not based on index
4. Columnar (+)
5. Partitioning (+)
Thank you!
https://www.linkedin.com/in/alexanderrubin
Alexander Rubin
Ad

Recommended

MVVM - Model View ViewModel
MVVM - Model View ViewModel
Dareen Alhiyari
 
[Retail & CPG Day 2019] 유통 고객의 AWS 도입 동향 - 박동국, AWS 어카운트 매니저, 김준성, AWS어카운트 매니저
[Retail & CPG Day 2019] 유통 고객의 AWS 도입 동향 - 박동국, AWS 어카운트 매니저, 김준성, AWS어카운트 매니저
Amazon Web Services Korea
 
Introduction to Mulesoft
Introduction to Mulesoft
venkata20k
 
Pengenalan Framework CodeIgniter
Pengenalan Framework CodeIgniter
I Putu Arya Dharmaadi
 
S3, 넌 이것까지 할 수있네 (Amazon S3 신규 기능 소개) - 김세준, AWS 솔루션즈 아키텍트:: AWS Summit Onli...
S3, 넌 이것까지 할 수있네 (Amazon S3 신규 기능 소개) - 김세준, AWS 솔루션즈 아키텍트:: AWS Summit Onli...
Amazon Web Services Korea
 
비용 관점에서 AWS 클라우드 아키텍처 디자인하기::류한진::AWS Summit Seoul 2018
비용 관점에서 AWS 클라우드 아키텍처 디자인하기::류한진::AWS Summit Seoul 2018
Amazon Web Services Korea
 
Cloud-Native Fundamentals: An Introduction to 12-Factor Applications
Cloud-Native Fundamentals: An Introduction to 12-Factor Applications
VMware Tanzu
 
Event Driven Architecture
Event Driven Architecture
Chris Patterson
 
Event Driven Architecture
Event Driven Architecture
Stefan Norberg
 
Manchester MuleSoft Meetup #6 - Runtime Fabric with Mulesoft
Manchester MuleSoft Meetup #6 - Runtime Fabric with Mulesoft
Akshata Sawant
 
AWS CloudFront 가속 및 DDoS 방어
AWS CloudFront 가속 및 DDoS 방어
Kyle(KY) Yang
 
Application server vs Web Server
Application server vs Web Server
Gagandeep Singh
 
Html power point
Html power point
minmon
 
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
Open Source Consulting
 
MLOps 플랫폼을 만드는 과정의 고민과 해결 사례 공유(feat. Kubeflow)
MLOps 플랫폼을 만드는 과정의 고민과 해결 사례 공유(feat. Kubeflow)
Jaeyeon Kim
 
Java (spring) vs javascript (node.js)
Java (spring) vs javascript (node.js)
류 영수
 
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
Amazon Web Services Korea
 
Frequently asked MuleSoft Interview Questions and Answers from Techlightning
Frequently asked MuleSoft Interview Questions and Answers from Techlightning
Arul ChristhuRaj Alphonse
 
데브옵스 엔지니어를 위한 신규 운영 서비스 - 김필중, AWS 개발 전문 솔루션즈 아키텍트 / 김현민, 메가존클라우드 솔루션즈 아키텍트 :...
데브옵스 엔지니어를 위한 신규 운영 서비스 - 김필중, AWS 개발 전문 솔루션즈 아키텍트 / 김현민, 메가존클라우드 솔루션즈 아키텍트 :...
Amazon Web Services Korea
 
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon Web Services Korea
 
성능 최대화를 위한 CloudFront 설정 Best Practice
성능 최대화를 위한 CloudFront 설정 Best Practice
GS Neotek
 
DDoS and WAF basics
DDoS and WAF basics
Yoohyun Kim
 
Design Pattern - MVC, MVP and MVVM
Design Pattern - MVC, MVP and MVVM
Mudasir Qazi
 
CQRS and Event Sourcing with Axon Framework
CQRS and Event Sourcing with Axon Framework
João Rafael Campos da Silva
 
Circuit Breaker Pattern
Circuit Breaker Pattern
Vikash Kodati
 
Why to Cloud Native
Why to Cloud Native
Karthik Gaekwad
 
[웨비나] 다중 AWS 계정에서의 CI/CD 구축
[웨비나] 다중 AWS 계정에서의 CI/CD 구축
BESPIN GLOBAL
 
Architecting an Enterprise API Management Strategy
Architecting an Enterprise API Management Strategy
WSO2
 
Эффективная отладка репликации MySQL
Эффективная отладка репликации MySQL
Sveta Smirnova
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architectures
FromDual GmbH
 

More Related Content

What's hot (20)

Event Driven Architecture
Event Driven Architecture
Stefan Norberg
 
Manchester MuleSoft Meetup #6 - Runtime Fabric with Mulesoft
Manchester MuleSoft Meetup #6 - Runtime Fabric with Mulesoft
Akshata Sawant
 
AWS CloudFront 가속 및 DDoS 방어
AWS CloudFront 가속 및 DDoS 방어
Kyle(KY) Yang
 
Application server vs Web Server
Application server vs Web Server
Gagandeep Singh
 
Html power point
Html power point
minmon
 
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
Open Source Consulting
 
MLOps 플랫폼을 만드는 과정의 고민과 해결 사례 공유(feat. Kubeflow)
MLOps 플랫폼을 만드는 과정의 고민과 해결 사례 공유(feat. Kubeflow)
Jaeyeon Kim
 
Java (spring) vs javascript (node.js)
Java (spring) vs javascript (node.js)
류 영수
 
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
Amazon Web Services Korea
 
Frequently asked MuleSoft Interview Questions and Answers from Techlightning
Frequently asked MuleSoft Interview Questions and Answers from Techlightning
Arul ChristhuRaj Alphonse
 
데브옵스 엔지니어를 위한 신규 운영 서비스 - 김필중, AWS 개발 전문 솔루션즈 아키텍트 / 김현민, 메가존클라우드 솔루션즈 아키텍트 :...
데브옵스 엔지니어를 위한 신규 운영 서비스 - 김필중, AWS 개발 전문 솔루션즈 아키텍트 / 김현민, 메가존클라우드 솔루션즈 아키텍트 :...
Amazon Web Services Korea
 
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon Web Services Korea
 
성능 최대화를 위한 CloudFront 설정 Best Practice
성능 최대화를 위한 CloudFront 설정 Best Practice
GS Neotek
 
DDoS and WAF basics
DDoS and WAF basics
Yoohyun Kim
 
Design Pattern - MVC, MVP and MVVM
Design Pattern - MVC, MVP and MVVM
Mudasir Qazi
 
CQRS and Event Sourcing with Axon Framework
CQRS and Event Sourcing with Axon Framework
João Rafael Campos da Silva
 
Circuit Breaker Pattern
Circuit Breaker Pattern
Vikash Kodati
 
Why to Cloud Native
Why to Cloud Native
Karthik Gaekwad
 
[웨비나] 다중 AWS 계정에서의 CI/CD 구축
[웨비나] 다중 AWS 계정에서의 CI/CD 구축
BESPIN GLOBAL
 
Architecting an Enterprise API Management Strategy
Architecting an Enterprise API Management Strategy
WSO2
 
Event Driven Architecture
Event Driven Architecture
Stefan Norberg
 
Manchester MuleSoft Meetup #6 - Runtime Fabric with Mulesoft
Manchester MuleSoft Meetup #6 - Runtime Fabric with Mulesoft
Akshata Sawant
 
AWS CloudFront 가속 및 DDoS 방어
AWS CloudFront 가속 및 DDoS 방어
Kyle(KY) Yang
 
Application server vs Web Server
Application server vs Web Server
Gagandeep Singh
 
Html power point
Html power point
minmon
 
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
[오픈소스컨설팅] 서비스 메쉬(Service mesh)
Open Source Consulting
 
MLOps 플랫폼을 만드는 과정의 고민과 해결 사례 공유(feat. Kubeflow)
MLOps 플랫폼을 만드는 과정의 고민과 해결 사례 공유(feat. Kubeflow)
Jaeyeon Kim
 
Java (spring) vs javascript (node.js)
Java (spring) vs javascript (node.js)
류 영수
 
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
Amazon Web Services Korea
 
Frequently asked MuleSoft Interview Questions and Answers from Techlightning
Frequently asked MuleSoft Interview Questions and Answers from Techlightning
Arul ChristhuRaj Alphonse
 
데브옵스 엔지니어를 위한 신규 운영 서비스 - 김필중, AWS 개발 전문 솔루션즈 아키텍트 / 김현민, 메가존클라우드 솔루션즈 아키텍트 :...
데브옵스 엔지니어를 위한 신규 운영 서비스 - 김필중, AWS 개발 전문 솔루션즈 아키텍트 / 김현민, 메가존클라우드 솔루션즈 아키텍트 :...
Amazon Web Services Korea
 
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon Web Services Korea
 
성능 최대화를 위한 CloudFront 설정 Best Practice
성능 최대화를 위한 CloudFront 설정 Best Practice
GS Neotek
 
DDoS and WAF basics
DDoS and WAF basics
Yoohyun Kim
 
Design Pattern - MVC, MVP and MVVM
Design Pattern - MVC, MVP and MVVM
Mudasir Qazi
 
Circuit Breaker Pattern
Circuit Breaker Pattern
Vikash Kodati
 
[웨비나] 다중 AWS 계정에서의 CI/CD 구축
[웨비나] 다중 AWS 계정에서의 CI/CD 구축
BESPIN GLOBAL
 
Architecting an Enterprise API Management Strategy
Architecting an Enterprise API Management Strategy
WSO2
 

Viewers also liked (20)

Эффективная отладка репликации MySQL
Эффективная отладка репликации MySQL
Sveta Smirnova
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architectures
FromDual GmbH
 
Galera cluster for high availability
Galera cluster for high availability
Mydbops
 
What you wanted to know about MySQL, but could not find using inernal instrum...
What you wanted to know about MySQL, but could not find using inernal instrum...
Sveta Smirnova
 
SQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and Profit
Karwin Software Solutions LLC
 
Hbase源码初探
Hbase源码初探
zhaolinjnu
 
MySQL High Availability Deep Dive
MySQL High Availability Deep Dive
hastexo
 
2010丹臣的思考
2010丹臣的思考
zhaolinjnu
 
Requirements the Last Bottleneck
Requirements the Last Bottleneck
Karwin Software Solutions LLC
 
MySQL InnoDB 源码实现分析(一)
MySQL InnoDB 源码实现分析(一)
frogd
 
Extensible Data Modeling
Extensible Data Modeling
Karwin Software Solutions LLC
 
MySQL High Availability Solutions
MySQL High Availability Solutions
Lenz Grimmer
 
Why MySQL High Availability Matters
Why MySQL High Availability Matters
Matt Lord
 
Mysql For Developers
Mysql For Developers
Carol McDonald
 
Redis介绍
Redis介绍
zhaolinjnu
 
The nightmare of locking, blocking and isolation levels!
The nightmare of locking, blocking and isolation levels!
Boris Hristov
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suite
Kenny Gryp
 
Lessons Learned: Troubleshooting Replication
Lessons Learned: Troubleshooting Replication
Sveta Smirnova
 
Explain
Explain
Ligaya Turmelle
 
Advanced mysql replication techniques
Advanced mysql replication techniques
Giuseppe Maxia
 
Эффективная отладка репликации MySQL
Эффективная отладка репликации MySQL
Sveta Smirnova
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architectures
FromDual GmbH
 
Galera cluster for high availability
Galera cluster for high availability
Mydbops
 
What you wanted to know about MySQL, but could not find using inernal instrum...
What you wanted to know about MySQL, but could not find using inernal instrum...
Sveta Smirnova
 
Hbase源码初探
Hbase源码初探
zhaolinjnu
 
MySQL High Availability Deep Dive
MySQL High Availability Deep Dive
hastexo
 
2010丹臣的思考
2010丹臣的思考
zhaolinjnu
 
MySQL InnoDB 源码实现分析(一)
MySQL InnoDB 源码实现分析(一)
frogd
 
MySQL High Availability Solutions
MySQL High Availability Solutions
Lenz Grimmer
 
Why MySQL High Availability Matters
Why MySQL High Availability Matters
Matt Lord
 
The nightmare of locking, blocking and isolation levels!
The nightmare of locking, blocking and isolation levels!
Boris Hristov
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suite
Kenny Gryp
 
Lessons Learned: Troubleshooting Replication
Lessons Learned: Troubleshooting Replication
Sveta Smirnova
 
Advanced mysql replication techniques
Advanced mysql replication techniques
Giuseppe Maxia
 
Ad

Similar to Using Apache Spark and MySQL for Data Analysis (20)

Fosdem managing my sql with percona toolkit
Fosdem managing my sql with percona toolkit
Frederic Descamps
 
Loadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkit
Frederic Descamps
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
Wim Godden
 
Percona Live UK 2014 Part III
Percona Live UK 2014 Part III
Alkin Tezuysal
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Performance optimisations PHP meetup Rotterdam
Performance optimisations PHP meetup Rotterdam
Dimitri Vanoverbeke
 
Infrastructure review - Shining a light on the Black Box
Infrastructure review - Shining a light on the Black Box
Miklos Szel
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Percona Live 2012PPT: MySQL Query optimization
Percona Live 2012PPT: MySQL Query optimization
mysqlops
 
20080611accel
20080611accel
Jeff Hammerbacher
 
介绍 Percona 服务器 XtraDB 和 Xtrabackup
介绍 Percona 服务器 XtraDB 和 Xtrabackup
YUCHENG HU
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Percona toolkit
Percona toolkit
Karwin Software Solutions LLC
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
Wim Godden
 
MySQL Ecosystem in 2020
MySQL Ecosystem in 2020
Alkin Tezuysal
 
SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)
Robert Swisher
 
Finding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQL
Olivier Doucet
 
Running a Realtime Stats Service on MySQL
Running a Realtime Stats Service on MySQL
Kazuho Oku
 
Beyond php it's not (just) about the code
Beyond php it's not (just) about the code
Wim Godden
 
Fosdem managing my sql with percona toolkit
Fosdem managing my sql with percona toolkit
Frederic Descamps
 
Loadays managing my sql with percona toolkit
Loadays managing my sql with percona toolkit
Frederic Descamps
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
Wim Godden
 
Percona Live UK 2014 Part III
Percona Live UK 2014 Part III
Alkin Tezuysal
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Performance optimisations PHP meetup Rotterdam
Performance optimisations PHP meetup Rotterdam
Dimitri Vanoverbeke
 
Infrastructure review - Shining a light on the Black Box
Infrastructure review - Shining a light on the Black Box
Miklos Szel
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Percona Live 2012PPT: MySQL Query optimization
Percona Live 2012PPT: MySQL Query optimization
mysqlops
 
介绍 Percona 服务器 XtraDB 和 Xtrabackup
介绍 Percona 服务器 XtraDB 和 Xtrabackup
YUCHENG HU
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
Wim Godden
 
MySQL Ecosystem in 2020
MySQL Ecosystem in 2020
Alkin Tezuysal
 
SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)
Robert Swisher
 
Finding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQL
Olivier Doucet
 
Running a Realtime Stats Service on MySQL
Running a Realtime Stats Service on MySQL
Kazuho Oku
 
Beyond php it's not (just) about the code
Beyond php it's not (just) about the code
Wim Godden
 
Ad

More from Sveta Smirnova (20)

War Story: Removing Offensive Language from Percona Toolkit
War Story: Removing Offensive Language from Percona Toolkit
Sveta Smirnova
 
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
Sveta Smirnova
 
Database in Kubernetes: Diagnostics and Monitoring
Database in Kubernetes: Diagnostics and Monitoring
Sveta Smirnova
 
MySQL Database Monitoring: Must, Good and Nice to Have
MySQL Database Monitoring: Must, Good and Nice to Have
Sveta Smirnova
 
MySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for Developers
Sveta Smirnova
 
MySQL Performance for DevOps
MySQL Performance for DevOps
Sveta Smirnova
 
MySQL Test Framework для поддержки клиентов и верификации багов
MySQL Test Framework для поддержки клиентов и верификации багов
Sveta Smirnova
 
MySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your Business
Sveta Smirnova
 
Introduction into MySQL Query Tuning for Dev[Op]s
Introduction into MySQL Query Tuning for Dev[Op]s
Sveta Smirnova
 
Производительность MySQL для DevOps
Производительность MySQL для DevOps
Sveta Smirnova
 
MySQL Performance for DevOps
MySQL Performance for DevOps
Sveta Smirnova
 
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
Sveta Smirnova
 
How to migrate from MySQL to MariaDB without tears
How to migrate from MySQL to MariaDB without tears
Sveta Smirnova
 
Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...
Sveta Smirnova
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
Sveta Smirnova
 
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Sveta Smirnova
 
How to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with Galera
Sveta Smirnova
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
Sveta Smirnova
 
Introduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]s
Sveta Smirnova
 
Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?
Sveta Smirnova
 
War Story: Removing Offensive Language from Percona Toolkit
War Story: Removing Offensive Language from Percona Toolkit
Sveta Smirnova
 
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
Sveta Smirnova
 
Database in Kubernetes: Diagnostics and Monitoring
Database in Kubernetes: Diagnostics and Monitoring
Sveta Smirnova
 
MySQL Database Monitoring: Must, Good and Nice to Have
MySQL Database Monitoring: Must, Good and Nice to Have
Sveta Smirnova
 
MySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for Developers
Sveta Smirnova
 
MySQL Performance for DevOps
MySQL Performance for DevOps
Sveta Smirnova
 
MySQL Test Framework для поддержки клиентов и верификации багов
MySQL Test Framework для поддержки клиентов и верификации багов
Sveta Smirnova
 
MySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your Business
Sveta Smirnova
 
Introduction into MySQL Query Tuning for Dev[Op]s
Introduction into MySQL Query Tuning for Dev[Op]s
Sveta Smirnova
 
Производительность MySQL для DevOps
Производительность MySQL для DevOps
Sveta Smirnova
 
MySQL Performance for DevOps
MySQL Performance for DevOps
Sveta Smirnova
 
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
Sveta Smirnova
 
How to migrate from MySQL to MariaDB without tears
How to migrate from MySQL to MariaDB without tears
Sveta Smirnova
 
Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...
Sveta Smirnova
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
Sveta Smirnova
 
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Sveta Smirnova
 
How to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with Galera
Sveta Smirnova
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
Sveta Smirnova
 
Introduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]s
Sveta Smirnova
 
Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?
Sveta Smirnova
 

Recently uploaded (20)

Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
BradBedford3
 
University Campus Navigation for All - Peak of Data & AI
University Campus Navigation for All - Peak of Data & AI
Safe Software
 
HYBRIDIZATION OF ALKANES AND ALKENES ...
HYBRIDIZATION OF ALKANES AND ALKENES ...
karishmaduhijod1
 
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
Hassan Abid
 
Top Time Tracking Solutions for Accountants
Top Time Tracking Solutions for Accountants
oliviareed320
 
Streamlining CI/CD with FME Flow: A Practical Guide
Streamlining CI/CD with FME Flow: A Practical Guide
Safe Software
 
Threat Modeling a Batch Job Framework - Teri Radichel - AWS re:Inforce 2025
Threat Modeling a Batch Job Framework - Teri Radichel - AWS re:Inforce 2025
2nd Sight Lab
 
Key Challenges in Troubleshooting Customer On-Premise Applications
Key Challenges in Troubleshooting Customer On-Premise Applications
Tier1 app
 
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
Shane Coughlan
 
Complete Guideliness to Build an Effective Maintenance Plan.ppt
Complete Guideliness to Build an Effective Maintenance Plan.ppt
QualityzeInc1
 
Sap basis role in public cloud in s/4hana.pptx
Sap basis role in public cloud in s/4hana.pptx
htmlprogrammer987
 
IDM Crack with Internet Download Manager 6.42 Build 41 [Latest 2025]
IDM Crack with Internet Download Manager 6.42 Build 41 [Latest 2025]
pcprocore
 
Best Practice for LLM Serving in the Cloud
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
Digital Transformation: Automating the Placement of Medical Interns
Digital Transformation: Automating the Placement of Medical Interns
Safe Software
 
Complete WordPress Programming Guidance Book
Complete WordPress Programming Guidance Book
Shabista Imam
 
Best Software Development at Best Prices
Best Software Development at Best Prices
softechies7
 
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
 
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
Why Edge Computing Matters in Mobile Application Tech.pdf
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
 
Simplify Task, Team, and Project Management with Orangescrum Work
Simplify Task, Team, and Project Management with Orangescrum Work
Orangescrum
 
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
Foundations of Marketo Engage - Programs, Campaigns & Beyond - June 2025
BradBedford3
 
University Campus Navigation for All - Peak of Data & AI
University Campus Navigation for All - Peak of Data & AI
Safe Software
 
HYBRIDIZATION OF ALKANES AND ALKENES ...
HYBRIDIZATION OF ALKANES AND ALKENES ...
karishmaduhijod1
 
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
On-Device AI: Is It Time to Go All-In, or Do We Still Need the Cloud?
Hassan Abid
 

Using Apache Spark and MySQL for Data Analysis

  • 1. Using Apache Spark and MySQL for Data Analysis. Alexander Rubin, Sveta Smirnova, Percona. February 4, 2017
  • 2. www.percona.com Agenda
      • Why Spark?
      • Spark Examples
          – Wikistats analysis with Spark
  • 3. What is Spark anyway? [Diagram labels: App; SQL / Protocol; Data; Nodes; Parallel compute only; Local FS]
  • 4. Why Spark?
      • In-memory processing with caching
      • Massively parallel
      • Direct access to data sources (e.g. MySQL)
          >>> df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost?user=root", dbtable="ontime.ontime_sm")
      • Can store data in Hadoop HDFS / S3 / local filesystem
      • Native Python and R integration
  • 6. Spark vs. MySQL for Big Data
      MySQL: Indexes; Partitioning; “Sharding”
      Spark: Full table scan; Partitioning; Map/Reduce
  • 7. Spark (vs. MySQL)
      • No indexes
      • All processing is full scan
      • BUT: distributed and parallel
      • No transactions
      • High latency (usually)
      MySQL: 1 query = 1 CPU core
  • 8. Indexes (BTree) for Big Data challenge
      • Creating an index for petabytes of data?
      • Updating an index for petabytes of data?
      • Reading a terabyte index?
      • Random read of a petabyte?
      Full scan in parallel is better for big data
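The idea of a parallel full scan can be illustrated with a toy map/reduce in plain Python. This is only a sketch, not Spark's API: the function names `scan_partition` and `parallel_count` are hypothetical, the data is made up, and threads stand in for cluster nodes (in CPython they do not give real CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows, predicate):
    """Map step: full scan of one partition, no index involved."""
    return sum(1 for row in rows if predicate(row))

def parallel_count(partitions, predicate, workers=4):
    """Scan all partitions concurrently, then reduce the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda part: scan_partition(part, predicate), partitions)
        return sum(partial)

# Made-up data set, split into 4 partitions (one per hypothetical node).
data = list(range(1000))
partitions = [data[i::4] for i in range(4)]

# Count the multiples of 10 by scanning every partition.
print(parallel_count(partitions, lambda x: x % 10 == 0))  # 100
```

Each worker reads its whole partition, so the cost of a query depends on the partition size and the number of workers rather than on an index.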
  • 9. ETL / Pipeline
      With MySQL:
      1. Extract data from external source
      2. Transform before loading
      3. Load data into MySQL
      With Spark:
      1. Extract data from external source
      2. Load data or rsync to all Spark nodes
      3. Transform data / Analyze data / Visualize data; parallelism
  • 10. Schema on Read
      Schema on Write
      • LOAD DATA INFILE will verify the input (validate)
      • … indirect data conversion
      • … or fail if the number of columns is wrong
      Schema on Read
      • No “load data” per se, nothing to validate here
      • … Create external table or read csv
      • … will validate on “read” / select
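The difference between the two approaches can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names and the two sample lines are made up, and the column names are borrowed from the LOAD DATA statement on the next slide. It is not how MySQL or Spark implement validation.

```python
COLUMNS = ("project_name", "title", "num_requests", "content_size")

def parse(line, lineno):
    """Apply the schema to one space-delimited line, or fail."""
    fields = line.split(" ")
    if len(fields) != len(COLUMNS):
        raise ValueError("line %d: expected %d columns, got %d"
                         % (lineno, len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))

def load_schema_on_write(lines):
    """Schema on write: every row is validated while loading."""
    return [parse(line, n) for n, line in enumerate(lines, start=1)]

def read_schema_on_read(lines):
    """Schema on read: nothing is checked until a row is actually read."""
    for n, line in enumerate(lines, start=1):
        yield parse(line, n)

# Made-up input: the second line is missing a column.
RAW = ["en Main_Page 42 1000", "en Broken_Row 7"]

rows = read_schema_on_read(RAW)   # no error yet: nothing has been read
first = next(rows)                # the first row is validated only now
print(first["title"])             # Main_Page
```

Calling `load_schema_on_write(RAW)` fails immediately because of the second line, while the generator above only raises when the bad row is reached.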
  • 11. Example: Loading wikistat into MySQL
      1. Extract data from external source and uncompress!
      2. Load data into MySQL and Transform
      Wikipedia page counts – download, >10TB
      http://dumps.wikimedia.org/other/pagecounts-raw/
      load data local infile '$file'
      into table wikistats.wikistats_full
      CHARACTER SET latin1
      FIELDS TERMINATED BY ' '
      (project_name, title, num_requests, content_size)
      set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
          title_md5 = unhex(md5(title));
  • 12. Load timing per hour of wikistat
      • InnoDB: 52.34 sec
      • MyISAM: 11.08 sec (+ indexes)
      • 1 hour of wikistats = 1 minute
      • 1 year will load in 6 days
          – (8765.81 hours in 1 year)
      • 6 years = more than 1 month to load
      Not even counting the insert time degradation…
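The estimate on this slide can be checked with a few lines of arithmetic. A minimal sketch, assuming the slide's rounded rate of 1 minute of load time per hour of data; the names are illustrative.

```python
# Back-of-the-envelope check of the load-time estimate from the slide.
HOURS_PER_YEAR = 8765.81        # figure quoted on the slide
LOAD_MINUTES_PER_HOUR = 1.0     # slide's rounded rate: 1 hour of wikistats loads in about 1 minute

def load_days(years):
    """Days needed to load `years` of hourly files at the rate above."""
    minutes = years * HOURS_PER_YEAR * LOAD_MINUTES_PER_HOUR
    return minutes / 60 / 24

print(round(load_days(1), 1))   # 6.1  (about 6 days for one year)
print(round(load_days(6), 1))   # 36.5 (more than a month for six years)
```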
  • 13. Loading wikistat as is into Spark
      • Just copy files to storage (AWS S3 / local / etc.)…
          – And create SQL structure
      • Or read csv, aggregate/filter in Spark and
          – load the aggregated data into MySQL
  • 14. Loading wikistat as is into Spark
      • How fast to search?
          – Depends upon the number of nodes
      • 1000-node Spark cluster
          – 4.5 TB, 104 billion records
          – Exec time: 45 sec
          – Scanning 4.5 TB of data
      • http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
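The figures quoted for the 1000-node cluster imply a scan rate that is easy to work out. A rough sketch, assuming decimal units (1 TB = 10^12 bytes) and an even spread of data across nodes; neither assumption is stated on the slide.

```python
# Scan throughput implied by the slide's figures.
TOTAL_BYTES = 4.5e12      # 4.5 TB, decimal units (assumption)
TOTAL_RECORDS = 104e9     # 104 billion records
EXEC_SECONDS = 45
NODES = 1000

aggregate_bytes_per_sec = TOTAL_BYTES / EXEC_SECONDS
per_node_bytes_per_sec = aggregate_bytes_per_sec / NODES   # assumes even distribution
records_per_sec = TOTAL_RECORDS / EXEC_SECONDS

print("cluster: %.0f GB/s" % (aggregate_bytes_per_sec / 1e9))
print("per node: %.0f MB/s" % (per_node_bytes_per_sec / 1e6))
print("records: %.2f billion/s" % (records_per_sec / 1e9))
```

Under these assumptions each node scans on the order of 100 MB/s, so the short execution time comes from the node count rather than from any single fast machine.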
  • 16. Spark and WikiStats: load pipeline
      Row(project=p[0], url=urllib.unquote(p[1]).lower(), num_requests=int(p[2]), content_size=int(p[3])))
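The per-line transformation in this fragment can be tried without a Spark cluster. The sketch below is an assumption-laden stand-in: a `namedtuple` replaces `pyspark.sql.Row`, `urllib.parse.unquote` is the Python 3 name of the Python 2 `urllib.unquote` used on the slide, the helper `parse_line` is hypothetical, and the sample line is made up. The space delimiter follows the LOAD DATA statement shown earlier.

```python
from collections import namedtuple
from urllib.parse import unquote

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (assumption).
Row = namedtuple("Row", ["project", "url", "num_requests", "content_size"])

def parse_line(line):
    """Split one space-delimited pagecounts line into typed fields."""
    p = line.rstrip("\n").split(" ")
    return Row(project=p[0],
               url=unquote(p[1]).lower(),   # decode %xx escapes, then lower-case
               num_requests=int(p[2]),
               content_size=int(p[3]))

# Made-up sample line in the project / title / requests / size layout.
row = parse_line("en Heath_Ledger%27s_Joker 42 123456")
print(row.url)            # heath_ledger's_joker
print(row.num_requests)   # 42
```

In a Spark job a function like this would typically be passed to a map over the lines of the input files; that wiring is not shown on the slide and is left out here.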
  • 17. Save results to MySQL
      group_res = sqlContext.sql(
          "SELECT '" + mydate + "' as mydate, url, count(*) as cnt, sum(num_requests) as tot_visits FROM wikistats GROUP BY url")
      # Save to MySQL
      mysql_url = "jdbc:mysql://localhost?user=wikistats&password=wikistats"
      group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark", mode="append")
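To see what the aggregation query produces, the same GROUP BY can be run against a tiny in-memory table. This sketch swaps in Python's built-in `sqlite3` for Spark SQL, uses made-up rows, and binds the date as a parameter instead of concatenating it into the SQL string; it shows the shape of the result, not the Spark job itself.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Spark "wikistats" table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikistats "
             "(project TEXT, url TEXT, num_requests INTEGER, content_size INTEGER)")
conn.executemany("INSERT INTO wikistats VALUES (?, ?, ?, ?)", [
    ("en", "heath_ledger", 10, 100),   # made-up rows
    ("en", "heath_ledger", 5, 100),
    ("en", "cloverfield", 7, 200),
])

mydate = "2008-01-01"
rows = conn.execute(
    "SELECT ? AS mydate, url, count(*) AS cnt, sum(num_requests) AS tot_visits "
    "FROM wikistats GROUP BY url ORDER BY url",
    (mydate,),
).fetchall()

for r in rows:
    print(r)
# ('2008-01-01', 'cloverfield', 1, 7)
# ('2008-01-01', 'heath_ledger', 2, 15)
```

One row per URL and day is what gets appended to `wikistats.wikistats_by_day_spark`, which is why the MySQL table ends up far smaller than the raw hourly files.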
  • 19. PySpark: CPU
      Cpu0  : 94.4%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu1  :  5.7%us,  0.0%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  1.9%si,  0.0%st
      Cpu2  : 95.0%us,  0.0%sy,  0.0%ni,  5.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu3  : 94.9%us,  0.0%sy,  0.0%ni,  5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu4  :  0.6%us,  0.0%sy,  0.0%ni, 99.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu5  : 94.3%us,  0.0%sy,  0.0%ni,  5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu6  : 94.3%us,  0.0%sy,  0.0%ni,  5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu7  : 95.0%us,  0.0%sy,  0.0%ni,  5.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu8  : 94.4%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      ...
      Cpu17 : 94.3%us,  0.0%sy,  0.0%ni,  5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu18 : 94.3%us,  0.0%sy,  0.0%ni,  5.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu19 : 94.9%us,  0.0%sy,  0.0%ni,  5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu21 : 94.9%us,  0.0%sy,  0.0%ni,  5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu22 : 94.9%us,  0.0%sy,  0.0%ni,  5.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
  • 22. Search the WikiStats in MySQL
      10 most frequently queried wiki pages in January 2008
      mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
             FROM wikistats_by_day_spark
             where lower(url) not like '%special%'
               and lower(url) not like '%page%'
               and lower(url) not like '%test%'
               and lower(url) not like '%wiki%'
             group by lower(url)
             order by max_visits desc
             limit 10;
      +--------------------------------------------------------+------------+----------+
      | lurl                                                   | max_visits | count(*) |
      +--------------------------------------------------------+------------+----------+
      | heath_ledger                                           |    4247338 |      131 |
      | cloverfield                                            |    3846404 |      131 |
      | barack_obama                                           |    2238406 |      153 |
      | 1925_in_baseball#negro_league_baseball_final_standings |    1791341 |       11 |
      | the_dark_knight_(film)                                 |    1417186 |       64 |
      | martin_luther_king,_jr.                                |    1394934 |      136 |
      | deaths_in_2008                                         |    1372510 |       67 |
      | united_states                                          |    1357253 |      167 |
      | scientology                                            |    1349654 |      108 |
      | portal:current_events                                  |    1261538 |      125 |
      +--------------------------------------------------------+------------+----------+
      10 rows in set (1 hour 22 min 10.02 sec)
  • 23. Search the WikiStats in SparkSQL
      10 most frequently queried wiki pages in January 2008 (20 min)
      spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
                 USING org.apache.spark.sql.parquet
                 OPTIONS ( path "/ssd/wikistats_parquet_bydate" );
      Time taken: 3.466 seconds
      spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits, count(*)
                 FROM wikistats_parquet
                 where lower(url) not like '%special%'
                   and lower(url) not like '%page%'
                   and lower(url) not like '%test%'
                   and lower(url) not like '%wiki%'
                 group by lower(url)
                 order by max_visits desc
                 limit 10;
      heath_ledger                                            4247335  42
      cloverfield                                             3846400  42
      barack_obama                                            2238402  53
      1925_in_baseball#negro_league_baseball_final_standings  1791341  11
      the_dark_knight_(film)                                  1417183  36
      martin_luther_king,_jr.                                 1394934  46
      deaths_in_2008                                          1372510  38
      united_states                                           1357251  55
      scientology                                             1349650  44
      portal:current_events                                   1261305  44
      Time taken: 1239.014 seconds, Fetched 10 row(s)
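The two timings reported for the same query (slide 22 for MySQL, slide 23 for SparkSQL) can be put side by side with a short calculation. A minimal sketch using only the numbers printed on those slides; the variable names are illustrative.

```python
# Query times as printed on the slides.
mysql_seconds = 1 * 3600 + 22 * 60 + 10.02   # "1 hour 22 min 10.02 sec"
spark_seconds = 1239.014                     # "Time taken: 1239.014 seconds"

speedup = mysql_seconds / spark_seconds

print(round(mysql_seconds, 2))        # 4930.02
print(round(spark_seconds / 60, 1))   # 20.7 (the "20 min" on the slide)
print(round(speedup, 2))              # 3.98
```

On this data set the SparkSQL run over Parquet files was roughly four times faster than the MySQL query.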
  • 24. Apache Drill: treat any data source as a table (even if it is not). Querying MongoDB with SQL
  • 26. Recap…
      1. Search full dataset
          • May be pre-filtered
          • Not aggregated
      2. No parallelism
      3. Based on index?
      4. InnoDB <> Columnar
      5. Partitioning?

      1. Dataset is already
          – Filtered (only site="en")
          – Aggregated (group by url)
      2. Parallelism (+)
      3. Not based on index
      4. Columnar (+)
      5. Partitioning (+)