Running a Realtime Stats Service on MySQL Cybozu Labs, Inc. Kazuho Oku
Background Apr. 23 2009 Running Realtime Stats Service on MySQL

Who am I?
- Name: Kazuho Oku (奥 一穂)
- Original developer of Palmscape / Xiino
  - the oldest web browser for Palm OS
- Working at Cybozu Labs since 2005
  - research subsidiary of Cybozu, Inc.
  - Cybozu is a leading groupware vendor in Japan
- My weblog: tinyurl.com/kazuho

Introduction to Pathtraq

What is Pathtraq?
- Started in Aug. 2007
- Web ranking service
  - one of Japan's largest
  - ~10,000 users submit access information
  - ~1,000,000 access records per day
  - like Alexa, but semi-realtime and per-page

What is Pathtraq? (cont'd)
- Automated social news service
  - find what's hot
  - like Google News + Digg
  - calculate relevance from access stats
- Search by...
  - no filtering (the whole Internet)
  - category
  - keyword
  - URL (per-domain, etc.)

How to Provide Real-time Analysis?
- Data set (as of Apr. 23 2009)
  - # of URLs: 147,748,546
  - # of total accesses: 413,272,527
- Sharding is not a good option, since we need to join the tables and aggregate
  - prefix-search by URL or search by keyword, then join with the access data table
- Core tables should be stored in RAM
  - not on HDD, due to lots of random access

Our Decision was to...
- Keep URLs and access stats in RAM
  - compression for size and speed
- Create a new message queue
- Limit pre-computation load
- Create our own cache, with locks to minimize database access
- Put the fulltext-search database on SSD

Our Servers
- Main server
  - Opteron 2218 x2, 64GB memory
  - MySQL, Apache
- Fulltext search server
  - Opteron 240EE, 2GB memory, Intel SSD
  - MySQL (w. Tritonn/Senna)
- Helper servers
  - for content analysis
  - for screenshot generation

The Long Tail of the Internet
- y = C · x^(-0.44)
- # of URLs with 1/10 the hits: x2.75
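
The x2.75 figure follows from the exponent. A small check, assuming the fitted curve is read as "number of URLs (y) that receive x hits" (that reading is an interpretation of the slide, and the function name is hypothetical):

```python
# Exponent quoted on the slide: y = C * x^(-0.44)
EXPONENT = -0.44

def url_count_ratio(hit_ratio, exponent=EXPONENT):
    """Factor by which the number of URLs changes when hits scale by hit_ratio.

    The constant C cancels out, so only the exponent matters.
    """
    return hit_ratio ** exponent

# URLs that receive one tenth of the hits are about 2.75 times as numerous.
print(round(url_count_ratio(0.1), 2))
```

Under that reading, every tenfold drop in popularity multiplies the number of URLs to store by roughly 2.75, which is why the rarely visited tail dominates the table size.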

Compressing URLs
- The challenges:
  - URLs are too short for gzip, etc.
  - URLs should be prefix-searchable in compressed form
  - how to run like 'http://www.mysql.com/%' on a compressed URL?
- The answer: Static PPM + Range Coder

Static PPM
- PPM: Prediction by Partial Matching
  - What is the next character after ".co"?
  - The answer is "m"!
- PPM is used by 7-zip, etc.
- Static PPM is PPM with a static probabilistic model
  - many URLs (or English words) have common patterns
  - suitable for short texts (like URLs)
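
A minimal sketch of the idea, assuming a fixed context length of three characters and a tiny hypothetical training set. Real PPM also falls back to shorter contexts when a context is unseen, which is omitted here, and none of the names below come from the Pathtraq code:

```python
from collections import Counter, defaultdict

def build_model(urls, order=3):
    """Count which character follows each `order`-character context.

    "Static" means the table is built once from existing URLs and then
    frozen, so compressor and decompressor share the same statistics.
    """
    model = defaultdict(Counter)
    for url in urls:
        for i in range(order, len(url)):
            model[url[i - order:i]][url[i]] += 1
    return model

def predict(model, context):
    """Return (most likely next character, estimated probability) or None."""
    counts = model.get(context)
    if not counts:
        return None
    char, n = counts.most_common(1)[0]
    return char, n / sum(counts.values())

# Hypothetical training data, for illustration only.
SAMPLE_URLS = [
    "http://www.mysql.com/",
    "http://example.com/",
    "http://example.com/about",
    "http://example.co.jp/",
]

model = build_model(SAMPLE_URLS)
print(predict(model, ".co"))
```

With this sample, ".co" is followed by "m" in three of four URLs, so the model predicts "m" with probability 0.75. A confident prediction like this is what lets the coder spend less than one bit on the character.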

Range Coder
- A fast variant of arithmetic coding
  - similar to Huffman encoding, but better
  - if the probability of the next character being "m" is 75%, it is encoded into 0.42 bits
- Compressed strings preserve the sort order of the uncompressed form
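
Both properties can be illustrated with a toy arithmetic coder. This sketch uses exact fractions and a fixed per-character frequency table instead of the PPM model and the finite-precision byte output of a real range coder; the end-of-string symbol and the frequencies are assumptions made for the example:

```python
import math
from fractions import Fraction

END = ""  # hypothetical end-of-string symbol; sorts before every character

def make_table(freqs):
    """Give each symbol a slice [low, high) of [0, 1), in sorted symbol order."""
    total = sum(freqs.values())
    table, low = {}, Fraction(0)
    for sym in sorted(freqs):
        high = low + Fraction(freqs[sym], total)
        table[sym] = (low, high)
        low = high
    return table

def encode(text, table):
    """Return the lower bound of the interval that identifies `text`."""
    low, width = Fraction(0), Fraction(1)
    for sym in list(text) + [END]:
        sym_low, sym_high = table[sym]
        low += width * sym_low          # move into the symbol's slice
        width *= sym_high - sym_low     # likely symbols shrink the interval less
    return low

# Hypothetical static frequencies, not the real prediction table.
TABLE = make_table({END: 1, "a": 3, "b": 2, "c": 2})

words = ["cab", "ab", "b", "abc", "a", "ca"]
by_text = sorted(words)
by_code = sorted(words, key=lambda w: encode(w, TABLE))
print(by_text == by_code)

# Cost of a symbol predicted with probability 0.75: about 0.415 bits.
print(-math.log2(0.75))
```

Because the slices are laid out in symbol order, a string that sorts earlier always lands in a lower interval, so comparing codes gives the same result as comparing the original strings. That is the property an index on the compressed column relies on.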

Create Compression Functions
- Build a prediction table from the stored URLs
- Implement a range coder
  - took an open-source implementation and optimized it
  - the original implementation added some unnecessary bits at the tail
  - use SSE instructions for faster operation
  - coderepos.org/share/browser/lang/cplusplus/range_coder
- Link the coder and the table to create MySQL UDFs

Rewriting the Server Logic
- Change the schema
    url varchar(255) not null  # with unique index
    ↓
    urlc varbinary(767) not null  # with unique index
- Change the prefix-search form
    url like 'http://example.com/%'
    ↓
    url_compress('http://example.com/') <= urlc and urlc < url_compress('http://example.com0')
- Note: "0" is the character that follows "/" in ASCII
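
The bounds of the range query can be derived mechanically from the prefix. A sketch on uncompressed strings, assuming an ASCII prefix whose last character is not the highest possible code point (the helper name is hypothetical); the same bounds are then passed through url_compress, which works because compression preserves sort order:

```python
def prefix_range(prefix):
    """Return (lower, upper) so that s starts with prefix iff lower <= s < upper."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Bump the last character to its successor: "/" (0x2F) becomes "0" (0x30).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

lower, upper = prefix_range("http://example.com/")
print(lower, upper)

for url in ["http://example.com/a", "http://example.com.evil/", "http://example.com0x"]:
    # Range test and startswith() agree for every URL.
    print(url, lower <= url < upper, url.startswith(lower))
```

Turning the LIKE pattern into a half-open range is what allows the unique index on the compressed column to serve prefix searches.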

Compression Ratio
- Compression ratio: 37%
- Size of prediction table: 4MB
- Benchmark of the compression functions
  - compression: 40MB/sec. (570k URLs/sec.)
  - decompression: 19.3MB/sec. (280k URLs/sec.)
  - fast enough, since URLs are searchable in compressed form
- Prefix search became faster
  - shorter indexes lead to faster operation

Re: InnoDB Compression
- URL compression can coexist with InnoDB compression
  - though we aren't using InnoDB compression in our production environment

    Compression          Table Size
    N/A                  100%
    URL compression      57%
    InnoDB compression   50%
    using both           33%

Compressing the Stats Table
- Used to have two int columns: at, cnt
- It was a waste of space, since...
  - most cnt values are very small numbers
  - most accesses to each URL occur within a short period (e.g. the day the blog entry was written)
  - the at field should be part of the indexes

    at (hours since epoch)   cnt (# of hits)
    330168                   1
    330169                   2
    330173                   1
    330197                   1

Compressing the Stats Table (cont'd)
- Merge the rows into a sparse array
  - the example on the previous slide becomes: (offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
- Then compress the array
  - the example becomes a blob of 8 bytes
  - originally it was 8 bytes x 4 rows, plus the index
- And store the array in a single column
  - fewer rows lead to a smaller table and faster access

Compressing the Stats Table (cont'd)
- Write MySQL UDFs to access the sparse array
    cnt_add(column,at,cnt)       -- adds cnt at the given index (at)
    cnt_between(column,from,to)  -- returns # of hits between the given hours
    and more...
- We use int[N] arrays for vectorized calculation
  - especially when creating access charts
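
The behaviour of the two UDFs can be sketched on an uncompressed array. This is an illustration only: the real functions are MySQL UDFs working on a compressed blob, and the tuple layout (newest hour as offset, counts stored newest first) and the inclusive bounds of cnt_between are assumptions:

```python
def cnt_add(arr, at, cnt):
    """Return a new (offset, counts) array with `cnt` hits added at hour `at`.

    offset is the newest hour stored; counts[i] holds the hits at hour offset - i.
    """
    if arr is None:
        return (at, [cnt])
    offset, counts = arr
    counts = list(counts)
    if at > offset:
        # A newer hour: pad with zeros at the front and move the offset.
        counts = [0] * (at - offset) + counts
        offset = at
    idx = offset - at
    if idx >= len(counts):
        # An older hour than any stored so far: pad with zeros at the back.
        counts.extend([0] * (idx - len(counts) + 1))
    counts[idx] += cnt
    return (offset, counts)

def cnt_between(arr, start, end):
    """Sum the hits for hours in [start, end], both bounds inclusive (assumed)."""
    if arr is None:
        return 0
    offset, counts = arr
    return sum(c for i, c in enumerate(counts) if start <= offset - i <= end)

# The four rows of the example table, merged into one array.
arr = None
for at, cnt in [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]:
    arr = cnt_add(arr, at, cnt)

print(arr[0], len(arr[1]))
print(cnt_between(arr, 330168, 330169))
print(cnt_between(arr, 330168, 330197))
```

The merged array covers 30 consecutive hours and is mostly zeros, which is why it compresses into a few bytes.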

Create a new Message Queue

Q4M
- A simple, reliable, fast message queue
  - runs as a pluggable storage engine of MySQL
  - GPL license; q4m.31tools.com
  - presented yesterday at MySQL Conference :-p
  - slides at tinyurl.com/q4m2009
- Used for relaying messages between our servers

Limiting Pre-computation Load

Limit # of CPU-intensive Pre-computations
- Use cron & setlock
  - setlock is part of daemontools by djb
  - setlock serializes processes by using flock
  - -n option: use trylock; if locked, do nothing

    # use only one CPU core for pre-computation
    */2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
    5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data
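
The trylock behaviour of the -n option can be sketched with an advisory file lock. This is not setlock itself, only an illustration of the same idea on a Unix-like system; the function name and the temporary lock path are hypothetical:

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Try to take an exclusive advisory lock on `path` without blocking.

    Returns the open file object (keep it open to hold the lock),
    or None if the lock is already held elsewhere.
    """
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f

lock_path = os.path.join(tempfile.mkdtemp(), "tasks.lock")
first = try_lock(lock_path)    # gets the lock
second = try_lock(lock_path)   # lock is busy, so this gives up immediately
print(first is not None, second is None)
first.close()                  # closing the file releases the lock
```

A job started this way simply skips its run while another job holds the lock, whereas a job started without -n waits for the lock. Either way, at most one pre-computation runs at a time.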
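
Limit # of Disk-intensive Pre-computations
- Divide the pre-computation into blocks, and sleep depending on the elapsed time

    use List::Util qw(max);
    use Time::HiRes qw(time sleep);  # sub-second timing and sleeps

    my $LOAD = 0.25;                 # target fraction of time spent working
    while (1) {
        my $start = time();
        precompute_block();
        # rest long enough that the work takes only $LOAD of wall-clock time
        sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
    }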
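
The sleep length follows from the target load: if a block took t seconds and work should occupy a fraction LOAD of the time, the pause must be t * (1 - LOAD) / LOAD. A quick check of that arithmetic, with a hypothetical helper name:

```python
def throttle_sleep(elapsed, load):
    """Seconds to pause so that work takes up `load` of wall-clock time."""
    if not 0 < load <= 1:
        raise ValueError("load must be in (0, 1]")
    # Guard against a negative elapsed time, e.g. after a clock adjustment.
    return max(elapsed, 0.0) * (1 - load) / load

work = 2.0
pause = throttle_sleep(work, 0.25)
print(pause, work / (work + pause))
```

A block that took 2 seconds is followed by a 6 second pause, so the disk is busy for 2 of every 8 seconds, which is the 25% target.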

Creating our own Cache System

The Problem
- Query cache is flushed on table update
  - access stats can be (and should be) cached for a certain period
- Memcached has a thundering-herd problem
  - all clients try to read the database when a cached entry expires
  - critical for us, since our queries do joins, aggregations, and sort operations

Swifty and KeyedMutex
- Swifty is an mmap-based cache
  - cached data is shared between processes
  - lock-free on read, flock on write
  - notifies a single client that the accessed entry is going to expire within a few seconds
  - the notified client can start updating the cache entry before it expires
- KeyedMutex
  - a daemon used to block multiple clients from issuing the same SQL query
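
The early-expiry notification can be sketched as follows. This is a single-process toy, not Swifty's mmap-based implementation or its API; the class, its parameters and the explicit `now` argument are assumptions made so the behaviour is easy to follow:

```python
class EarlyRefreshCache:
    """Toy cache that asks exactly one reader to refresh an entry before it expires."""

    def __init__(self, ttl, refresh_window):
        self.ttl = ttl                        # lifetime of an entry, in seconds
        self.refresh_window = refresh_window  # how early to ask for a refresh
        self._entries = {}                    # key -> [value, expires_at, claimed]

    def set(self, key, value, now):
        self._entries[key] = [value, now + self.ttl, False]

    def get(self, key, now):
        """Return (value, should_refresh); value is None on a miss or after expiry."""
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return None, True
        value, expires_at, claimed = entry
        if not claimed and now >= expires_at - self.refresh_window:
            entry[2] = True      # only the first reader in the window is notified
            return value, True
        return value, False

cache = EarlyRefreshCache(ttl=60, refresh_window=5)
cache.set("top_pages", ["a", "b"], now=0)
print(cache.get("top_pages", now=10))   # fresh: no refresh needed
print(cache.get("top_pages", now=56))   # first reader near expiry: asked to refresh
print(cache.get("top_pages", now=57))   # later readers keep using the cached value
```

While the notified reader recomputes the entry, everyone else is still served from the cache, so the expensive query runs once rather than once per client. On a true miss every caller is told to refresh, and that remaining case is where a per-key mutex such as KeyedMutex would let one client run the query while the others wait.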

Swifty and KeyedMutexd (cont'd)
- Source code is available:
  - coderepos.org/share/browser/lang/c/swifty
  - coderepos.org/share/browser/lang/perl/Cache-Swifty
  - coderepos.org/share/browser/lang/perl/KeyedMutex

Fulltext-search on SSD

Senna / Tritonn
- Senna is an FTS engine popular in Japan
  - might not work well with European languages
- Tritonn is a replacement for MyISAM FTS
  - uses Senna as its backend
  - faster than MyISAM FTS
- Wrote patches to support SSD during our transition from RAM to SSD
  - patches accepted in Senna 1.1.4 / Tritonn 1.0.12

FTS: RAM-based vs. SSD-based
- Size of FTS data: ~20GB
- Downgraded the hardware to see if SSD-based FTS is feasible
- Speed dropped to ¼, but search latency is well below one second

              Old Hardware               New Hardware
    CPU       Opteron 2218 (2.6GHz) x2   Opteron 240 (1.4GHz)
    Memory    32GB                       2GB
    Storage   7200rpm SATA HDD           SSD (Intel X25-M)

Summary
- Use UDFs for optimization
- Sometimes it is easier to scale UP
  - esp. when you can estimate your data growth
- Use SSD for FTS
  - Baidu (China's leading search engine) uses SSD
- Most of the things introduced are OSS
  - we plan to open-source our URL compression table as well

We are Looking for...
- If you are interested in localizing Pathtraq to your country, please contact us
- We do not have resources outside of Japan
  - to translate the web interface
  - to ask people to install our browser extension
  - to follow local regulations, etc.

Thank you for listening
tinyurl.com/kazuho
