Scaling SolrCloud to a large number of Collections
Anshum Gupta
Lucidworks
• Anshum Gupta, Apache Lucene/Solr PMC member
and committer, Lucidworks Employee.
• Interested in search and related stuff.
• Apache Lucene since 2006 and Solr since 2010.
• Organizations I am or have been a part of:
Who am I?
Apache Solr is the most widely-used search
solution on the planet.
Solr has tens of thousands of
applications in production.
You use it every day.
8,000,000+
Total downloads
Solr is both established
and growing.
250,000+
Monthly downloads
2,500+
Open Solr jobs and the largest
community of developers.
Solr Scalability is unmatched
The traditional search use-case
• One large index distributed across multiple nodes
• A large number of users searching on the same data
• Searches happen across the entire cluster
“The limits of the possible can only be defined by
going beyond them into the impossible.”
- Arthur C. Clarke
• Analyze and find missing features
• Set up a performance testing environment on AWS
• Devise tests for stability and performance
• Find bugs and bottlenecks, and fix them
Analyze, measure, and optimize
• The SolrCloud cluster state has information about all collections,
their shards and replicas
• All nodes and (Java) clients watch the cluster state
• Every state change is notified to all nodes
• Limited to (slightly less than) 1MB by default
• In a 100-node cluster, a single node restart triggers a few hundred
watcher fires and pulls from ZK (three states: down, recovering and active)
Problem #1: Cluster state and updates
• Each collection gets its own state node in ZK
• Nodes watch only the states of the collections they
are a member of
• Clients cache state and use smart cache updates
instead of watching nodes
• http://issues.apache.org/jira/browse/SOLR-5473
Solution: Split cluster state and scale
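A rough sketch of why splitting the state helps, using the 100-node, three-state figures from the problem slide. The model is a deliberate simplification and not how Solr counts anything internally: it assumes each state transition fires exactly one watch per watching node, and the "2 collections spread over 3 nodes each" layout is a made-up example.

```python
from typing import Sequence


def watcher_fires_shared(num_nodes: int, transitions: int = 3) -> int:
    """Single shared cluster state: every node watches it, so each state
    transition (down, recovering, active) notifies every node."""
    return num_nodes * transitions


def watcher_fires_split(nodes_per_collection: Sequence[int], transitions: int = 3) -> int:
    """Per-collection state: a transition in a collection notifies only
    the nodes that host a replica of that collection."""
    return sum(n * transitions for n in nodes_per_collection)


# 100-node cluster; one node restarts and passes through three states.
shared = watcher_fires_shared(100)

# Hypothetical layout: the restarted node hosts replicas of 2 collections,
# each of which is spread over 3 nodes.
split = watcher_fires_split([3, 3])

print(shared, split)
```

Under these assumptions the shared state produces 300 notifications for one restart, while the split state produces 18, and the second number no longer grows with the size of the cluster.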
• Thousands of collections create a lot of state updates
• Overseer falls behind and replicas can’t recover or
can’t elect a leader
• Under high indexing/search load, GC pauses can
cause overseer queue to back up
Problem #2: Overseer Performance
• Optimize polling for new items in overseer queue -
Don’t wait to poll! (SOLR-5436)
• Dedicated overseer nodes (SOLR-5476)
• New Overseer Status API (SOLR-5749)
• Asynchronous and multi-threaded execution of
collection commands (SOLR-5477, SOLR-5681)
Solution - Improve the Overseer
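The Overseer improvements above surface in the Collections API. The sketch below only builds request URLs and sends nothing. The base URL, collection name, node name and request id are placeholders, and the action and parameter names (`async`, `REQUESTSTATUS`/`requestid`, `OVERSEERSTATUS`, `ADDROLE`/`role`/`node`) are given as recalled from the Collections API reference, so check them against the Solr version in use.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed base URL


def collections_api(action: str, params: dict) -> str:
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{SOLR}/admin/collections?{query}"


# Submit a long-running command asynchronously; the id is chosen by the caller.
submit = collections_api(
    "SPLITSHARD", {"collection": "tenant_a", "shard": "shard1", "async": "1000"}
)

# Poll for the outcome of that request later.
poll = collections_api("REQUESTSTATUS", {"requestid": "1000"})

# Inspect Overseer queue sizes and timings.
status = collections_api("OVERSEERSTATUS", {})

# Mark a node as a preferred (dedicated) overseer; the node name is a placeholder.
dedicate = collections_api("ADDROLE", {"role": "overseer", "node": "host1:8983_solr"})

for url in (submit, poll, status, dedicate):
    print(url)
```

Submitting with `async` returns immediately, so a client managing thousands of collections can queue many commands and poll for results instead of holding a connection open per command.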
• Not all users are born equal - A tenant may have a few very
large users
• We wanted to be able to scale an individual user’s data,
maybe even as its own collection
• SolrCloud could split shards with no downtime but it only splits
in half
• No way to ‘extract’ a user’s data to another collection or shard
Problem #3: Moving data around
• Shard can be split on arbitrary hash ranges (SOLR-5300)
• Shard can be split by a given key (SOLR-5338, SOLR-5353)
• A new ‘migrate’ API to move a user’s data to another (new)
collection without downtime (SOLR-5308)
Solution: Improved data management
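A minimal sketch of the three operations as Collections API parameters. `split_range` is a hypothetical helper, not part of Solr: it cuts a hash range into unequal pieces and assumes non-negative bounds and positive weights, whereas real shard ranges cover the full signed 32-bit space. The collection names and the `userA!` key are placeholders, and the parameter names (`ranges`, `split.key`, `target.collection`, `forward.timeout`) are as recalled from the Collections API reference.

```python
from urllib.parse import urlencode


def split_range(lo, hi, weights):
    """Split the inclusive range [lo, hi] into consecutive sub-ranges
    whose sizes are proportional to the given positive weights."""
    total = hi - lo + 1
    weight_sum = sum(weights)
    ranges, start, acc = [], lo, 0
    for i, w in enumerate(weights):
        acc += w
        # The last piece always ends at hi so that no hash value is lost.
        end = hi if i == len(weights) - 1 else lo + total * acc // weight_sum - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def to_ranges_param(ranges):
    """Format sub-ranges as comma-separated hexadecimal pairs."""
    return ",".join(f"{a:x}-{b:x}" for a, b in ranges)


# Split on arbitrary hash ranges: a 3:1 split instead of splitting in half.
ranges_param = to_ranges_param(split_range(0, 99, [3, 1]))
by_ranges = urlencode({"action": "SPLITSHARD", "collection": "tenants",
                       "shard": "shard1", "ranges": ranges_param})

# Split by a routing key so one user's documents get a shard of their own.
by_key = urlencode({"action": "SPLITSHARD", "collection": "tenants",
                    "split.key": "userA!"})

# Migrate that user's documents to another collection.
migrate = urlencode({"action": "MIGRATE", "collection": "tenants",
                     "split.key": "userA!", "target.collection": "userA_only",
                     "forward.timeout": "60"})

print(ranges_param)
```

The range helper is the only logic here; the rest is plain parameter assembly meant to show which knobs each JIRA issue added.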
• Lucene/Solr is designed for finding top-N search results
• Trying to export a full result set can bring down the system,
because memory requirements grow the deeper you page
Problem #4: Exporting data
Solution: Distributed deep paging
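The idea behind cursor-based deep paging can be shown with a small in-memory simulation. This is an illustration of the technique, not Solr's implementation: shards are modelled as sorted lists of unique ids and the cursor is simply the last id returned. In Solr itself the cursor is passed with the `cursorMark` request parameter and the response carries a `nextCursorMark`. With offset paging each shard would have to return `start + rows` candidates per request; with a cursor each shard returns at most `rows`.

```python
import heapq
from bisect import bisect_right
from itertools import islice


def shard_page(shard_docs, after, rows):
    """Return up to `rows` ids from one sorted shard that sort after the cursor."""
    start = 0 if after is None else bisect_right(shard_docs, after)
    return shard_docs[start:start + rows]


def cursor_pages(shards, rows):
    """Yield pages of globally sorted ids, asking each shard for at most
    `rows` candidates per page regardless of how deep the export is."""
    cursor = None
    while True:
        candidates = heapq.merge(*(shard_page(s, cursor, rows) for s in shards))
        page = list(islice(candidates, rows))
        if not page:
            return
        yield page
        cursor = page[-1]  # the next request resumes after the last id seen


shards = [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
pages = list(cursor_pages(shards, rows=4))
print(pages)
```

Because the next `rows` results overall must lie within each shard's next `rows` results after the cursor, the merge stays correct while per-request memory stays constant.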
• Performance goals: 6 billion documents, 4000 queries/sec, 400
updates/sec, 2 seconds NRT sustained performance
• 5% large collections (50 shards), 15% medium (10 shards), 85%
small (1 shard) with replication factor of 3
• Target hardware: 24 CPUs, 126G RAM, 7 SSDs (460G) + 1 HDD
(200G)
• 80% traffic served by 20% of the tenants
Testing scale at scale
Test Infrastructure
Logging
• Tim Potter wrote the Solr Scale Toolkit
• Fabric-based tool to set up and manage SolrCloud
clusters in AWS, bundled with collectd and SiLK
• Backup/Restore from S3. Parallel clone commands.
• Open source!
• https://github.com/LucidWorks/solr-scale-tk
How to manage large clusters?
• Lucidworks SiLK (Solr + Logstash + Kibana)
• collectd daemons on each host
• RabbitMQ to queue messages before delivering them to Logstash
• Initially started with Kafka but discarded it, thinking it was overkill
• Not happy with RabbitMQ: it crashed and was unstable
• Might try Kafka again soon
• http://www.lucidworks.com/lucidworks-silk
Gathering metrics and analyzing logs
• Custom randomized data generator (reproducible
using a seed)
• JMeter for generating load
• Embedded CloudSolrServer using JMeter Java
Action Sampler
• JMeter distributed mode was itself a bottleneck!
• Solr Scale Toolkit has some data generation code
Generating data and load
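A minimal sketch of a seeded generator combined with the 80/20 traffic skew listed in the performance goals. It is not the generator used in the tests: the document fields, tenant names and helper functions are invented for illustration.

```python
import random


def make_docs(seed, count, tenants):
    """Generate `count` pseudo-random documents; the same seed always
    yields the same documents, so a test run can be reproduced."""
    rng = random.Random(seed)
    return [
        {"id": f"doc-{i}", "tenant": rng.choice(tenants), "value": rng.randint(0, 999)}
        for i in range(count)
    ]


def pick_tenant(rng, tenants, hot_fraction=0.2, hot_traffic=0.8):
    """Pick the tenant for the next request: `hot_traffic` of all requests
    go to the first `hot_fraction` of the tenants."""
    hot_count = max(1, int(len(tenants) * hot_fraction))
    hot = tenants[:hot_count]
    cold = tenants[hot_count:] or tenants  # fall back if every tenant is hot
    return rng.choice(hot if rng.random() < hot_traffic else cold)


tenants = [f"tenant-{i}" for i in range(10)]
docs = make_docs(seed=42, count=5, tenants=tenants)

rng = random.Random(7)
picks = [pick_tenant(rng, tenants) for _ in range(10_000)]
hot_share = sum(p in tenants[:2] for p in picks) / len(picks)
print(round(hot_share, 2))
```

Keeping the random source in an explicit `random.Random(seed)` instance, rather than the module-level functions, is what makes a failing run replayable with the same data and the same request mix.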
• 30 hosts, 120 nodes, 1000 collections, 6B+ docs,
15000 queries/second, 2000 writes/second, 2 second
NRT sustained over 24 hours
• More than 3x the numbers we needed
• Unfortunately, we had to stop testing at that point :(
• Our biggest cluster cost us just $120/hour :)
Numbers
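The "more than 3x" figure can be checked against the stated performance goals. This small calculation assumes the "writes/second" measured here corresponds to the "updates/sec" goal.

```python
# Goals and measured results as stated on the slides.
goals = {"queries_per_sec": 4000, "updates_per_sec": 400}
achieved = {"queries_per_sec": 15000, "updates_per_sec": 2000}

# Ratio of achieved throughput to the goal, per metric.
headroom = {metric: achieved[metric] / goals[metric] for metric in goals}
print(headroom)
```

Queries come out at 3.75x the goal and updates at 5x, so the query rate is the tighter of the two margins.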
• Jepsen tests
• Improvement in test coverage
After those tests
• We continue to test performance
at scale
• Published indexing performance
benchmark, working on others
• 15 nodes, 30 shards, 1 replica,
157195 docs/sec
• 15 nodes, 30 shards, 2
replicas, 61062 docs/sec
• Setting up an internal
performance testing environment
• Jenkins CI
• Single node benchmarks
• Cloud tests
• Stay tuned!
And it still goes on…
Pushing the limits
• SolrCloud continues to be improved
• SOLR-6816 - Review SolrCloud Indexing Performance.
• SOLR-6220 - Replica placement strategy
• SOLR-6273 - Cross data center replication
• SOLR-5750 - Backup/Restore API for SolrCloud
• SOLR-7230 - An API to plug security into Solr
• Many, many more
Not over yet
Connect @
http://www.twitter.com/anshumgupta
http://www.linkedin.com/in/anshumgupta/
anshum@apache.org

Editor's Notes

  • #8: Other use cases, different from the general one. Large setup
  • #9: Our plan
  • #19: Didn’t use Zabbix, as JMX wasn’t really useful for us. Used RabbitMQ instead of Kafka
  • #20: collectd daemon on each of the hosts
  • #21: i2.4xlarge machines
  • #29: 10 x r3.2xlarge nodes, each running 1 instance of Solr. Solr 4.8.1 vs. 5: 35k vs. 75k docs/s (130 million docs)