Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain
Building a Large Scale SEO/SEM 
Application with Apache Solr 
Rahul Jain 
Freelance Big-data/Search Consultant 
@rahuldausa 
dynamicrahul2020@gmail.com
About Me… 
• Freelance Big-data/Search Consultant based out of Hyderabad, India 
• Provide Consulting services and solutions for Solr, Elasticsearch and other Big data 
solutions (Apache Hadoop and Spark) 
• Organizer of two Meetup groups in Hyderabad 
• Hyderabad Apache Solr/Lucene 
• Big Data Hyderabad
What I am going to talk about
Sharing our experience of working on Search in this application:
• The issues we faced and the lessons we learned
• How we do Database Import and Batch Indexing
• Techniques to scale and improve Search latency
• The System Architecture
• Some tips for tuning Solr
• Q/A
What does the Application do 
§ Keyword Research and Competitor Analysis Tool for SEO (Search Engine Optimization) and SEM
(Search Engine Marketing) professionals
§ End users search for a keyword or a domain and get all the insights about it.
§ Aggregates data for the top 50 results of Google and Bing across 3 countries for 80 million+ keywords.
§ Provides key metrics such as keywords, CPM (Cost per mille), CPC (Cost per click), competitors’ details, etc.
[Diagram labels: Data Collection (Web crawling, Ad Networks APIs, Databases); Data Processing & Aggregation]
*All trademarks and logos belong to their respective owners.
Technology Stack
High level Architecture 
[Diagram components: Internet, Load Balancer (HAProxy), Web Server Farm (PHP App on Nginx), App Server (Tomcat), Managed Cache, Cache Cluster (Redis), Database (MySQL), Cluster Manager (Custom using Zookeeper), Search Head (Solr), Apache Solr servers. The request path is numbered 1 to 8; labelled steps include "Ids lookup", "Cache" and "Fetch cluster Mapping for which month’s cluster".]
Search Head:
• Is a Solr Server which does not contain any data.
• Makes a Distributed Search request and aggregates the Search Results.
• Also works as a Load Balancer for search queries.
Search - Key challenges 
§ After processing, we have ~40 billion records every month in the MySQL database,
including:
§ 80+ million Keywords
§ 110+ million Domains
§ 1 billion+ URLs
§ Multiple keywords for a single URL and vice versa
§ Multiple tables varying in size from 50 million to 12 billion records
§ Search is a core functionality, so all records (selected fields) must be indexed in Solr
§ Page load time (including all 24 widgets, max) < 1.5 sec (Critical)
§ But… we need to load this data only once every month for all countries, so we can
do Batch Indexing, and as this data never changes, we can apply caching.
Making Data Import and Batch Indexing Faster
Data Import from MySQL to Solr 
• Solr’s DataImportHandler is awesome but quickly becomes pretty slow for large volumes.
• We wrote our own Custom Data Importer that reads (pulls) documents from the Database and pushes them (asynchronously) into
Solr.
[Diagram components: Database table, Data Importer (Custom), three Solr servers]
Table: ID (Primary/Unique Key with Index) | Columns
1 | Record 1
2 | Record 2
…
5000 | Record 5000
*6000 | Record 6000
…
n | Record n
*Non-sequential ID
• Rather than using the "limit" function of the Database, the Importer queries by range of IDs (Primary Key):
"select * from table t where ID >= 1 and ID <= 2000" (Batch 1-2000)
"select * from table t where ID >= 2001 and ID <= 4000" (Batch 2001-4000)
• The Importer fetches 10 batches at a time from the MySQL database, each having 2k records. Each call is stateless.
• The Importer combines these database batches into a bigger batch (10k documents) and flushes it to the selected Solr servers asynchronously in a round-robin fashion.
Downside:
• The table must have a primary key, and creating it in the database can be slow.
• This approach requires more calls if the IDs are not sequential.
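The range-based import described on this slide can be sketched roughly as follows. This is only an illustration, not the importer from the talk: the function names, the table name and the server names are hypothetical, and the `%s` placeholders assume a MySQL driver that uses that parameter style.

```python
from itertools import cycle


def id_ranges(min_id, max_id, batch_size):
    """Yield inclusive (start, end) primary-key ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def range_query(table, start, end):
    """Build a stateless range query; the table name is assumed to be a trusted constant."""
    sql = f"SELECT * FROM {table} WHERE id >= %s AND id <= %s"
    return sql, (start, end)


def assign_round_robin(batches, servers):
    """Pair each batch with a Solr server, cycling through the servers in order."""
    return [(server, batch) for batch, server in zip(batches, cycle(servers))]


# Three 2k-record ranges spread over two hypothetical Solr servers
ranges = list(id_ranges(1, 6000, 2000))
plan = assign_round_robin(ranges, ["solr-1", "solr-2"])
```

Because each range depends only on its start and end ID, a failed batch can be retried on its own. With non-sequential IDs some ranges return fewer rows than the batch size, which is the extra-calls downside noted above.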
Batch Indexing
Indexing All tables into a Single Big Index
• All tables in the same Index, distributed across multiple Solr cores
and Solr servers (Java processes)
• Commit every 120 million records or every 15 minutes,
whichever comes first
• Disabled soft commits and updates (overwrite=false), as
each call to addDocument calls updateDocument under
the hood
• But still, indexing was slow (because it was sequential across all
tables) and we had to stop it after 2 days.
• Search was also awfully slow (on the order of minutes)
[Screenshot annotations: "From cache, after warm-up"; "Bunch of shards, ~100"]
Creating a Distributed Index for each table 
How many shards?
• Tables vary in size from 50 million to
12 billion records.
• If we choose 100 million records per shard (core), a 12 billion record table means we
need to query 120 shards, which is awfully slow.
• On the other hand, if we choose 500 million per shard, a table with 500 million
records will have only 1 shard: high latency, high memory usage
(Heap) and no distributed search*.
• Hybrid Approach: determine the number of shards based on the max
number of records in the table.
• We benchmarked to find the sweet spot for max documents
(records) per shard that gives the best Search latency.
• Commit at the end for each table.
Records/Max Shards Table
Max Number of Records in table | Max number of Shards (cores) Allowed
<100 million | 1
100-300 million | 2
<500 million | 4
<1 billion | 6
1-5 billion | 8
>5 billion | 16
* Distributed Search improves latency but may not always be faster, as search latency is limited by the time taken by the slowest shard to respond.
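The Records/Max Shards table can be read as a simple lookup. The sketch below is one possible encoding. The function name is hypothetical, and the handling of exact boundary values (for example, exactly 500 million records) is an assumption, since the slide does not say which tier they fall into.

```python
MILLION = 1_000_000
BILLION = 1_000_000_000


def max_shards(num_records):
    """Return the max number of shards (cores) allowed for a table of this size."""
    if num_records < 100 * MILLION:
        return 1
    if num_records <= 300 * MILLION:   # "100-300 million" read as inclusive
        return 2
    if num_records < 500 * MILLION:
        return 4
    if num_records < 1 * BILLION:
        return 6
    if num_records <= 5 * BILLION:     # "1-5 billion" read as inclusive
        return 8
    return 16


# The smallest and largest tables mentioned in the talk
smallest = max_shards(50 * MILLION)
largest = max_shards(12 * BILLION)
```

Under this reading, the 12 billion record table is queried across at most 16 shards instead of the 120 that a fixed 100 million per shard would need.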
It worked fine, but one day suddenly…
java.lang.OutOfMemoryError:
Java heap space
• All Solr servers crashed.
• We restarted them, but they kept crashing randomly every other day.
• We took a heap dump and realized that it was due to the Field Cache.
• Found a solution: Doc Values. We have never had this issue again to date.
Doc Values (Life Saver) 
• Disk-based Field Data, a.k.a. Doc Values
• Document-to-value mapping built at index time
• Stores Field values on Disk (or in Memory) in a column-stride fashion
• Better compression than Field Cache
• Replacement for Field Cache (though not a complete one)
• Quite suitable for Custom Scoring, Sorting and Faceting
References:
• Old article (but good): http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
• https://cwiki.apache.org/confluence/display/solr/DocValues
• http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/doc-values.html
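Doc Values are enabled per field in the Solr schema. As a rough illustration, the sketch below builds an `add-field` request body for Solr's Schema API with `docValues` switched on. It assumes a managed schema, which the talk does not mention (the original setup may well have used a hand-edited schema.xml). The field name `cpc` and the type name `double` are hypothetical and depend on the schema in use.

```python
import json


def docvalues_field(name, field_type):
    """Build a Schema API 'add-field' command for a field with docValues enabled."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,   # must match a fieldType defined in the schema
            "indexed": True,
            "stored": False,
            "docValues": True,    # column-stride storage built at index time
        }
    }


# JSON body that would be POSTed to the collection's /schema endpoint
payload = docvalues_field("cpc", "double")
body = json.dumps(payload)
```

Turning on `docValues` for an existing field requires reindexing, because the document-to-value mapping is built at index time.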
Scaling and Making Search Faster…
Partitioning 
• 3-level partitioning: by Month, Country and Table name
• Each Month has its own Cluster and a Cluster Manager.
• Latency and Throughput are a trade-off; you can’t have both at the same time.
[Diagram components: Internet, Load Balancer, Web server Farm, App Servers, Search Heads (US, UK, AU), Master Cluster Manager, Cluster 1 for a month (e.g. Jan) and Cluster 2 for another month (e.g. Feb), each with Solr cores on Node 1 … Node n and a pair of Cluster Managers labelled A (Active) and P (Passive) with "Real time Sync".]
• Fetch the Cluster Mapping and make a request to the Search Head with the respective Solr cores for that Country, Month and Table.
• 1 user, 24 UI widgets, 24 Ajax requests, 41 search requests
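The lookup step above (fetch the Cluster Mapping, then query the Search Head with the cores for a given Month, Country and Table) might look roughly like this. The mapping layout, host names and core names are invented for illustration. Only the comma-separated `shards` request parameter is standard Solr distributed search.

```python
# Hypothetical cluster mapping: (month, country, table) -> list of Solr core addresses
CLUSTER_MAPPING = {
    ("2014-01", "US", "keywords"): [
        "node1:8983/solr/us_keywords_1",
        "node2:8983/solr/us_keywords_2",
    ],
    ("2014-01", "UK", "keywords"): [
        "node1:8983/solr/uk_keywords_1",
    ],
}


def shards_param(mapping, month, country, table):
    """Return the value for Solr's 'shards' parameter for one partition."""
    cores = mapping[(month, country, table)]
    return ",".join(cores)


def search_params(mapping, month, country, table, query):
    """Build the request parameters sent to the Search Head for a distributed query."""
    return {"q": query, "shards": shards_param(mapping, month, country, table)}


params = search_params(CLUSTER_MAPPING, "2014-01", "US", "keywords", "keyword:shoes")
```

Keeping the mapping outside the Search Head means a new month's cluster can go live by publishing a new mapping rather than reconfiguring Solr.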
Index Optimization Strategy 
• Running optimization on ~200+ Solr cores is very time-consuming.
• Solr cores with a bigger Index size (~70GB) have 2500+ segments due to the higher Merge Factor used while indexing.
• It can’t be run in parallel on all cores of a single machine, as it is heavily dependent on CPU and Disk I/O.
• Optimizing large segments into a very small number is very time-consuming and can take up to 3x the Index size on Disk.
• On the other hand, a small number of segments improves performance drastically, so we need to strike a balance.
[Diagram components: Staging Cluster Manager, Solr Optimizer, Production Cluster Manager, Solr cores on Node 1 and Node 2]
• The Solr Optimizer fetches the Cluster Mapping (list of all cores).
• Once optimization and cache warm-up are done, it pushes the Cluster Mapping to the Production Cluster Manager, making all Indices live.
• Optimizing a Solr core into a very small number of segments takes a huge amount of time, so we do it iteratively.
Algo:
• Choose max 3 cores on a machine to optimize in parallel. Start with the smallest Index.
• Determine the Max Segments Allowed from the Index Size and Number of docs.
• Reduce Segments to *.90 in each run (Current Segments, then Segments after optimization).
*As per our observation for our data, the optimization process takes ~42-46 seconds per 1GB. We need to do it for 4.6TB (including all boxes), the total Solr Index size for a single month.
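The iterative reduction can be sketched as a schedule of segment-count targets, each 90% of the previous one, stopping at the allowed maximum. This is a minimal sketch with hypothetical function names; how the Max Segments Allowed value is derived from index size and document count is not given in the slides, so it is taken as an input here. Each target would then be passed to a Solr optimize call with a maximum segment count.

```python
def segment_targets(current, max_allowed, keep_percent=90):
    """Return successive segment-count targets, shrinking by keep_percent each run."""
    targets = []
    while current > max_allowed:
        # Integer arithmetic avoids floating-point rounding surprises
        nxt = max(max_allowed, current * keep_percent // 100)
        targets.append(nxt)
        current = nxt
    return targets


def optimize_hours(index_gb, seconds_per_gb):
    """Estimate total optimization time in hours if run serially at a fixed rate."""
    return index_gb * seconds_per_gb / 3600


schedule = segment_targets(10, 7)
low = optimize_hours(4.6 * 1024, 42)
high = optimize_hours(4.6 * 1024, 46)
```

At the observed ~42-46 seconds per GB, and assuming 1 TB = 1024 GB, 4.6TB works out to roughly 55 to 60 hours of optimization if done serially, which is why limiting parallelism per machine and choosing the order of cores matter.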
Finally after optimization and cache warm-up… 
A shard looks like this.
[Screenshot annotation: "Max Segments after optimization"]
External Caching 
• In Distributed search, for a repeated query request, all Solr servers
need to be hit, even though the result is served from Solr’s cache. This
increases search latency and lowers throughput.
• Solution: cache the most frequently accessed query results in the app layer
(LRU-based eviction)
• We use Redis for caching
• All complex aggregation query results, once fetched from multiple
Solr servers, are served from the cache on subsequent requests.
Why Redis…
• Advanced in-memory key-value store
• Insanely fast
• Response time on the order of 5-10ms
• Provides cache behavior (set, get) with advanced data structures like
hashes, lists, sets, sorted sets, bitmaps, etc.
• http://redis.io/
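The app-layer caching idea can be illustrated with a small in-process LRU cache. This stands in for Redis purely to show the pattern (check the cache, fall back to the distributed search, store the result). The class and function names are hypothetical and this is not the code used in the application.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache with set/get semantics."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry


def cached_search(cache, query, run_search):
    """Serve a query from the cache, hitting the Solr servers only on a miss."""
    result = cache.get(query)
    if result is None:
        result = run_search(query)
        cache.set(query, result)
    return result


calls = []


def fake_search(query):
    # Stand-in for a distributed Solr request
    calls.append(query)
    return {"query": query, "hits": 3}


cache = LRUCache(capacity=2)
first = cached_search(cache, "keyword:shoes", fake_search)
second = cached_search(cache, "keyword:shoes", fake_search)
```

Since the data changes only once a month, cached results stay valid for a long time, and a repeated query never has to fan out to every shard.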
Hardware 
• We use bare-metal, dedicated servers for Solr for the following reasons:
1. Performance gain (with virtual servers, performance dropped by ~18-20%)
2. Better value of computing power per $ spent
• 2.6GHz, 32 cores (4x8 core), 384GB RAM, 6TB SAS 15k (RAID10)
• 2.6GHz, 16 cores (2x8 core), 192GB RAM, 4TB SAS 15k (RAID10)
• Since the Index size is 4.6TB/month, we want to cache more data in the Disk Cache with bigger RAM.
SSD vs SAS
1. SSD: peak indexing rate (MySQL to Solr): 330k docs/sec (each doc: ~100-125 bytes)
2. SAS 15k: 182k docs/sec (dropped by ~45%)
3. SAS 15k is considerably cheaper than SSD for bigger disks.
4. We are using SAS 15k as it is cost-effective, but we plan to move to SSD in the future.
Conclusion : Key takeaways 
General:
• Understand the characteristics of the data and partition it well.
Cache:
§ Spend time analyzing cache usage and tune the caches. Cached responses can be 10x-50x faster.
§ Use Filter Query (fq) wherever possible, as it improves performance through the Filter cache.
GC:
§ Keep your JVM heap size low (in proportion to the machine’s RAM), leaving enough RAM for the kernel, as a bigger
heap will lead to frequent GC. A 4GB to 8GB heap is quite a good range, but we use 12GB/16GB.
§ Keep an eye on Garbage Collection (GC) logs, especially Full GC.
Tuning Params:
§ Don’t use Soft Commit if you don’t need it, especially in Batch Loading.
§ Always explore tuning Solr for high performance, e.g. ramBufferSize, MergeFactor and HttpShardHandler’s
various configurations.
§ Use hashes in Redis to minimize memory usage.
Read the whole experience for more detail:
http://rahuldausa.wordpress.com/2014/05/16/real-time-search-on-40-billion-records-month-with-solr/
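As a small illustration of the Filter Query (fq) advice above, the sketch below keeps the user's search in `q` and moves the fixed restrictions into separate `fq` parameters, which Solr can cache and reuse independently. The field names (`keyword`, `country`, `month`) are hypothetical and not taken from the application's schema.

```python
from urllib.parse import urlencode, parse_qs


def build_query(user_query, filters):
    """Return a Solr query string with one fq parameter per filter."""
    params = [("q", user_query)]
    params += [("fq", f"{field}:{value}") for field, value in filters]
    return urlencode(params)


query_string = build_query("keyword:shoes", [("country", "US"), ("month", "2014-01")])
decoded = parse_qs(query_string)
```

Using a separate `fq` per restriction lets each filter be cached on its own, so a different keyword with the same country and month can still reuse the cached filters.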
Thank you! 
Twitter: @rahuldausa 
dynamicrahul2020@gmail.com 
http://www.linkedin.com/in/rahuldausa
