The riddles of the Sphinx: a full-text engine anatomy atlas
Who are you?
- Sphinx – a FOSS full-text search engine
- Good at playing ball
- Good at not playing ball
- Good at passing the ball to a team-mate
- Good at many other “inferior” games: “faceted” search, geosearch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives
What are you here for?
What will not be covered:
- No entry-level “what is Sphinx and what’s in it for me” overview
- No long quotes from the documentation
- No C++ architecture details
What will be:
- How it generally works inside
- How things can be optimized
- How things can be parallelized
Chapter 1. Engine insides
Total workflow
- Indexing first, searching second
- There are data sources (what to fetch, and where from)
- There are indexes: which data sources to index, how to process the incoming text, where to put the results
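To make the source/index split concrete, here is a minimal sphinx.conf sketch; the database credentials, table, and paths are hypothetical, and only a handful of the available directives are shown:

source products_src
{
    type     = mysql
    sql_host = localhost
    sql_user = sphinx
    sql_pass = secret   # hypothetical credentials
    sql_db   = shop
    # what to fetch, and where from
    sql_query     = SELECT id, title, body, vendor_id, price FROM products
    sql_attr_uint = vendor_id
    sql_attr_uint = price   # price in cents, to keep it an integer
}

index products
{
    source     = products_src            # which data sources to index
    morphology = stem_en                 # how to process the incoming text
    path       = /var/data/sphinx/products   # where to put the results
}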
How indexing works
In two acts, with an intermission
Phase 1 – collecting documents
- Fetch the documents (loop over the sources)
- Split the documents into words
- Process the words (morphology, *fixes)
- Replace the words with their wordids (CRC32/64)
- Emit a number of temp files
How indexing works
Phase 2 – sorting hits
- A hit (occurrence) is a (docid, wordid, wordpos) record
- Input is a number of partially sorted (by wordid) hit lists
- The incoming lists are merge-sorted
- Output is essentially a single fully sorted hit list
Intermezzo
- Collect and sort MVA values
- Sort ordinals
- Sort extern attributes
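Phase 2 is essentially a k-way merge. Here is an illustrative PHP sketch of the idea, not the actual implementation (the real indexer does this in C++ over temp files, not in-memory arrays); hits are modeled as (wordid, docid, wordpos) tuples so a plain lexicographic compare yields the final order:

<?php
// Merge several sorted hit lists into one fully sorted list.
// PHP compares equal-length numeric arrays element by element,
// which gives the lexicographic tuple order needed here.
function merge_hit_lists(array $lists): array
{
    $cursor = array_fill(0, count($lists), 0); // read position per list
    $out = [];
    while (true) {
        $best = null;
        $from = -1;
        foreach ($lists as $i => $list) {
            if (!isset($list[$cursor[$i]])) {
                continue; // this list is exhausted
            }
            $hit = $list[$cursor[$i]];
            if ($best === null || $hit < $best) {
                $best = $hit;
                $from = $i;
            }
        }
        if ($from < 0) {
            break; // all lists exhausted
        }
        $out[] = $best;
        $cursor[$from]++;
    }
    return $out; // a single fully sorted hit list
}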
Dumb & dumber
The index format is… simple: several sorted lists
- Dictionary (the complete list of wordids)
- Attributes (only if docinfo=extern)
- Document lists (for each keyword)
- Hit lists (for each keyword)
Everything is laid out linearly, which is good for IO
How searching works
For each local index:
- Build a list of candidates (documents that satisfy the full-text query)
- Filter (the analogy is WHERE)
- Rank (compute the documents’ relevance values)
- Sort (the analogy is ORDER BY)
- Group (the analogy is GROUP BY)
Then merge the results from all the local indexes
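In client API terms the stages map onto calls roughly like this (a sketch using the PHP client shipped with Sphinx; the index and attribute names are made up):

<?php
require 'sphinxapi.php';  // the PHP client shipped with Sphinx

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetFilter('category_id', array(7));              // Filter - WHERE
$cl->SetSortMode(SPH_SORT_EXTENDED, '@weight desc');  // Sort - ORDER BY
$cl->SetGroupBy('vendor_id', SPH_GROUPBY_ATTR);       // Group - GROUP BY
$res = $cl->Query('laptop', 'products');              // candidates + ranking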
1. Searching cost
Building the candidates list:
- 1 keyword = 1+ IO (document list)
- Boolean operations on document lists
- Cost is proportional (~) to the lists’ lengths, that is, to the sum of all the keyword frequencies
- In case of phrase/proximity/etc. search, there are also operations on hit lists – approx. 2x IO/CPU
Bottom line – “The Who” are really bad
2. Filtering cost
docinfo=inline
- Attributes are inlined in the document lists
- ALL the values are duplicated MANY times!
- Immediately accessible after the disk read
docinfo=extern
- Attributes are stored in a separate list (file)
- Fully cached in RAM
- Hashed by docid + binary search
Filtering itself is a simple loop over all filters
Cost ~ number of candidates and filters
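The choice is a per-index setting in sphinx.conf, e.g. for the hypothetical products index from the earlier sketch:

index products
{
    # ...
    docinfo = extern    # attributes in a separate file, fully cached in RAM
    # docinfo = inline  # attributes duplicated into every document list entry
}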
3. Ranking cost
Direct – depends on the ranker
- Accounting for keyword positions helps relevance, but costs extra resources – double impact!
- Cost ~ number of results
- Most expensive – phrase proximity + BM25
- Cheapest – none (weight=1)
Indirect – induced in the sorting
4. Sorting cost
- Cost ~ number of results
- Also depends on the sorting criteria (documents are supplied in @id asc order)
- Also depends on max_matches
- The bigger the max, the worse the server feels
- 1-10K is acceptable, 100K is way too much
- 10-20 is not enough (makes little sense)
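In the PHP API the per-query cap is the third argument of SetLimits(), continuing with the $cl client from the earlier sketch:

// One page of 20 results; keep at most the 1000 best matches
// server-side instead of the (larger) max_matches default
$cl->SetLimits(0, 20, 1000);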
5. Grouping cost
- Grouping is internally a kind of sorting
- Cost is affected by the number of results, too
- Cost is affected by max_matches, too
- Additionally, the max_matches setting affects @count and @distinct precision
Chapter 2. Optimizing things
How to optimize queries
- Partitioning the data
- Choosing ranking vs. sorting mode
- Filters vs. keywords
- Filters vs. manual MTF
- Multi-queries
- Last line of defense – the Three Big Buttons
1. Partitioning the data
A Swiss army knife for different tasks:
- Bound by indexing time? Partition, re-index only the recent changes
- Bound by filtering? Partition, search only the needed indexes
- Bound by CPU/HDD? Partition, move out to different cores/HDDs/boxes
1a. Partitioning vs. indexing
Vital to keep the balance right
- Under-partition – and indexing will be slow
- Over-partition – and searching will be slow
- 1-10 indexes work reasonably well
- Some users are fine with 50+ (30+24...)
- Some users are fine with 2000+ (!!!)
1b. Partitioning vs. filtering
Totally, 100% dependent on production query statistics
- Analyze your very own production logs
- Add comments if needed (3rd arg to Query(); see the example below)
Justified only if the amount of processed data is going to decrease significantly
- Moving out last week’s documents – yes
- Moving out English-only documents – no (!)
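The comment lands in searchd's query log, which is what makes per-feature statistics possible; the tag string is arbitrary (continuing with the earlier $cl client, hypothetical index name):

// 3rd argument of Query() tags this query in the query log
$res = $cl->Query('laptop', 'products_lastweek', 'search-page');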
1c. Partitioning vs. CPU/HDD
Use a distributed index, explicitly map the chunks to physical devices
Point searchd “at itself”:

index dist1
{
    type  = distributed
    local = chunk01
    agent = localhost:3312:chunk02
    agent = localhost:3312:chunk03
    agent = localhost:3312:chunk04
}
1c. How to find CPU/HDD bottlenecks
Three standard tools:
- vmstat – what is the CPU busy with, and how busy is it?
- oprofile – who, specifically, eats the CPU?
- iostat – how busy is the HDD?
Also use the logs, and the searchd --iostats option
Normally everything is clear (us/sy/bi/bo…), but!
- Caveat – the HDD might be iops-bound
- Caveat – CPU load from Sphinx might be induced and “hidden” in sy
2. Ranking
- Rankers can now be very different (the so-called rankers in extended2 mode)
- The default ranker – phrase+BM25 – accounts for keyword positions, and not for free
- Sometimes it’s OK to use a simpler ranker
- Sometimes @weight is ignored altogether (searching for ipod, sorting by price)
- Sometimes you can save on the ranker
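A sketch of the ipod-sorted-by-price case in API terms, continuing with the earlier $cl client; relevance is ignored anyway, so ranking can be skipped entirely:

// Sort by price; rank with weight=1 (no position/BM25 work)
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);
$cl->SetRankingMode(SPH_RANK_NONE);
$cl->SetSortMode(SPH_SORT_EXTENDED, 'price asc');
$res = $cl->Query('ipod', 'products');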
3. Filters vs. keywords
The well-known trick:
- When indexing, add a special fake keyword to the document (_authorid123)
- When searching, add it to the query
The obvious question – what’s faster, what’s better?
The simple answer – count your change before walking away from the cashier (i.e. measure both)
3. Filters vs. keywords
- Cost of searching ~ keyword frequencies
- Cost of filtering ~ number of candidates
- Searching – CPU+IO; filtering – CPU only
- Fake keyword frequency = filter value selectivity
- Frequent value + few candidates -> bad!
- Rare value + many candidates -> good!
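Both variants side by side (a sketch with a hypothetical author_id attribute; the fake keyword only works if it was emitted into the document at indexing time):

// Variant A: attribute filter - cost ~ number of candidates
$cl->ResetFilters();
$cl->SetFilter('author_id', array(123));
$resA = $cl->Query('laptop', 'products');

// Variant B: fake keyword - cost ~ that keyword's frequency
$cl->ResetFilters();
$resB = $cl->Query('laptop _authorid123', 'products');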
4. Filters vs. manual MTF
- Filters are looped over sequentially, in the order specified by the app!
- Narrowest filter – better at the start
- Widest filter – better at the end
- Does not matter if you use fake keywords
- Exercise for the reader – why?
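For example (hypothetical attributes; the narrow one-author filter goes before the wide date-range filter):

$cl->SetFilter('author_id', array(123));                   // narrow first
$cl->SetFilterRange('date_added', 1199145600, 1230768000); // wide last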
5. Multi-queries
- Any queries can be sent together in a batch
- Always saves on network roundtrips
- Sometimes lets the optimizer kick in
- An especially important and frequent case – different sorting/grouping modes
- 2x+ optimization for “faceted” searches
5. Multi-queries

$client = new SphinxClient ();
$q = "laptop"; // coming from a website user

$client->SetSortMode ( SPH_SORT_EXTENDED, "@weight desc" );
$client->AddQuery ( $q, "products" );

$client->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR );
$client->AddQuery ( $q, "products" );

$client->ResetGroupBy ();
$client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );
$client->SetLimits ( 0, 10 );
$client->AddQuery ( $q, "products" ); // the third query needs its own AddQuery() too

$result = $client->RunQueries ();
6. Three Big Buttons
If nothing else helps…
- Cutoff (see SetLimits()): forcibly stops searching after the first N matches; per-index, not overall
- MaxQueryTime (see SetMaxQueryTime()): forcibly stops searching after M milliseconds; per-index, not overall
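Both buttons in PHP API terms, continuing with the earlier $cl client:

// Stop at 1000 matches per index, or at 3000 ms per index,
// whichever comes first
$cl->SetLimits(0, 20, 1000, 1000);  // 4th argument is the cutoff
$cl->SetMaxQueryTime(3000);         // milliseconds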
6. Three Big Buttons
If nothing else helps… consulting!
- We can notice the unnoticed
- We can implement the unimplemented
Chapter 3. Parallelization sample
Combat mission
- Got ~160M cross-links
- Needed misc reports (by domain -> group-by)

*************************** 1. row ***************************
        domain_id: 440682
          link_id: 15
         url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750
           url_to: http://xbox360achievements.org/content/view/101/114/
           anchor: NULL
     from_site_id: 9835
    from_forum_id: 1818
   from_author_id: 282
  from_message_id: 2586
message_published: 2006-09-30 00:00:00
...
Tackling – one
Partitioned the data
- 8 boxes, 4x CPU, ~5M links per CPU
Used Sphinx
- In theory, we could have used MySQL
- In practice, way too complicated
- Would have resulted in 15-20M+ rows/CPU
- Would have resulted in “manual” aggregation code
Tackling – two
- Extracted the “interesting parts” of the URL when indexing, using a UDF
- Replaced the SELECT with a full-text query

*************************** 1. row ***************************
          url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750
urlize(url_from,0): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum
                    insidegamer$nl$forum$viewtopic.php
                    insidegamer$nl$forum$viewtopic.php$t=40750
urlize(url_from,1): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum
                    insidegamer$nl$forum$viewtopic.php
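The UDF itself is not shown in the deck; here is a rough PHP reconstruction of what urlize() appears to do, inferred purely from the sample output above (the exact rules are guesswork): emit the host with and without www, then every path prefix, with separators replaced by $ so each variant indexes as a single keyword; mode 0 also appends the query string.

<?php
function urlize(string $url, int $mode): string
{
    $p    = parse_url($url);
    $host = str_replace('.', '$', $p['host'] ?? '');
    $bare = preg_replace('/^www\$/', '', $host); // strip leading www
    $out  = [$host, $bare];
    $cur  = $bare;
    foreach (array_filter(explode('/', $p['path'] ?? '')) as $seg) {
        $cur  .= '$' . $seg; // grow the path prefix
        $out[] = $cur;
    }
    if ($mode === 0 && isset($p['query'])) {
        $out[] = $cur . '$' . $p['query'];
    }
    return implode(' ', $out);
}

echo urlize('http://www.insidegamer.nl/forum/viewtopic.php?t=40750', 0);
// www$insidegamer$nl insidegamer$nl insidegamer$nl$forum
// insidegamer$nl$forum$viewtopic.php
// insidegamer$nl$forum$viewtopic.php$t=40750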
Tackling – three
64 indexes
- 4 searchd instances per box, by CPU/HDD count
- 2 indexes (main+delta) per CPU
All searched in parallel
- The web box queries the main instance on each box
- The main instance queries itself and the other 3 copies
- Using 4 instances because of startup/update
- Using plain HDDs because of IO stepping
Results
The precision is acceptable
- “Rare” domains – precise results
- “Frequent” domains – precision within 0.5%
Average query time – 0.125 sec
- 90% of queries – under 0.227 sec
- 95% of queries – under 0.352 sec
- 99% of queries – under 2.888 sec
The end
