An Introduction to Solr

Searching with Solr
Tom Hill [email_address]
eBig Java SIG, June 18th, 2008

Tonight's Talk
- Tonight's talk should run about 1 1/2 hours
- About Solr
  - Background & overview
- Installing & bringing up Solr
- REST interface & Java client
- Configuring Solr

Why Implement Search?
- Does your site need search?
- Do you need to implement it, or is Google enough?
- Just text or structured data?
- Do you need to control ranking?

What is Solr?
- Web application for text search
- A wrapper around Apache Lucene
  - Lucene is a library (.jar file)
  - Solr is a web app (.war file)
- Written at CNet, now at Apache

What is Lucene?
- Text search library in Java
- Fast, feature rich
- Written by Doug Cutting
- Active Apache development community
- Versions also in C++, C#, Ruby, Python, Delphi, Lisp, etc.

Why Solr?
- Reliable
- Fast
- Supported
- Open Source
- Tunable Scoring

Solr Versions
- Current version is 1.2
  - A year old
- 1.3 is coming "sometime"
  - Large number of features in HEAD
- Use the latest from subversion for new projects

Alternatives to Solr
- Just use Google
- Use Lucene
- Use your database
- Commercial libraries
- Write your own

What Solr is Not
- A replacement for a relational database
- An embedded database*
- Fully cross platform :-(
  - Replication depends on Unix FS
  - Admin scripts are bash (minor)

Solr Sites
- CNet (Reviews & Products)
- Internet Archive (Collections)
- Netflix (Movies)
- Zvents (Events)
- StripSearch.ws (Comics)
- And many more

Features
- Here's a quick look at some of the features of Solr, as implemented on Zvents.com

 
Faceted Navigation
- Groups the results by category
- Can do multiple facets at once
- Returns matching counts

Additional Constraints
Synonyms, etc.
Solr Overview
Simple Webapp
- Web Servers [1..n]
- Database Master
- Database Slaves [0..n]
- Solr Master
- Solr Slaves [0..n]

Scaling Solr
- Master/slave architecture
- Writes go to the master, reads go to the slaves
- Replication: periodic transfers, not continuous
  - Uses rsync

Updates
- Updates flush caches, which is bad for performance
- The master is therefore much slower than the slaves
- So send all queries to the slaves
- Depends on your update rates

Solr's Data Model
- Solr maintains a collection of documents
- A document is a collection of fields & values
- A field can occur multiple times in a document
- Documents are immutable. They can be deleted, and a new version added, however.

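To make the data model concrete, here is a toy in-memory sketch. It is not Solr's code: `ToyIndex` and its methods are names made up for illustration. It only shows the three ideas on the slide: a document is a set of fields, a field may hold several values, and an "update" is really a delete followed by an add.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the slide's data model; hypothetical class, not part of Solr.
final class ToyIndex {
    // document id -> (field name -> values); a field may hold several values
    private final Map<String, Map<String, List<String>>> docs = new LinkedHashMap<>();

    // Build an immutable document from alternating name/value arguments.
    static Map<String, List<String>> doc(String... nameValuePairs) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            fields.computeIfAbsent(nameValuePairs[i], k -> new ArrayList<>())
                  .add(nameValuePairs[i + 1]);
        }
        fields.replaceAll((name, values) -> List.copyOf(values)); // freeze the value lists
        return Collections.unmodifiableMap(fields);
    }

    // Documents are never edited in place: drop the old version, add the new one.
    void replace(String id, Map<String, List<String>> newVersion) {
        docs.remove(id);
        docs.put(id, newVersion);
    }

    Map<String, List<String>> get(String id) {
        return docs.get(id);
    }

    int size() {
        return docs.size();
    }
}
```

For example, `ToyIndex.doc("skills", "Perl", "skills", "Java")` yields one document whose `skills` field carries two values.
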
Querying
- HTTP request:
    http://localhost:8080/comix/select/?q=java

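Since a query is just an HTTP GET, building one by hand mostly comes down to URL-encoding the parameters. A minimal sketch, assuming the base URL from the slide; the class name `SelectUrl` is invented for illustration, and `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns a parameter map into a /select/ URL.
final class SelectUrl {
    // baseUrl is the webapp root, e.g. "http://localhost:8080/comix"
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Encode both sides so characters like ':' or '"' survive the trip.
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select/?" + query;
    }
}
```

With a single parameter `q=java` this reproduces the URL on the slide; a query such as `city:paris` comes out as `q=city%3Aparis`.
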
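Solr Query Syntax
- Lucene query syntax + a bit
    paris                                 (term in the default field)
    city:paris                            (term in a named field)
    title:"The Right Way" AND text:go     (phrase plus boolean operator)
    id:[* TO *]                           (open-ended range: any document with an id)

Solr Query Syntax II
    -inStock:false     (exclude matching documents)
    te?t               (single-character wildcard)
    theat*             (multi-character wildcard)
    te*t               (wildcard in the middle of a term)
    test~              (fuzzy match)
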
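Because characters like `:`, `*`, `?` and `~` are operators, raw user input that should be treated as plain text needs them escaped with a backslash. A minimal sketch: the class name is made up, and the character list is my recollection of the Lucene query parser's special characters, so check it against the Lucene docs for your version.

```java
// Hypothetical helper that backslash-escapes query syntax characters.
final class QueryEscape {
    // Assumed set of special characters: \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String userInput) {
        StringBuilder out = new StringBuilder();
        for (char c : userInput.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\'); // prefix each operator character with a backslash
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

So `te?t` becomes `te\?t` and searches for the literal text rather than acting as a wildcard.
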
Using Solr
- Getting data into Solr
- Getting data out of Solr

Getting Data Into Solr
- POST it.
- Each <doc> element is one document; each <field> carries a name (employeeId) and a value (05991).
    <add>
      <doc>
        <field name="employeeId">05991</field>
        <field name="office">Bridgewater</field>
        <field name="skills">Perl</field>
        <field name="skills">Java</field>
      </doc>
      [<doc> ... </doc>[<doc> ... </doc>]]
    </add>

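If you build the `<add>` message yourself, the two things to get right are repeating the `<field>` element for multi-valued fields and XML-escaping the values. A sketch under those assumptions; `AddXml` is a hypothetical name, not a Solr class.

```java
import java.util.List;
import java.util.Map;

// Hypothetical builder for the <add><doc>...</doc></add> message.
final class AddXml {
    // Escape the characters that would otherwise break the XML.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // fields: field name -> one or more values; multi-valued fields repeat the element.
    static String addDoc(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            for (String value : e.getValue()) {
                xml.append("<field name=\"").append(escape(e.getKey())).append("\">")
                   .append(escape(value))
                   .append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }
}
```

Passing `skills -> [Perl, Java]` produces two `<field name="skills">` elements, just like the slide.
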
Committing
- Nothing shows up in the index until you commit
- You can just POST <commit/> to http://host:port/solr/update

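Posting the commit from plain Java might look like the sketch below. It assumes Java 11+ (for `java.net.http`) and a Solr instance at whatever update URL you pass in; `CommitRequest` is a made-up name. Building the request is separated from sending it so the request can be inspected without a server.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper that POSTs <commit/> to Solr's update URL.
final class CommitRequest {
    // Build the POST without sending it.
    static HttpRequest build(String updateUrl) {
        return HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString("<commit/>"))
                .build();
    }

    // Send the commit and return the HTTP status code. Needs a running Solr.
    static int send(String updateUrl) throws Exception {
        HttpResponse<String> rsp = HttpClient.newHttpClient()
                .send(build(updateUrl), HttpResponse.BodyHandlers.ofString());
        return rsp.statusCode();
    }
}
```
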
Getting Data Out
    http://localhost:8080/comix/select/?q=data&indent=on

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="q">data</str>
        </lst>
      </lst>
      <result name="response" numFound="2" start="0">
        <doc>
          <str name="id">strip.3136</str>
          <str name="release_date">1992-05-07</str>
          <date name="timestamp">2008-02-28T10:06:01.682Z</date>
          <str name="type">strip</str>
        </doc>
        ...
      </result>
    </response>

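Reading that XML response by hand needs nothing beyond the JDK's DOM parser. A rough sketch, assuming the response shape shown on the slide (a `<result>` element with a `numFound` attribute, and `<str name="id">` inside each `<doc>`); the class and method names are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for the XML response format shown on the slide.
final class ResponseXml {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("response is not well-formed XML", e);
        }
    }

    // Total hit count, taken from <result numFound="...">.
    static long numFound(Document rsp) {
        Element result = (Element) rsp.getElementsByTagName("result").item(0);
        return Long.parseLong(result.getAttribute("numFound"));
    }

    // Values of <str name="id"> that sit directly inside a <doc>.
    static List<String> ids(Document rsp) {
        List<String> ids = new ArrayList<>();
        NodeList strs = rsp.getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element str = (Element) strs.item(i);
            if ("id".equals(str.getAttribute("name"))
                    && "doc".equals(str.getParentNode().getNodeName())) {
                ids.add(str.getTextContent());
            }
        }
        return ids;
    }
}
```

Note that `numFound` is the total number of matches, which can be larger than the number of `<doc>` elements actually returned.
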
Getting Data Out
- JSON format
    http://localhost:8080/comix/select/?q=data&indent=on

    {
      "responseHeader":{
        "status":0,
        "QTime":1,
        "params":{
          "wt":"json",
          "rows":["1", "1"],
          "start":"0",
          "indent":"on",
          "q":"data",
          "version":"2.2"}},
      "response":{"numFound":2,"start":0,"docs":[
        {
          "feature_id":"3",
          "release_date":"1992-05-07",
          "id":"strip.3136",
          "timestamp":"2008-02-28T10:06:01.682Z"}]
      }}

Debug Query Option
- Add &debugQuery=on to request params
- Returns parsed form of query
    <str name="rawquerystring">c.i.a</str>
    <str name="querystring">c.i.a</str>
    <str name="parsedquery">PhraseQuery(text:"c i a")</str>
    <str name="parsedquery_toString">text:"c i a"</str>

Debug Query Option II
- Add &debugQuery=on to request params
- Returns scoring information
    <str name="id=strip.2781,internal_docid=29854">
    2.6219895 = (MATCH) fieldWeight(text:calvin in 29854), product of:
      1.0 = tf(termFreq(text:calvin)=1)
      2.6219895 = idf(docFreq=6222)
      1.0 = fieldNorm(field=text, doc=29854)
    </str>
    <str name="id=strip.4078,internal_docid=31151">
    2.6219895 = (MATCH) fieldWeight(text:calvin in 31151), product of:
      1.0 = tf(termFreq(text:calvin)=1)
      2.6219895 = idf(docFreq=6222)
      1.0 = fieldNorm(field=text, doc=31151)
    </str>

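The scoring explanation above is simply a product of three numbers: tf * idf * fieldNorm. The sketch below redoes that arithmetic. The `tf` and `idf` formulas are the ones I remember from Lucene's classic default similarity (square root of the term frequency, and 1 + ln(numDocs / (docFreq + 1))); treat them as an assumption. The slide does not show the total document count, so the idf value is taken from the output rather than recomputed.

```java
// Hypothetical helper reproducing the arithmetic in the explain output.
final class ClassicScore {
    // Assumed classic tf: square root of the raw term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Assumed classic idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // fieldWeight as printed by explain: the product of the three factors.
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }
}
```

With the slide's values, `fieldWeight(tf(1), 2.6219895, 1.0)` gives back 2.6219895, which is why both documents get the same score.
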
Deleting Data
- POST
    <delete><id>35</id></delete>
    <delete><query>city:paris</query></delete>

Command Line Control
    curl http://localhost:8983/solr/update -H "Content-type: text/xml" --data-binary '<commit/>'

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">20</int>
    </lst>
    </response>

Solr in 3 minutes!
- Download Solr from Apache
- Untar
- "ant example"
- Start the example app
- Load data into Solr
- Query

Solr in Ten Minutes
- In $CATALINA_HOME/conf/Catalina/localhost/foo.xml:
    <Context docBase="/var/solr/apache-solr-1.2.0.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String" value="/var/solr" override="true"/>
    </Context>
- Copy Solr's example/solr dir to /var/solr
- Edit schema.xml and solrconfig.xml
- Load data into Solr

Directory Layout
- ${solr.home}/conf
  - schema.xml
  - solrconfig.xml
- ${solr.home}/data
- ${solr.home}/logs
- ${solr.home}/bin

Java Solr Client
- Called SolrJ
- Not in Solr 1.2; I grabbed it from HEAD in svn
- Works with Solr 1.2
- Add/Delete/Query/Commit/Optimize

Adding Docs w/SolrJ
- Given Map<String, String> fields;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    // url is the base URL of the Solr webapp
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrInputDocument doc = new SolrInputDocument();
    // Copy every map entry into the document as a field
    for (Map.Entry<String, String> e : fields.entrySet()) {
        doc.addField(e.getKey(), e.getValue());
    }
    UpdateResponse res = server.add(doc);

Deleting Docs w/SolrJ
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    UpdateResponse res;
    res = server.deleteById("100");             // one document, by unique id
    res = server.deleteByQuery("city:paris");   // everything matching a query

Simple Query
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrQuery query = new SolrQuery();
    query.setQuery("dance");
    QueryResponse rsp = server.query(query);

More Interesting Query
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
    SolrQuery query = new SolrQuery();
    query.setQuery("dance");
    query.setFacet(true);                                  // turn faceting on
    query.addFacetField("city");                           // facet on the city field
    query.setFacetMinCount(1);                             // skip values with no matches
    query.addSortField("price", SolrQuery.ORDER.asc);      // cheapest first
    QueryResponse rsp = server.query(query);

Query Responses
    QueryResponse qr = server.query(query);
    SolrDocumentList docs = qr.getResults();               // the matching documents
    List<FacetField> lf = qr.getFacetFields();             // one entry per facet field
    for (FacetField ff : lf) {
        String fieldName = ff.getName();
        List<FacetField.Count> lc = ff.getValues();
        for (FacetField.Count c : lc) {
            String countName = c.getName();                // the facet value
            long count = c.getCount();                     // number of matching results
        }
    }

Other Commands
- Commit: server.commit()
- Optimize: server.optimize()
- Not too complicated!

Request Handlers
- Request handlers define how the query is processed
- Two main types
  - StandardRequestHandler
  - DisMaxRequestHandler
- You can implement your own
- Changing in Solr 1.3

"Standard" Request Handler
- Accepts Solr query syntax
- I tend to use it for my queries, not user queries

DisMaxRequestHandler
- Recommended for user queries
- Allows simple user keywords to be applied to multiple fields, with weighting
- Boost functions
- Boost queries

Boost Functions
- Allow you to influence scoring at run time
- Computationally expensive!
- Really useful for tuning scoring
- linear(x,2,4) returns 2*x+4
  - x is a field

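The arithmetic behind boost functions is simple. The sketch below mirrors `linear` from the slide and `recip`, which shows up in the DisMax config example later in the deck. The `recip` definition, a / (m*x + b), is how I recall it from the Solr function query documentation, so treat it as an assumption. These are plain Java stand-ins, not Solr's implementation.

```java
// Hypothetical stand-ins for two Solr boost functions, applied to a field value x.
final class BoostFunctions {
    // linear(x, m, c) = m*x + c
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    // recip(x, m, a, b) = a / (m*x + b)   (assumed definition)
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }
}
```

For a field value of 3, `linear(3, 2, 4)` is 10. With the parameters from the config example, `recip(x, 1, 1000, 1000)` starts at 1.0 for x = 0 and falls to 0.5 at x = 1000, so smaller values get the bigger boost.
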
The Solr Schema
- schema.xml
- Defines types used in this webapp
- Defines the fields and their types
- Defines "copyFields"
- READ THE EXAMPLE SCHEMA.XML

Types
- Types define processing for a field
  - How the words are split (Whitespace? Punctuation? CIA != C.I.A.)
  - Stemming
  - Case folding, etc.
- Predefined: date, int, float, etc.

Analysis: Index and Query Time
- Types have two modes
  - Index time
  - Query time

Simple Text Field
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
    </fieldType>

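To get a feel for what the index-time analyzer above does, here is a toy version of the same chain: split on whitespace, then drop stop words while ignoring case. It is a sketch only. `ToyAnalyzer` is a made-up class, the stop word set is passed in rather than read from stopwords.txt, and the query-time synonym step is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy imitation of a whitespace tokenizer followed by a stop filter.
final class ToyAnalyzer {
    // stopWords must be lower case; matching ignores the token's case.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                continue; // stop word: dropped from the token stream
            }
            tokens.add(token); // no case folding here, the token keeps its case
        }
        return tokens;
    }
}
```

With stop words `the` and `to`, the text `The Right Way to go` becomes `[Right, Way, go]`. Punctuation is untouched, so `C.I.A.` stays a single token.
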
Analysis & Facets
- Make sure to use an untokenized field for faceting
- "San Jose" != "San" "Jose"

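Here is why the untokenized field matters. The sketch counts facet values two ways over the same list of cities: once on the whole value, once on whitespace tokens. The class is hypothetical and only mimics the counting, not Solr's faceting code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy facet counter comparing untokenized and tokenized field values.
final class FacetCounts {
    // Facet on the whole field value, as an untokenized field would.
    static Map<String, Integer> byValue(List<String> cities) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String city : cities) {
            counts.merge(city, 1, Integer::sum);
        }
        return counts;
    }

    // Facet on whitespace tokens, as a tokenized text field would.
    static Map<String, Integer> byToken(List<String> cities) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String city : cities) {
            for (String token : city.split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For `San Jose`, `San Francisco`, `San Jose`, faceting by value gives `San Jose` 2 and `San Francisco` 1, while faceting by token gives `San` 3, `Jose` 2 and `Francisco` 1, which is useless as a city list.
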
Fields
- Elements of a document
- Both predefined & dynamic
- Fields may occur multiple times
- May be indexed and/or stored

Example Fields
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>

Copy Fields
- Two main uses
  - To analyze a field in two different ways
  - To concatenate fields

The Solr Config File
- solrconfig.xml
- Defines request handlers, defaults, caches
- Read the example solrconfig.xml

Configuring DisMax
- Parameter defaults set in solrconfig.xml
- Can be overridden in each request
- Except for params labeled invariant

DisMax Config Example
    <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">
          text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
        </str>
        <str name="pf">
          text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
        </str>
        ...
    </requestHandler>

DisMax Config Example
    <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
        ...
        <str name="bf">
          ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3
        </str>
        <str name="fl">
          id,name,price,score
        </str>
        ...
    </requestHandler>

DisMax Config Example
    <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
        ...
        <str name="mm">
          2&lt;-1 5&lt;-2 6&lt;90%
        </str>
        <int name="ps">100</int>
        <str name="q.alt">*:*</str>
      </lst>
    </requestHandler>

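The `mm` ("minimum should match") value above is the least obvious setting in this config. As I read the DisMax documentation, `2<-1 5<-2 6<90%` means: with up to 2 optional clauses all are required, with 3 to 5 all but one, with 6 all but two, and above 6 at least 90% (rounded down). The sketch below implements that reading. It is my own approximation with a made-up class name, not Solr's code, so check the edge cases against your Solr version.

```java
// Hypothetical evaluator for a DisMax-style "mm" specification.
final class MinShouldMatch {
    // Returns how many of the optional clauses must match.
    static int required(int optionalClauses, String spec) {
        int result = optionalClauses;
        spec = spec.trim();
        if (spec.contains("<")) {
            // Conditional form: "bound<rule" pairs, in ascending order of bound.
            for (String part : spec.split("\\s+")) {
                String[] boundAndRule = part.split("<", 2);
                int bound = Integer.parseInt(boundAndRule[0]);
                if (optionalClauses <= bound) {
                    return result; // this rule and later ones do not apply
                }
                result = required(optionalClauses, boundAndRule[1]);
            }
            return result;
        }
        int calc;
        if (spec.endsWith("%")) {
            int percent = Integer.parseInt(spec.substring(0, spec.length() - 1));
            calc = optionalClauses * percent / 100; // integer division rounds toward zero
        } else {
            calc = Integer.parseInt(spec);
        }
        // A negative value means "this many clauses may be missing".
        result = calc < 0 ? optionalClauses + calc : calc;
        return Math.max(0, Math.min(optionalClauses, result));
    }
}
```

With the spec from the slide, a 2-word query needs both words, a 3-word query needs 2, a 6-word query needs 4, and a 10-word query needs 9.
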
Wrap Up
Resources
- Solr: http://lucene.apache.org/solr/
  - wiki, mailing list, jira (bugs/features)
- Lucene: http://lucene.apache.org/

Lucene In Action

Building Search Applications with Lucene, LingPipe and Gate
- Manu Konchady

Other Presentations
- Yonik Seeley's Solr & Lucene: http://people.apache.org/~yonik/presentations/
- Slideshare.net: search for solr, or search for lucene

Thanks!
- Thanks for coming.
- Feel free to email me if you have questions about Solr
- Tom Hill [email_address]

Extra Slides
- Things I didn't have time for in the presentation. Some of them unfinished.

Search Engines are not the Same as Users
- Search engines have different usage patterns than users

Response Writers
    http://localhost:8983/solr/select/?q=text_t%3Atiger&version=2.2&start=0&rows=10&indent=on&wt=ruby
    http://localhost:8983/solr/select/?q=text_t%3Atiger&version=2.2&start=0&rows=10&indent=on&wt=xml

Explain
- Just why did the documents come up in that order?

Data Matters
- GIGO (garbage in, garbage out)
- The better the data is, the better the search will be

Watch Your Caches
- Just like any other app, check your statistics
- What's the hit rate for your caches?

Setting Up Replication
- Run rsyncd on the master
- Run snapshot on the master at intervals
- Run snappuller on the slaves at (different) intervals
- Scripts don't print errors!
  - Check the logs
  - Use bash -xv

Autowarming
- Runs after an update to the index
  - Updates flush caches
  - Runs some queries to populate caches again
- Can be a problem with frequent updates
- Don't autowarm the master if updating lots

Tour Of Solr's Web UI

Programming Collective Intelligence
- A really fun book

Geographic Searching
- Local Lucene & Local Solr: http://locallucene.wiki.sourceforge.net
- There's also geolucene, but it's not being actively developed, as far as I can tell: http://www.gossamer-threads.com/lists/lucene/java-dev/53378

Are there commits pending?
    http://localhost:8983/solr/admin/stats.jsp#update

Analysis Explanation
    http://localhost:8983/comix/admin/analysis.jsp?name=text&val=wi-fi


An Introduction to Solr

  • 1. Searching with Solr Tom Hill [email_address] eBig Java SIG, June 18th, 2008
  • 2. Tonight's Talk Tonight's Talk should run about 1 1/2 hours About Solr Background & overview Installing & Bringing Up Solr Rest Interface & Java Client Configuring Solr
  • 3. Why Implement Search? Does your site need search? Do you need to implement it, or is Google enough? Just text or Structured Data? Do you need to control ranking?
  • 4. What is Solr? Web application for text search A wrapper around Apache Lucene Lucene is a library (.jar file) Solr is a web app (.war file) Written at CNet, now at Apache
  • 5. What is Lucene? Text search library in Java Fast, feature rich. Written by Doug Cutting Active Apache development community Versions also in C++, C#, Ruby, Python, Delphi, Lisp, etc...
  • 6. Why Solr? Reliable Fast Supported Open Source Tunable Scoring
  • 7. Solr Versions Current Version is 1.2 A year old 1.3 is coming &quot;sometime&quot; Large number of features in HEAD Use the latest from subversion for new projects
  • 8. Alternatives to Solr Just Use Google Use Lucene Use Your Database Commercial Libraries Write your own
  • 9. What Solr is Not A replacement for a relational database An embedded database* Fully cross platform :-( Replication depends on unix FS Admin scripts are bash(minor)
  • 10. Solr Sites CNet (Reviews & Products) Internet Archive (Collections) Netflix (Movies) Zvents (Events) StripSearch.ws (Comics) And many more
  • 11. Features Here's a quick look at some of the features of Solr, as implemented on Zvents.com
  • 12.  
  • 13. Faceted Navigation Groups the results by category Can do multiple facets at once Returns matching counts
  • 17. Simple Webapp Web Servers[1..n] Database Master Database Slaves[0..n] Solr Master Solr Slaves[0..n]
  • 18. Scaling Solr Master/Slave architecture Writes to master/reads to slaves Replication: Periodic transfers, not continuous Rsync
  • 19. Updates Updates flush caches, bad for performance Master therefor much slower than slaves So send all queries to slaves Depends on your update rates
  • 20. Solr's Data Model Solr maintains a collection of documents A document is a collection of fields & values A field can occur multiple times in a document Documents are immutable. They can be deleted, and a new version added, however.
  • 21. Querying Http request http://localhost:8080/comix/select/?q=java
  • 22. Solr Query Syntax Lucene Query Syntax + a bit paris city:paris title:&quot;The Right Way&quot; AND text:go id:[* TO *]
  • 23. Solr Query Syntax II -inStock:false te?t theat* te*t test~
  • 24. Using Solr Getting data into Solr Getting data out of Solr
  • 25. Getting Data Into Solr POST it. <add> <doc> <field name=&quot;employeeId&quot;>05991</field> <field name=&quot;office&quot;>Bridgewater</field> <field name=&quot;skills&quot;>Perl</field> <field name=&quot;skills&quot;>Java</field> </doc> [<doc> ... </doc>[<doc> ... </doc>]] </add>
  • 26. Getting Data Into Solr POST it. <add> < doc > <field name=&quot;employeeId&quot;>05991</field> <field name=&quot;office&quot;>Bridgewater</field> <field name=&quot;skills&quot;>Perl</field> <field name=&quot;skills&quot;>Java</field> </ doc > [<doc> ... </doc>[<doc> ... </doc>]] </add>
  • 27. Getting Data Into Solr POST it. <add> <doc> <field name=&quot; employeeId &quot;> 05991 </field> <field name=&quot;office&quot;>Bridgewater</field> <field name=&quot;skills&quot;>Perl</field> <field name=&quot;skills&quot;>Java</field> </doc> [<doc> ... </doc>[<doc> ... </doc>]] </add>
  • 28. Committing Nothing shows up in the index until you commit You can just POST <commit/> to http:// host : port /solr/update
  • 29. Getting Data Out http://localhost:8080/comix/select/?q=data&indent=on <response> <lst name=&quot;responseHeader&quot;> <int name=&quot;status&quot;>0</int> <int name=&quot;QTime&quot;>0</int> <lst name=&quot;params&quot;> <str name=&quot;indent&quot;>on</str> <str name=&quot;q&quot;>data</str> </lst> </lst> <result name=&quot;response&quot; numFound=&quot;2&quot; start=&quot;0&quot;> <doc> <str name=&quot;id&quot;>strip.3136</str> <str name=&quot;release_date&quot;>1992-05-07</str> <date name=&quot;timestamp&quot;>2008-02-28T10:06:01.682Z</date> <str name=&quot;type&quot;>strip</str> </doc> </result> </response>
  • 30. Getting Data Out http://localhost:8080/comix/select/?q=data&indent=on <response> <lst name=&quot;responseHeader&quot;> <int name=&quot;status&quot;>0</int> <int name=&quot;QTime&quot;>0</int> <lst name=&quot;params&quot;> <str name=&quot;indent&quot;>on</str> <str name=&quot;q&quot;>data</str> </lst> </lst> <result name=&quot;response&quot; numFound=&quot;2&quot; start=&quot;0&quot;> <doc> <str name=&quot;id&quot;>strip.3136</str> <str name=&quot;release_date&quot;>1992-05-07</str> <date name=&quot;timestamp&quot;>2008-02-28T10:06:01.682Z</date> <str name=&quot;type&quot;>strip</str> </doc> </result> </response>
  • 31. Getting Data Out http://localhost:8080/comix/select/?q=data&indent=on <response> <lst name=&quot;responseHeader&quot;> <int name=&quot;status&quot;>0</int> <int name=&quot;QTime&quot;>0</int> <lst name=&quot;params&quot;> <str name=&quot;indent&quot;>on</str> <str name=&quot;q&quot;>data</str> </lst> </lst> <result name=&quot;response&quot; numFound=&quot;2&quot; start=&quot;0&quot;> <doc> <str name=&quot;id&quot;>strip.3136</str> <str name=&quot;release_date&quot;>1992-05-07</str> <date name=&quot;timestamp&quot;>2008-02-28T10:06:01.682Z</date> <str name=&quot;type&quot;>strip</str> </doc> ... </result> </response>
  • 32. Getting Data Out http://localhost:8080/comix/select/?q=data&indent=on { &quot;responseHeader&quot;:{ &quot;status&quot;:0, &quot;QTime&quot;:1, &quot;params&quot;:{ &quot;wt&quot;:&quot;json&quot;, &quot;rows&quot;:[&quot;1&quot;, &quot;1&quot;], &quot;start&quot;:&quot;0&quot;, &quot;indent&quot;:&quot;on&quot;, &quot;q&quot;:&quot;data&quot;, &quot;version&quot;:&quot;2.2&quot;}}, &quot;response&quot;:{&quot;numFound&quot;:2,&quot;start&quot;:0,&quot;docs&quot;:[ { &quot;feature_id&quot;:&quot;3&quot;, &quot;release_date&quot;:&quot;1992-05-07&quot;, &quot;id&quot;:&quot;strip.3136&quot;, &quot;timestamp&quot;:&quot;2008-02-28T10:06:01.682Z&quot;}] }} JSON format
  • 33. Debug Query Option Add &debugQuery=on to request params Returns parsed form of query <str name=&quot;rawquerystring&quot;>c.i.a</str><str name=&quot;querystring&quot;>c.i.a</str><str name=&quot;parsedquery&quot;>PhraseQuery(text:&quot;c i a&quot;)</str><str name=&quot;parsedquery_toString&quot;>text:&quot;c i a&quot;</str>
  • 34. Debug Query Option II Add &debugQuery=on to request params Returns scoring information <str name=&quot;id=strip.2781,internal_docid=29854&quot;> 2.6219895 = (MATCH) fieldWeight(text:calvin in 29854), product of: 1.0 = tf(termFreq(text:calvin)=1) 2.6219895 = idf(docFreq=6222) 1.0 = fieldNorm(field=text, doc=29854) </str> <str name=&quot;id=strip.4078,internal_docid=31151&quot;> 2.6219895 = (MATCH) fieldWeight(text:calvin in 31151), product of: 1.0 = tf(termFreq(text:calvin)=1) 2.6219895 = idf(docFreq=6222) 1.0 = fieldNorm(field=text, doc=31151) </str>
  • 35. Deleting Data POST <delete><id>35</id></delete> <delete><query>city:paris</query></delete>
  • 36. Command Line Control curl http://localhost:8983/solr/update -H &quot;Content-type: text/xml&quot; --data-binary '<commit/>' <?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?><response><lst name=&quot;responseHeader&quot;> <int name=&quot;status&quot;>0</int> <int name=&quot;QTime&quot;>20</int> </lst></response> </lst></response> </lst></response>
  • 37. Solr in 3 minutes! Download Solr from Apache Untar &quot;ant example&quot; Start the example app Load data into Solr Query
  • 38. Solr in Ten Minutes <Context docBase=&quot;/var/solr/apache-solr-1.2.0.war&quot; debug=&quot;0&quot; crossContext=&quot;true&quot; > <Environment name=&quot;solr/home&quot; type=&quot;java.lang.String&quot; value=&quot;/var/solr&quot; override=&quot;true&quot; /></Context> Copy Solr's example/solr dir to /var/solr Edit schema.xml and solrconfig.xml Load data into Solr In $CATALINA_HOME/conf/Catalina/localhost/foo.xml
  • 39. Directory Layout ${solr.home}/conf schema.xml solrconfig.xml ${solr.home}/data ${solr.home}/logs ${solr.home}/bin
  • 40. Java Solr Client Called SolrJ Not in Solr 1.2. I grabbed it from HEAD in svn Works with Solr 1.2 Add/Delete/Query/Commit/Optimize
  • 41. Adding Docs w/SolrJ Given Map<String, String> fields; CommonsHttpSolrServer server = new CommonsHttpSolrServer(url); SolrInputDocument doc = new SolrInputDocument(); for (Map.Entry<String, String> e : fields.entrySet()) { doc.addField(e.getKey(), e.getValue()); } UpdateResponse res = server.add(doc);
  • 42. Deleting Docs w/SolrJ CommonsHttpSolrServer server = new CommonsHttpSolrServer(url); UpdateResponse res; res = server.deleteById("100"); res = server.deleteByQuery("city:paris");
  • 43. Simple Query CommonsHttpSolrServer server = new CommonsHttpSolrServer(url); SolrQuery query = new SolrQuery(); query.setQuery("dance"); QueryResponse rsp = server.query(query);
  • 44. More Interesting Query CommonsHttpSolrServer server = new CommonsHttpSolrServer(url); SolrQuery query = new SolrQuery(); query.setQuery("dance"); query.setFacet(true); query.addFacetField("city"); query.setFacetMinCount(1); query.addSortField("price", SolrQuery.ORDER.asc); QueryResponse rsp = server.query(query);
  • 45. Query Responses QueryResponse qr = server.query(query); SolrDocumentList docs = qr.getResults(); List<FacetField> lf = qr.getFacetFields(); for (FacetField ff : lf) { String fieldName = ff.getName(); List<FacetField.Count> lc = ff.getValues(); for (FacetField.Count c : lc) { String countName = c.getName(); long count = c.getCount(); } }
  • 46. Other Commands Commit server.commit() Optimize server.optimize() Not too complicated!
  • 47. Request Handlers Request handlers define how the query is processed. Two main types StandardRequestHandler DisMaxRequestHandler You can implement your own Changing in Solr 1.3
  • 48. "Standard" Request Handler Accepts Solr Query Syntax I tend to use it for my queries, not user queries.
  • 49. DisMaxRequestHandler Recommended for user queries Allows simple user keywords to be applied to multiple fields, with weighting. Boost Functions Boost Queries
  • 50. Boost Functions Allow you to influence scoring at run time Computationally Expensive! Really useful for tuning scoring linear(x,2,4) returns 2*x+4 x is a field
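To make the boost function arithmetic concrete, here is a small sketch of linear as described on slide 50, plus recip, which shows up in the bf setting of the DisMax config slides. The class name is invented; the recip formula a / (m*x + b) is my understanding of Solr's function query documentation, not something stated in the slides.

```java
// Sketch: plain-Java versions of two boost functions. BoostFunctionSketch is a made-up name.
final class BoostFunctionSketch {

    // linear(x, m, c) = m*x + c, so linear(x,2,4) is 2*x+4 as on the slide.
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    // recip(x, m, a, b) = a / (m*x + b) (assumed from the Solr function query docs).
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        System.out.println(linear(3, 2, 4));             // prints 10.0
        System.out.println(recip(0, 1, 1000, 1000));     // prints 1.0
        System.out.println(recip(1000, 1, 1000, 1000));  // prints 0.5
    }
}
```

In Solr, x is a field value (or another function of one) evaluated per matching document, which is why these functions are computationally expensive.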
  • 51. The Solr Schema schema.xml Defines types used in this webapp Defines the fields and their types Defines "copyFields" READ THE EXAMPLE SCHEMA.XML
  • 52. Types Types define processing for a field How the words are split (Whitespace? Punctuation? CIA != C.I.A.) Stemming Case Folding, etc. Predefined date, int, float, etc.
  • 53. Analysis: Index and Query Time Types have two modes Index Time Query Time
  • 54. Simple Text Field <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/></analyzer><analyzer type="query"><tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/></analyzer></fieldType>
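The field type on slide 54 runs a shorter chain at index time than at query time. The sketch below imitates the two chains in plain Java to show the difference. It is a toy model, not Lucene's analysis API: the names are hypothetical, the stopword and synonym data are passed in rather than read from stopwords.txt and synonyms.txt, and synonyms are simply appended after the original token instead of sharing its position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of the index-time and query-time chains. AnalysisSketch is a made-up name.
final class AnalysisSketch {

    // Stand-in for WhitespaceTokenizerFactory: split on runs of whitespace.
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return new ArrayList<>();
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Stand-in for StopFilterFactory with ignoreCase="true".
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t.toLowerCase(Locale.ROOT))) out.add(t);
        }
        return out;
    }

    // Stand-in for SynonymFilterFactory with expand="true": keep the token, add its synonyms.
    static List<String> expandSynonyms(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            out.addAll(synonyms.getOrDefault(t.toLowerCase(Locale.ROOT), List.of()));
        }
        return out;
    }

    // Index-time chain: tokenize, then drop stopwords.
    static List<String> analyzeForIndex(String text, Set<String> stopwords) {
        return removeStopwords(whitespaceTokenize(text), stopwords);
    }

    // Query-time chain: tokenize, expand synonyms, then drop stopwords.
    static List<String> analyzeForQuery(String text, Map<String, List<String>> synonyms,
                                        Set<String> stopwords) {
        return removeStopwords(expandSynonyms(whitespaceTokenize(text), synonyms), stopwords);
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the");
        Map<String, List<String>> syn = Map.of("tv", List.of("television"));
        System.out.println(analyzeForIndex("the tv show", stop));      // [tv, show]
        System.out.println(analyzeForQuery("the tv show", syn, stop)); // [tv, television, show]
    }
}
```

Expanding synonyms only on the query side keeps the index small, at the cost of slightly larger queries.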
  • 55. Analysis & Facets Make sure to use an untokenized field for faceting. "San Jose" != "San" "Jose"
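The reason for the untokenized field is easiest to see by counting. This sketch counts facet values two ways over a made-up list of city values: once treating each value as a single term, and once splitting on whitespace first. It only models the counting, not Solr's faceting code, and the class name is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: facet counts over whole values versus whitespace-split tokens.
final class FacetCountSketch {

    // One term per document value, as with an untokenized (string) field.
    static Map<String, Integer> facetUntokenized(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) counts.merge(v, 1, Integer::sum);
        return counts;
    }

    // One term per whitespace-separated word, as with a tokenized text field.
    static Map<String, Integer> facetTokenized(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            for (String token : v.trim().split("\\s+")) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> cities = List.of("San Jose", "San Francisco", "San Jose"); // sample data
        System.out.println(facetUntokenized(cities)); // {San Francisco=1, San Jose=2}
        System.out.println(facetTokenized(cities));   // {Francisco=1, Jose=2, San=3}
    }
}
```

The tokenized counts produce a "San" facet that no user wants, which is the problem the slide warns about.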
  • 56. Fields Elements of a document Both predefined & dynamic Fields may occur multiple times May be indexed and/or stored
  • 57. Example Fields <field name="id" type="string" indexed="true" stored="true" required="true" /><field name="name" type="text" indexed="true" stored="true"/><field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
  • 58. Copy Fields Two main uses To analyze a field in two different ways To concatenate fields
  • 59. The Solr Config File solrconfig.xml Defines request handlers, defaults, caches, etc. Read the example solrconfig.xml
  • 60. Configuring DisMax Parameter defaults set in solrconfig.xml Can be overridden in each request Except for params labeled invariant
  • 61. DisMax Config Example <requestHandler name="dismax" class="solr.DisMaxRequestHandler" > <lst name="defaults"> <str name="echoParams">explicit</str> <float name="tie">0.01</float> <str name="qf"> text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 </str> <str name="pf"> text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9 </str> ... </requestHandler>
  • 62. DisMax Config Example <requestHandler name="dismax" class="solr.DisMaxRequestHandler" > ... <str name="bf"> ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3 </str> <str name="fl"> id,name,price,score </str>... </requestHandler>
  • 63. DisMax Config Example <requestHandler name="dismax" class="solr.DisMaxRequestHandler" > ... <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str> <int name="ps">100</int> <str name="q.alt">*:*</str> </lst> </requestHandler>
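The mm ("minimum should match") value on slide 63 is the least obvious setting in the DisMax config. The sketch below is my reading of the documented semantics, written in plain Java; it is not Solr's own code, the class name is invented, and rounding in edge cases may differ from the real implementation. Each `n<value` pair means "if there are more than n optional clauses, use value", a negative value means "all but that many", and a percentage is rounded down.

```java
// Sketch of how a spec such as "2<-1 5<-2 6<90%" could be evaluated.
// MinShouldMatchSketch is a hypothetical name; the logic is an assumption based on the docs.
final class MinShouldMatchSketch {

    // Returns how many of the optional clauses must match for the given spec.
    static int minShouldMatch(int optionalClauses, String spec) {
        spec = spec.trim();
        if (spec.contains("<")) {
            int result = optionalClauses; // at or below the first bound, every clause is required
            for (String part : spec.split("\\s+")) {
                String[] pair = part.split("<");
                int upperBound = Integer.parseInt(pair[0]);
                if (optionalClauses <= upperBound) return result;
                result = minShouldMatch(optionalClauses, pair[1]);
            }
            return result;
        }
        int calc;
        if (spec.endsWith("%")) {
            int percent = Integer.parseInt(spec.substring(0, spec.length() - 1));
            calc = optionalClauses * percent / 100; // integer division truncates
        } else {
            calc = Integer.parseInt(spec);
        }
        // Negative values mean "all but this many".
        int required = calc < 0 ? optionalClauses + calc : calc;
        return Math.max(0, Math.min(optionalClauses, required));
    }

    public static void main(String[] args) {
        String mm = "2<-1 5<-2 6<90%"; // unescaped form of the value in the config
        for (int n : new int[] {1, 2, 3, 5, 6, 7, 10}) {
            System.out.println(n + " clauses -> " + minShouldMatch(n, mm) + " required");
        }
    }
}
```

Under this reading, queries of one or two words need every word to match, three to five words may miss one, six may miss two, and longer queries need 90% of their words.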
  • 65. Resources Solr http://lucene.apache.org/solr/ wiki, mailing list, jira (bugs/features) Lucene http://lucene.apache.org/
  • 67. Building Search Applications with Lucene, LingPipe and Gate Manu Konchady
  • 68. Other Presentations Yonik Seeley's Solr & Lucene http://people.apache.org/~yonik/presentations/ Slideshare.net Search for solr, or search for lucene
  • 69. Thanks! Thanks for coming. Feel free to email me if you have questions about Solr Tom Hill [email_address]
  • 70. Extra Slides Things I didn't have time for in the presentation. Some of them unfinished.
  • 71. Search Engines are not the Same as Users Search engines have different usage patterns than users
  • 72. Response Writers http://localhost:8983/solr/select/?q=text_t%3Atiger&version=2.2&start=0&rows=10&indent=on&wt=ruby http://localhost:8983/solr/select/?q=text_t%3Atiger&version=2.2&start=0&rows=10&indent=on&wt=xml
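The two URLs on slide 72 differ only in the wt parameter, which selects the response writer. A small sketch of building such a URL in Java follows. The helper name and its parameter list are made up, the base URL is assumed to be the example server, and it needs Java 10 or later for the Charset overload of URLEncoder.encode.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: assemble a select URL like the ones on slide 72. SelectUrlSketch is a made-up name.
final class SelectUrlSketch {

    // wt picks the response writer, for example "xml" or "ruby".
    static String selectUrl(String baseUrl, String query, int start, int rows, String wt) {
        return baseUrl + "/select/?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&version=2.2"
                + "&start=" + start
                + "&rows=" + rows
                + "&indent=on"
                + "&wt=" + URLEncoder.encode(wt, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr"; // assumed example server address
        System.out.println(selectUrl(base, "text_t:tiger", 0, 10, "ruby"));
        System.out.println(selectUrl(base, "text_t:tiger", 0, 10, "xml"));
    }
}
```

Encoding the query is what turns `text_t:tiger` into the `text_t%3Atiger` seen in the slide.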
  • 73. Explain Just why did the documents come up in that order?
  • 74. Data Matters GIGO (garbage in, garbage out) The better the data is, the better the search will be.
  • 75. Watch Your Caches Just like any other app, check your statistics What's the hit rate for your caches?
  • 76. Setting Up Replication Run rsyncd on the master Run snapshooter on the master at intervals Run snappuller on the slaves at (different) intervals. Scripts don't print errors! Check the logs Use bash -xv
  • 77. Autowarming Runs after an update to the index Updates flush caches Runs some queries to populate caches again Can be a problem, with frequent updates Don't autowarm master, if updating lots
  • 78. Tour Of Solr's Web UI
  • 80. Geographic Searching Local Lucene & Local Solr http://locallucene.wiki.sourceforge.net There's also geolucene, but it's not being actively developed, as far as I can tell. http://www.gossamer-threads.com/lists/lucene/java-dev/53378