Apache Lucene adds full-text search capability to our applications. Originally written in Java, it has been ported to many other programming languages as well. Thanks to its powerful indexing and searching capabilities, Lucene is the backbone of many enterprise and open-source projects such as Elasticsearch and Apache Solr, and crawling platforms such as Apache Nutch.
In this tutorial, we will learn the basic concepts of Apache Lucene and how to index and search documents with it.
1. Core Concepts and Terminology
Before diving into Lucene, it is essential to understand some core concepts and terminology:
1.1. Document and Field
A document is the central unit of data in Lucene, similar to a record in a database. It consists of multiple fields, and each field holds a specific piece of data. For example, a document representing a book can have fields for the title, author, and content.
Each field in a document can be stored, indexed, or both. Storing a field means its original value is retrievable in search results, while indexing means it can be searched.
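As a quick illustration, here is a minimal sketch (the class name and sample values are our own, assuming lucene-core is on the classpath) showing the common field configurations:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class BookDocumentExample {

    public static Document buildBook() {
        Document book = new Document();
        // Indexed as one exact token AND stored: searchable and retrievable.
        book.add(new StringField("isbn", "978-0321356680", Field.Store.YES));
        // Analyzed, indexed AND stored: full-text searchable and retrievable.
        book.add(new TextField("title", "Effective Java", Field.Store.YES));
        // Analyzed and indexed but NOT stored: searchable, but doc.get("content")
        // returns null for documents fetched from search results.
        book.add(new TextField("content", "A book about Java best practices", Field.Store.NO));
        // Stored only, never indexed: retrievable but not searchable.
        book.add(new StoredField("price", 45.99));
        return book;
    }

    public static void main(String[] args) {
        Document book = buildBook();
        System.out.println(book.getField("title").stringValue());
    }
}
```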
1.2. Index
An index is a data structure that enables fast and efficient searching. Lucene indexes documents by creating an inverted index, i.e., a mapping from terms to the documents that contain them. This is similar to the index at the back of a book, which lets us quickly find the pages that mention a specific topic.
When a document is indexed, an analyzer breaks it down into tokens (typically words), and each token is added to the inverted index. Each entry in the inverted index maps a term to the list of documents that contain it (its posting list).
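The idea can be sketched in a few lines of plain Java (a toy illustration of the concept, not Lucene's actual on-disk structure): tokenize each document and map every term to the IDs of the documents containing it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class ToyInvertedIndex {

    private final Map<String, SortedSet<Integer>> index = new HashMap<>();

    // Tokenize on whitespace, lowercase, and record which document each term appears in.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the posting list for a term.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "Hello World from Apache Lucene");
        idx.add(2, "Lucene is a powerful search library");
        System.out.println(idx.search("lucene")); // prints [1, 2]
    }
}
```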
1.3. Analyzer
An analyzer processes text to transform it into a form that Lucene can index. This involves tokenization (breaking text into tokens or words), stop word removal, and other text processing. Lucene ships with several built-in analyzers:
| Analyzer Class | Description |
|---|---|
| StandardAnalyzer | The default, general-purpose analyzer. It tokenizes on Unicode word boundaries and lowercases tokens; a stop word set can optionally be supplied. |
| KeywordAnalyzer | Treats the entire text as a single token, useful for keyword or exact-match searches. |
| WhitespaceAnalyzer | Splits text into tokens on whitespace, without any further processing. |
| SimpleAnalyzer | Splits text into tokens on non-letter characters and lowercases them. |
| StopAnalyzer | Splits on non-letter characters, lowercases, and removes a supplied list of stop words. |
| EnglishAnalyzer | Optimized for the English language, including stemming and stop word removal. |
| NGramTokenizer / NGramTokenFilter | Break text into overlapping character sequences (n-grams), typically wired into a CustomAnalyzer. Useful for auto-completion and spelling correction. |
| PatternAnalyzer | Uses a regular expression to split text into tokens. |
| CustomAnalyzer | Allows building custom analyzers by combining tokenizers and filters. |
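To see what an analyzer actually emits, the following sketch (assuming lucene-core and lucene-analysis-common on the classpath; the class and field names are illustrative) collects the tokens produced by StandardAnalyzer and KeywordAnalyzer for the same input:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerComparison {

    // Run the analyzer over the text and collect the emitted tokens.
    public static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "Hello World from Apache Lucene";
        // StandardAnalyzer tokenizes and lowercases (no stop words by default in Lucene 9):
        System.out.println(tokenize(new StandardAnalyzer(), text)); // [hello, world, from, apache, lucene]
        // KeywordAnalyzer keeps the whole input as one token:
        System.out.println(tokenize(new KeywordAnalyzer(), text)); // [Hello World from Apache Lucene]
    }
}
```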
1.4. Query
A query is a request to search the index. Lucene supports the following commonly used query types; additional types are available in the org.apache.lucene.search package:
| Query Class | Description |
|---|---|
| TermQuery | Searches for documents containing a specific term. |
| BooleanQuery | Combines multiple queries with boolean clauses (MUST, SHOULD, MUST_NOT), the equivalents of AND, OR, and NOT. |
| PhraseQuery | Searches for a sequence of terms (a phrase) in the documents. |
| PrefixQuery | Searches for documents containing terms that start with a specified prefix. |
| WildcardQuery | Searches for terms matching a pattern with wildcards (? matches a single character, * matches zero or more characters). |
| FuzzyQuery | Searches for terms similar to a specified term, based on Levenshtein (edit) distance. |
| TermRangeQuery | Searches for documents containing terms within a specified range; for numeric fields, use the point-based range queries such as IntPoint.newRangeQuery(). |
| MultiPhraseQuery | A generalization of PhraseQuery that allows multiple alternative terms at each position, e.g. for synonym matching. |
| FunctionQuery | Scores documents using a function of field values (in the lucene-queries module). |
| RegexpQuery | Searches for documents containing terms that match a regular expression. |
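Queries can also be built programmatically instead of going through a QueryParser. A small sketch (class and field names are illustrative) combining a few of the types above:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {

    public static Query buildQuery() {
        // Match documents whose "content" field contains the exact term "lucene".
        Query term = new TermQuery(new Term("content", "lucene"));
        // Match terms starting with "sear" (search, searching, ...).
        Query prefix = new PrefixQuery(new Term("content", "sear"));
        // Match terms within edit distance 2 of "lucen" (e.g. "lucene").
        Query fuzzy = new FuzzyQuery(new Term("content", "lucen"));
        // Combine: content MUST contain "lucene"; the other clauses boost scoring.
        return new BooleanQuery.Builder()
                .add(term, BooleanClause.Occur.MUST)
                .add(prefix, BooleanClause.Occur.SHOULD)
                .add(fuzzy, BooleanClause.Occur.SHOULD)
                .build();
    }

    public static void main(String[] args) {
        // e.g. +content:lucene content:sear* content:lucen~2
        System.out.println(buildQuery());
    }
}
```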
1.5. Directory
A directory is an abstraction over the storage used by the index. Lucene provides several directory implementations, each suited to a different storage and access pattern. Let’s explore them:
| Directory Class | Description |
|---|---|
| FSDirectory | Stores the index in the file system. The static FSDirectory.open() method picks the best implementation for the current platform. |
| MMapDirectory | Uses memory-mapped files for storage, providing fast read access to the index. |
| NIOFSDirectory | Uses the java.nio FileChannel API for file access and supports concurrent reads. |
| SimpleFSDirectory | Uses simple file I/O without java.nio. (Removed in Lucene 9.) |
| RAMDirectory | Stores the index entirely in heap memory. (Deprecated in Lucene 8 and removed in Lucene 9 in favor of ByteBuffersDirectory.) |
| ByteBuffersDirectory | Stores the index in memory using ByteBuffer instances, providing an efficient in-memory index. |
| FileSwitchDirectory | Combines two directories, typically to store some file types in one directory and the rest in another. |
| TrackingDirectoryWrapper | A wrapper that tracks which files were written, useful for replication or backup purposes. |
| RateLimitedDirectoryWrapper | A wrapper that adds rate limiting to another Directory instance. |
| WindowsDirectory | Optimized for use with Windows file systems, available in Lucene.Net. |
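Switching between storage backends only changes how the Directory is created; the rest of the indexing code stays the same. A brief sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryExamples {

    public static void main(String[] args) throws IOException {
        // In-memory index: fast, but lost when the JVM exits. Good for tests.
        try (Directory inMemory = new ByteBuffersDirectory()) {
            System.out.println(inMemory.getClass().getSimpleName());
        }
        // On-disk index: FSDirectory.open() picks a suitable implementation
        // (usually MMapDirectory) for the current platform.
        Path indexPath = Files.createTempDirectory("lucene-index");
        try (Directory onDisk = FSDirectory.open(indexPath)) {
            System.out.println(onDisk.getClass().getSimpleName());
        }
    }
}
```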
2. Maven
Start by adding the Lucene dependencies. This tutorial uses Lucene 9.10.0 and Java 21.
<properties>
<maven.compiler.source>21</maven.compiler.source>
<maven.compiler.target>21</maven.compiler.target>
<lucene.version>9.10.0</lucene.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analysis-common</artifactId>
    <version>${lucene.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
  </dependency>
</dependencies>
3. Lucene Hello World Example with In-Memory Index Store
Let’s create a simple “Hello World” example using Lucene with an in-memory index store (ByteBuffersDirectory) and StandardAnalyzer.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import java.io.IOException;
public class LuceneHelloWorld {

  public static void main(String[] args) throws IOException, ParseException {
    ByteBuffersDirectory directory = new ByteBuffersDirectory();
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, config);

    // Create and add documents
    Document doc1 = new Document();
    doc1.add(new StringField("id", "1", Field.Store.YES));
    doc1.add(new TextField("content", "Hello World from Apache Lucene", Field.Store.YES));
    indexWriter.addDocument(doc1);

    Document doc2 = new Document();
    doc2.add(new StringField("id", "2", Field.Store.YES));
    doc2.add(new TextField("content", "Lucene is a powerful search library", Field.Store.YES));
    indexWriter.addDocument(doc2);
    indexWriter.close();

    String searchText = "Lucene";

    // Create an IndexSearcher
    DirectoryReader directoryReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(directoryReader);

    // Parse a query
    QueryParser queryParser = new QueryParser("content", analyzer);
    Query query = queryParser.parse(searchText);

    // Search the index
    TopDocs topDocs = indexSearcher.search(query, 10);

    // Display the results
    System.out.println("Found " + topDocs.totalHits.value + " hits.");
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
      int docId = topDocs.scoreDocs[i].doc;
      Document doc = indexSearcher.doc(docId);
      String id = doc.get("id");
      String content = doc.get("content");
      System.out.println("Doc ID: " + id);
      System.out.println("Content: " + content);
    }
    directoryReader.close();
    directory.close();
  }
}
Let’s understand the above program:
- We use ByteBuffersDirectory to create an in-memory index and StandardAnalyzer to process the text data.
- The IndexWriter has been configured with the analyzer and the directory. We create documents with fields and add them to the index using IndexWriter.
- To search the documents, we use an IndexSearcher opened over a DirectoryReader. A QueryParser parses the supplied search string into a Query object.
- The search results are returned as a TopDocs result set. We iterate over topDocs.scoreDocs to access the matches and retrieve the indexed documents they point to.
When we run the program, the output is:
Found 2 hits.
Doc ID: 1
Content: Hello World from Apache Lucene
Doc ID: 2
Content: Lucene is a powerful search library
4. Conclusion
In this Lucene tutorial, we learned the basic concepts and key terminology of Apache Lucene, along with the main classes used for indexing documents and searching the stored indexes.
We also walked through a simple hello world application. You are encouraged to experiment with the other programs listed in the Lucene examples to get a better understanding of the library.
Happy Learning !!