PgConf EU 2014 presents 
Javier Ramirez 
* in * 
PostgreSQL 
Full-text search 
demystified 
@supercoco9 
https://teowaki.com
The problem
our architecture
One does not simply 
SELECT * from stuff where 
content ilike '%postgresql%'
Basic search features 
* stemmers (run, runner, running) 
* unaccented (josé, jose) 
* results highlighting 
* rank results by relevance
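A minimal sketch of why the stemmer matters, assuming a stock PostgreSQL install with the built-in 'english' configuration (the sample sentence is made up for illustration). ILIKE only finds the literal substring, while full-text search compares stems:

```sql
-- 'runs' and 'running' share the stem 'run', so the full-text match succeeds
-- even though the literal substring 'running' is not in the text.
SELECT 'She runs every morning' ILIKE '%running%'  AS ilike_match,  -- false
       to_tsvector('english', 'She runs every morning')
         @@ to_tsquery('english', 'running')       AS fts_match;    -- true
```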
Nice to have features 
* partial searches 
* search operators (OR, AND...) 
* synonyms (postgres, postgresql, pgsql) 
* thesaurus (OS=Operating System) 
* fast, and space-efficient 
* debugging
Good News: 
PostgreSQL supports all 
the requested features
Bad News: 
unless you already know about search 
engines, the official docs are not obvious
How a search engine works 
* An indexing phase 
* A search phase
The indexing phase 
Convert the input text to tokens
The search phase 
Match the search terms to 
the indexed tokens
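The two phases can be tried directly in psql. This is only a sketch, assuming the built-in 'english' configuration and an invented sample sentence:

```sql
-- Indexing phase: the text becomes a sorted list of tokens with positions.
SELECT to_tsvector('english', 'The runners were running fast') AS indexed_tokens;

-- Search phase: the search term goes through the same analysis...
SELECT to_tsquery('english', 'running') AS search_tokens;

-- ...and is then matched against the indexed tokens with the @@ operator.
SELECT to_tsvector('english', 'The runners were running fast')
         @@ to_tsquery('english', 'running') AS matches;
```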
indexing in depth 
* choose an index format 
* tokenize the words 
* apply token analysis/filters 
* discard unwanted tokens
the index format 
* r-tree (GiST in PostgreSQL) 
* inverted indexes (GIN in PostgreSQL) 
* dynamic/distributed indexes
dynamic indexes: segmentation 
* sometimes the token index is 
segmented to allow faster updates 
* consolidate segments to speed up 
search and account for deletions
tokenizing 
* parse/strip/convert format 
* normalize terms (unaccent, ascii, 
charsets, case folding, number precision...)
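To see the tokenizing step on its own in PostgreSQL, the built-in 'default' parser can be called directly. A small sketch with an invented sample string:

```sql
-- ts_parse splits the text into raw tokens; ts_token_type names each token class.
SELECT t.alias, p.token
FROM ts_parse('default', 'Full-text search in PostgreSQL 9.4') AS p
JOIN ts_token_type('default') AS t USING (tokid)
WHERE t.alias <> 'blank';

-- Case folding happens afterwards, in the dictionaries:
-- even the 'simple' configuration lowercases every token.
SELECT to_tsvector('simple', 'PostgreSQL');
```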
token analysis/filters 
* find synonyms 
* expand thesaurus 
* stem (maybe in different languages)
more token analysis/filters 
* eliminate stopwords 
* store word distance/frequency 
* store the full contents of some fields 
* store some fields as attributes/facets
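Stemming and stopword removal can be tested one token at a time. This sketch assumes the built-in english_stem dictionary; the sample words are arbitrary:

```sql
-- A stemmed token comes back as a one-element array,
-- a stopword comes back as an empty array.
SELECT ts_lexize('english_stem', 'running') AS stemmed,   -- {run}
       ts_lexize('english_stem', 'the')     AS stopword;  -- {}

-- Word positions are stored next to each token; the stopword 'the'
-- is dropped but still counts as position 1.
SELECT to_tsvector('english', 'the quick brown fox');
```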
“the index file” is really 
* a token file, probably segmented/distributed 
* some dictionary files: synonyms, thesaurus, 
stopwords, stems/lexemes (in different languages) 
* word distance/frequency info 
* attributes/original field files 
* optional geospatial index 
* auxiliary files: word/sentence boundaries, meta-info, 
parser definitions, datasource definitions...
the hardest 
part is now 
over
searching in depth 
* tokenize/analyse 
* prepare operators 
* retrieve information 
* rank the results 
* highlight the matched parts
searching in depth: tokenize 
normalize, tokenize, and analyse 
the original search term 
the result would be a tokenized, stemmed, 
“synonymised” term, without stopwords
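In PostgreSQL terms, this step is what the query functions do. A sketch assuming the built-in 'english' configuration and a made-up search phrase:

```sql
-- plainto_tsquery takes free text: the stopword 'the' is dropped,
-- the remaining words are stemmed and combined with AND (&).
SELECT plainto_tsquery('english', 'The running databases') AS analysed_query;

-- to_tsquery does the same analysis but expects explicit operators.
SELECT to_tsquery('english', 'running & databases') AS analysed_query;
```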
searching in depth: operators 
* partial search 
* logical/geospatial/range operators 
* in-sentence/in-paragraph/word distance 
* faceting/grouping
searching in depth: retrieval 
Go through the token index files, use the 
attributes and geospatial files if necessary 
for operators and/or grouping 
You might need to do this in a distributed way
searching in depth: ranking 
algorithm to sort the most relevant results: 
* field weights 
* word frequency/density 
* geospatial or timestamp ranking 
* ad-hoc ranking strategies
searching in depth: highlighting 
Mark the matching parts of the results 
It can be tricky/slow if you are not storing the full contents 
in your indexes
PostgreSQL as a 
full-text 
search engine
search features 
* index format configuration 
* partial search 
* word boundaries parser (not configurable) 
* stemmers/synonyms/thesaurus/stopwords 
* full-text logical operators 
* attributes/geo/timestamp/range (using SQL) 
* ranking strategies 
* highlighting 
* debugging/testing commands
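"Using SQL" for attributes and ranges simply means adding ordinary predicates next to the full-text match. A self-contained sketch, where the docs rows, column names and dates are invented for illustration:

```sql
WITH docs(id, created_at, body) AS (
  VALUES (1, DATE '2014-01-10', 'PostgreSQL full-text search'),
         (2, DATE '2014-09-01', 'Searching text in PostgreSQL'),
         (3, DATE '2014-09-15', 'Cooking with tomatoes')
)
SELECT id, body
FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'search & postgresql')
  AND created_at >= DATE '2014-06-01';  -- plain SQL range filter; only id 2 qualifies
```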
indexing in postgresql 
you don't actually need an index to use full-text search in PostgreSQL 
but unless your db is very small, you want to have one 
Choose GiST or GIN (GIN: faster search, slower indexing, 
larger index size) 
CREATE INDEX pgweb_idx ON pgweb USING 
gin(to_tsvector(config_name, body));
Two new things 
CREATE INDEX ... USING gin(to_tsvector (config_name, body)); 
* to_tsvector: postgresql way of saying “tokenize” 
* config_name: tokenizing/analysis rule set
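A runnable variant of the statement above, as a sketch only: pgweb_demo is a throwaway temp table invented for the example, and 'english' stands in for config_name. In an expression index the configuration has to be written out explicitly, and queries must repeat the same expression for the index to be considered:

```sql
CREATE TEMP TABLE pgweb_demo (id serial PRIMARY KEY, body text);

-- Swap "gin" for "gist" to try the other index type.
CREATE INDEX pgweb_demo_body_idx
  ON pgweb_demo USING gin (to_tsvector('english', body));

INSERT INTO pgweb_demo (body)
VALUES ('PostgreSQL full-text search'), ('Cooking with tomatoes');

-- Same expression as in the index definition. On a table this small
-- the planner may still prefer a sequential scan.
EXPLAIN
SELECT id FROM pgweb_demo
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'search');
```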
Configuration 
CREATE TEXT SEARCH CONFIGURATION 
public.teowaki ( COPY = pg_catalog.english );
Configuration 
CREATE TEXT SEARCH DICTIONARY english_ispell ( 
TEMPLATE = ispell, 
DictFile = en_us, 
AffFile = en_us, 
StopWords = spanglish 
); 
CREATE TEXT SEARCH DICTIONARY spanish_ispell ( 
TEMPLATE = ispell, 
DictFile = es_any, 
AffFile = es_any, 
StopWords = spanish 
);
Configuration 
CREATE TEXT SEARCH DICTIONARY english_stem ( 
TEMPLATE = snowball, 
Language = english, 
StopWords = english 
); 
CREATE TEXT SEARCH DICTIONARY spanish_stem ( 
TEMPLATE = snowball, 
Language = spanish, 
StopWords = spanish 
);
Configuration 
Parser. 
Word boundaries
Configuration 
Assign dictionaries (from most specific to most generic) 
ALTER TEXT SEARCH CONFIGURATION teowaki 
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, 
hword_part 
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem; 
ALTER TEXT SEARCH CONFIGURATION teowaki 
DROP MAPPING FOR email, url, url_path, sfloat, float;
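The ispell dictionaries above need dictionary files on the server, so here is a smaller sketch that runs on a stock install, assuming the unaccent contrib extension is available and you are allowed to create it. The name demo_unaccent is hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE TEXT SEARCH CONFIGURATION demo_unaccent ( COPY = pg_catalog.spanish );

-- unaccent is a filtering dictionary: it rewrites the token and passes it on,
-- so it goes first. The snowball stemmer accepts every token, so it goes last.
ALTER TEXT SEARCH CONFIGURATION demo_unaccent
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, spanish_stem;

-- Accented and unaccented spellings should now match each other.
SELECT to_tsvector('demo_unaccent', 'búsquedas')
         @@ to_tsquery('demo_unaccent', 'busquedas') AS matches;
```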
debugging 
select * from ts_debug('teowaki', 'I am searching unas búsquedas con postgresql database'); 
also ts_lexize and ts_parse
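The same check works against a built-in configuration if you have not created your own yet. A sketch using 'english' and an invented sentence:

```sql
-- For each token: its class, the dictionaries consulted,
-- and the lexemes produced (an empty array means it was a stopword).
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'I am searching databases');
```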
tokenizing 
tokens + position (stopwords are removed, tokens are folded)
searching 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres');
searching 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres:*');
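The :* suffix turns the term into a prefix match. A sketch with the built-in 'simple' configuration, which does no stemming, so the effect of the prefix is easy to see (the sample text is made up):

```sql
SELECT to_tsvector('simple', 'postgresql rocks')
         @@ to_tsquery('simple', 'postgres')   AS exact_match,   -- false
       to_tsvector('simple', 'postgresql rocks')
         @@ to_tsquery('simple', 'postgres:*') AS prefix_match;  -- true
```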
operators 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres | mysql');
ranking weights 
SELECT setweight(to_tsvector(coalesce(name,'')),'A') || 
setweight(to_tsvector(coalesce(description,'')),'B') 
from wakis limit 1;
search by weight
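Once tokens carry weight labels, a query term can be restricted to a label. A sketch under the assumption that 'A' holds the name and 'B' the description, as in the previous slide; the sample strings are invented:

```sql
WITH doc AS (
  SELECT setweight(to_tsvector('english', 'PostgreSQL tips'), 'A') ||
         setweight(to_tsvector('english', 'Notes about MySQL and indexes'), 'B') AS tsv
)
SELECT tsv @@ to_tsquery('english', 'mysql:A') AS in_name,         -- false
       tsv @@ to_tsquery('english', 'mysql:B') AS in_description   -- true
FROM doc;
```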
ranking 
SELECT name, ts_rank(to_tsvector(name), query) rank 
from wakis, to_tsquery('postgres | indexes') query 
where to_tsvector(name) @@ query order by rank DESC; 
also ts_rank_cd
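A side-by-side sketch of both ranking functions over weighted tokens. The two sample rows are invented, and the normalization flag 32 (which scales the rank to rank / (rank + 1)) is just one possible choice:

```sql
WITH docs(title, body) AS (
  VALUES ('Index tuning',   'How to search faster with the right index'),
         ('Cooking basics', 'An index of recipes you can search')
),
weighted AS (
  SELECT title,
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', body),  'B') AS tsv
  FROM docs
)
SELECT title,
       ts_rank(tsv, query)        AS rank,                -- frequency based
       ts_rank_cd(tsv, query, 32) AS cover_density_rank   -- also uses term proximity
FROM weighted, to_tsquery('english', 'index & search') AS query
WHERE tsv @@ query
ORDER BY cover_density_rank DESC;
```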
highlighting 
SELECT ts_headline(name, query) from wakis, 
to_tsquery('teowaki', 'game|play') query 
where to_tsvector('teowaki', name) @@ query;
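ts_headline also accepts a configuration and an options string. A sketch with the built-in 'english' configuration; the sample sentence and the choice of mark tags are assumptions for illustration:

```sql
SELECT ts_headline(
         'english',
         'PostgreSQL ships with full-text search built in',
         to_tsquery('english', 'search'),
         'StartSel=<mark>, StopSel=</mark>, MaxWords=10, MinWords=5'
       ) AS highlighted;
```

ts_headline works on the original text rather than on the index, so it is best applied only to the rows you actually display.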
USE POSTGRESQL 
FOR EVERYTHING
When PostgreSQL is not good 
* You need to index files (PDF, Odx...) 
* Your index is very big (slow reindex) 
* You need a distributed index 
* You need complex tokenizers 
* You need advanced rankers
When PostgreSQL is not good 
* You want a REST API 
* You want sentence/proximity/range/ 
more complex operators 
* You want search auto completion 
* You want advanced features (alerts...)
But it has been 
perfect for us so far. 
Our users don't care 
which search engine 
we use, as long as 
it works.
