Search @twitter 
Michael Busch 
@michibusch 
michael@twitter.com 
buschmi@apache.org
Search @twitter 
Agenda 
‣ Introduction 
- Search Architecture 
- Lucene Extensions 
- Outlook
Search at Twitter: Presented by Michael Busch, Twitter
Introduction
Introduction 
Twitter has more than 284 million 
monthly active users.
Introduction 
500 million tweets are sent per day.
Introduction 
More than 300 billion tweets have been 
sent since company founding in 2006.
Introduction 
Tweets-per-second record: 
one-second peak of 143,199 TPS.
Introduction 
More than 2 billion search queries per 
day.
Search @twitter 
Agenda 
- Introduction 
‣ Search Architecture 
- Lucene Extensions 
- Outlook
Search Architecture
Search Architecture 
[Diagram. Components: RT stream, Analyzer/Partitioner, RT index (Earlybird), Blender, Archive index, Mapreduce Analyzer, Tweet archive, HDFS. Edge labels: raw tweets, analyzed tweets, writes, searches, search requests.]
Search Architecture 
[Diagram. Components: Tweets, Analyzer/Partitioner, RT index (Earlybird), Blender, Archive index, queue, HDFS, Mapreduce Analyzer. Edge labels: Updates; Deletes/Engagement (e.g. retweets/favs); writes; searches; search requests.]
Search Architecture 
[Diagram. Components: RT index (Earlybird), Social graph, Blender, Archive index, User search. Edge labels: writes, searches, search requests.]
• Blender is our Thrift service aggregator 
• Queries multiple Earlybirds, merges results
Search Architecture 
RT index 
(Earlybird) 
Archive 
index 
User 
search
Search Architecture 
RT index 
(Earlybird) 
Archive 
index 
User 
search 
Lucene 
• For historic reasons, these used to be entirely different codebases, 
but had similar features/technologies 
• Over time cross-dependencies were introduced to share code
Search Architecture 
RT index 
(Earlybird) 
Archive 
index 
User 
search 
Lucene 
Extensions 
Lucene 
• New Lucene extension package 
• This package is truly generic and 
has no dependency on an actual 
product/index 
• It contains Twitter’s extensions for 
real-time search, a thin segment 
management layer and other 
features
Search @twitter 
Agenda 
- Introduction 
- Search Architecture 
‣ Lucene Extensions 
- Outlook
Lucene Extensions
Lucene Extension Library 
• Abstraction layer for Lucene index segments 
• Real-time writer for in-memory index segments 
• Schema-based Lucene document factory 
• Real-time faceting
Lucene Extension Library 
• API layer for Lucene segments 
• *IndexSegmentWriter 
• *IndexSegmentAtomicReader 
• Two implementations 
• In-memory: RealtimeIndexSegmentWriter (and reader) 
• On-disk: LuceneIndexSegmentWriter (and reader)
Lucene Extension Library 
• IndexSegments can be built ... 
• in realtime 
• on Mesos or Hadoop (Mapreduce) 
• locally on serving machines 
• Cluster-management code that deals with IndexSegments 
• Share segments across serving machines using HDFS 
• Can rebuild segments (e.g. to upgrade Lucene version, change data 
schema, etc.)
Lucene Extension Library 
[Diagram. Components: HDFS, Earlybird (three stacked instances), Mesos, Hadoop (MR), RT pipeline.]
RealtimeIndexSegmentWriter 
• Modified Lucene index implementation optimized for realtime search 
• IndexWriter buffer is searchable (no need to flush to allow searching) 
• In-memory 
• Lock-free concurrency model for best performance
Concurrency - Definitions 
• Pessimistic locking 
• A thread holds an exclusive lock on a resource while an action is 
performed [mutual exclusion] 
• Usually used when conflicts are expected to be likely 
• Optimistic locking 
• Operations are attempted atomically without holding a lock; conflicts 
can be detected, and retry logic is often used when they occur 
• Usually used when conflicts are expected to be the exception
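The two definitions above can be contrasted with a minimal sketch. This is an illustration only, not code from the talk: the class and method names (`LockingStyles`, `PessimisticCounter`, `OptimisticCounter`, `run`) are hypothetical. The pessimistic counter takes an exclusive lock for every increment; the optimistic counter uses `AtomicLong.compareAndSet` and retries when it detects a conflict.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Two counters illustrating pessimistic vs. optimistic locking (hypothetical names). */
class LockingStyles {

    /** Pessimistic: every increment holds an exclusive lock (mutual exclusion). */
    static final class PessimisticCounter {
        private long value;
        synchronized void increment() { value++; }
        synchronized long get() { return value; }
    }

    /** Optimistic: read, compute, compare-and-set; retry if another thread got there first. */
    static final class OptimisticCounter {
        private final AtomicLong value = new AtomicLong();

        void increment() {
            while (true) {
                long seen = value.get();
                if (value.compareAndSet(seen, seen + 1)) {
                    return; // no conflict detected
                }
                // Conflict: the value changed since we read it, so retry.
            }
        }

        long get() { return value.get(); }
    }

    /** Starts `threads` threads that each increment both counters `perThread` times. */
    static long[] run(int threads, int perThread) {
        PessimisticCounter pessimistic = new PessimisticCounter();
        OptimisticCounter optimistic = new OptimisticCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    pessimistic.increment();
                    optimistic.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new long[] { pessimistic.get(), optimistic.get() };
    }

    public static void main(String[] args) {
        long[] totals = run(4, 10_000);
        System.out.println(totals[0] + " " + totals[1]);
    }
}
```

Both counters end at the same total; the difference is that the optimistic one never blocks a thread, it only repeats work when a conflict actually happens.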
Concurrency - Definitions 
• Non-blocking algorithm 
Ensures that threads competing for shared resources do not have their 
execution indefinitely postponed by mutual exclusion. 
• Lock-free algorithm 
A non-blocking algorithm is lock-free if there is guaranteed system-wide 
progress. 
• Wait-free algorithm 
A non-blocking algorithm is wait-free if there is guaranteed per-thread 
progress. 
* Source: Wikipedia
Concurrency 
• Having a single writer thread simplifies our problem: no locks have to be used 
to protect data structures from corruption (only one thread modifies data) 
• But: we have to make sure that all readers always see a consistent state of 
all data structures -> this is much harder than it sounds! 
• In Java, it is not guaranteed that one thread will see changes that another 
thread makes in program execution order, unless the same memory barrier is 
crossed by both threads -> safe publication 
• Safe publication can be achieved in different, subtle ways. Read the great 
book “Java Concurrency in Practice” by Brian Goetz for more information!
Java Memory Model 
• Program order rule 
Each action in a thread happens-before every action in that thread that comes 
later in the program order. 
• Volatile variable rule 
A write to a volatile field happens-before every subsequent read of that same 
field. 
• Transitivity 
If A happens-before B, and B happens-before C, then A happens-before C. 
* Source: Brian Goetz: Java Concurrency in Practice
Concurrency 
RAM 0 
int x; 
Cache 
Thread 1 Thread 2 
time
Concurrency 
Cache 5 
RAM 0 
int x; 
Thread 1 Thread 2 
x = 5; 
Thread 1 writes x=5 to cache 
time
Concurrency 
Cache 5 
RAM 0 
int x; 
Thread 1 Thread 2 
x = 5; 
time while(x != 5); 
This condition will likely 
never become false!
Concurrency 
RAM 0 
int x; 
Thread 1 writes b=1 to RAM, 
because b is volatile 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1;
Concurrency 
RAM 0 
int x; 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1; 
Read volatile b 
int dummy = b; 
while(x != 5);
Concurrency 
RAM 0 
int x; 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1; 
int dummy = b; 
while(x != 5); 
happens-before 
• Program order rule: Each action in a thread happens-before every action in 
that thread that comes later in the program order.
Concurrency 
RAM 0 
int x; 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1; 
int dummy = b; 
while(x != 5); 
happens-before 
• Volatile variable rule: A write to a volatile field happens-before every 
subsequent read of that same field.
Concurrency 
RAM 0 
int x; 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1; 
int dummy = b; 
while(x != 5); 
happens-before 
• Transitivity: If A happens-before B, and B happens-before C, then A 
happens-before C.
Concurrency 
RAM 0 
int x; 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1; 
int dummy = b; 
while(x != 5); 
This condition will be 
false, i.e. x==5 
• Note: x itself doesn’t have to be volatile. There can be many variables like x, 
but we need only a single volatile field.
Concurrency 
RAM 0 
int x; 
5 x = 5; 
1 
Cache 
Thread 1 Thread 2 
time 
volatile int b; 
b = 1; 
int dummy = b; 
while(x != 5); 
Memory barrier 
• Note: x itself doesn’t have to be volatile. There can be many variables like x, 
but we need only a single volatile field.
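The happens-before chain from the preceding slides can be sketched as a small runnable program. This is a variation on the slide's snippet, not the demo from the talk: the class name `VolatilePiggyback` and the method `demo` are hypothetical, and the reader spins on the volatile field `b` (rather than reading it once) so that its volatile read is certain to observe the writer's volatile write. Once the reader has seen `b == 1`, the program order rule, the volatile variable rule and transitivity together guarantee that it also sees `x == 5`, although `x` is a plain field.

```java
/** Piggybacking a plain write on a volatile write (hypothetical class name). */
class VolatilePiggyback {
    static int x;           // plain field, deliberately not volatile
    static volatile int b;  // the single volatile field

    /** Returns the value of x that the reader thread sees after observing b == 1. */
    static int demo() {
        x = 0;
        b = 0;
        final int[] observed = new int[1];

        Thread reader = new Thread(() -> {
            while (b != 1) {
                // Spin on the volatile read until the writer's volatile write is visible.
            }
            observed[0] = x; // plain read, ordered after the volatile read
        });
        Thread writer = new Thread(() -> {
            x = 5; // plain write, ordered before the volatile write by program order
            b = 1; // volatile write
        });

        reader.start();
        writer.start();
        joinQuietly(reader);
        joinQuietly(writer);
        return observed[0];
    }

    private static void joinQuietly(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If `b` were not volatile, the Java Memory Model would give no such guarantee: the reader could spin forever or observe a stale `x`.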
Demo
Concurrency 
IndexWriter IndexReader 
time 
write 100 docs 
maxDoc = 100 
in IR.open(): read maxDoc 
search up to maxDoc 
write more docs 
maxDoc is volatile
Concurrency 
IndexWriter IndexReader 
time 
write 100 docs 
maxDoc = 100 
in IR.open(): read maxDoc 
search up to maxDoc 
write more docs 
maxDoc is volatile 
happens-before 
• Only maxDoc is volatile. All other fields that IW writes to and IR reads from 
don’t need to be!
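A minimal sketch of the maxDoc pattern, under stated assumptions: one writer thread, a fixed-capacity plain array standing in for the index data, and hypothetical names (`SingleWriterSegment`, `addDocument`, `totalLength`). It is not Twitter's or Lucene's actual code. The writer fills a slot with plain writes and only then advances the volatile `maxDoc`; a reader snapshots `maxDoc` once and never touches documents at or beyond that snapshot.

```java
/** Single-writer, multi-reader segment published through one volatile field (hypothetical). */
class SingleWriterSegment {
    private final int[] docLengths;  // plain array, written only by the writer thread
    private volatile int maxDoc;     // the only volatile field

    SingleWriterSegment(int capacity) {
        docLengths = new int[capacity];
    }

    /** Writer thread only: fill the slot first, then publish it. */
    void addDocument(int length) {
        int docId = maxDoc;           // only the writer changes maxDoc, so this is current
        docLengths[docId] = length;   // plain write
        maxDoc = docId + 1;           // volatile write publishes everything written above
    }

    /** Any reader thread: snapshot maxDoc once, then search only below the snapshot. */
    long totalLength() {
        int limit = maxDoc;           // volatile read
        long sum = 0;
        for (int docId = 0; docId < limit; docId++) {
            sum += docLengths[docId]; // plain reads, safe for docId < limit
        }
        return sum;
    }

    int maxDoc() {
        return maxDoc;
    }

    public static void main(String[] args) {
        SingleWriterSegment segment = new SingleWriterSegment(8);
        segment.addDocument(3);
        segment.addDocument(4);
        System.out.println(segment.maxDoc() + " docs, total length " + segment.totalLength());
    }
}
```

Documents added after a reader took its snapshot are simply invisible to that reader, which is what gives each search a consistent view.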
Wait-free 
• Not a single exclusive lock 
• Writer thread can always make progress 
• Optimistic locking (retry-logic) in a few places for searcher thread 
• Retry logic very simple and guaranteed to always make progress
In-memory Real-time Index 
• Highly optimized for GC - all data is stored in blocked native arrays 
• v1: Optimized for tweets with a term position limit of 255 
• v2: Support for 32 bit positions without performance degradation 
• v2: Basic support for out-of-order posting list inserts
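One way to read "blocked native arrays" is a pool of large primitive `int[]` blocks addressed by a single integer pointer, so that the garbage collector sees a few big arrays instead of millions of small objects. The sketch below illustrates that idea only; the class name `IntBlockPool`, the block size and the single-threaded `add` are assumptions, and it does not reproduce the lock-free behaviour or the actual layout of the real-time index.

```java
import java.util.ArrayList;
import java.util.List;

/** Append-only pool of fixed-size int blocks (hypothetical sketch, single-threaded). */
class IntBlockPool {
    private static final int BLOCK_BITS = 15;               // assumed: 32,768 ints per block
    private static final int BLOCK_SIZE = 1 << BLOCK_BITS;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final List<int[]> blocks = new ArrayList<>();
    private int nextPointer = 0;

    /** Appends a value and returns the pointer under which it was stored. */
    int add(int value) {
        int pointer = nextPointer++;
        int blockIndex = pointer >>> BLOCK_BITS;
        if (blockIndex == blocks.size()) {
            blocks.add(new int[BLOCK_SIZE]); // one large allocation per block
        }
        blocks.get(blockIndex)[pointer & BLOCK_MASK] = value;
        return pointer;
    }

    /** Reads the value stored at a pointer previously returned by add(). */
    int get(int pointer) {
        return blocks.get(pointer >>> BLOCK_BITS)[pointer & BLOCK_MASK];
    }

    int blockCount() {
        return blocks.size();
    }

    public static void main(String[] args) {
        IntBlockPool pool = new IntBlockPool();
        int pointer = pool.add(42);
        System.out.println(pool.get(pointer));
    }
}
```

The upper bits of a pointer select the block and the lower bits select the slot, so existing data never moves when the pool grows.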
In-memory Real-time Index 
• RT term dictionary 
• Term lookups using a lock-free hashtable in O(1) 
• v2: Additional probabilistic, lock-free skip list maintains ordering on terms 
• Perfect skip list not an option: out-of-order inserts would require 
rebalancing, which is impractical with our lock-free index 
• In a probabilistic skip list the tower height of a new (out-of-order) item can 
be determined without knowing its insert position by simply rolling a die
In-memory Real-time Index 
• Perfect skip list
In-memory Real-time Index 
• Perfect skip list 
Inserting a new element in the middle of this 
skip list requires re-balancing the towers.
In-memory Real-time Index 
• Probabilistic skip list
In-memory Real-time Index 
• Probabilistic skip list 
Tower height is determined by rolling a die 
BEFORE knowing the insert location; tower height 
never has to change for an element, simplifying 
memory allocation and concurrency.
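"Rolling a die" is commonly implemented as repeated fair coin flips. The sketch below shows only that height draw, assuming a promotion probability of 1/2 and a cap of 16 levels; both values and the names `TowerHeights` and `randomHeight` are illustrative choices, not details from the talk.

```java
import java.util.Random;

/** Draws skip list tower heights independently of the insert position (hypothetical sketch). */
class TowerHeights {
    static final int MAX_HEIGHT = 16; // assumed cap on tower height

    /** Keep flipping a fair coin while it comes up heads; each head adds one level. */
    static int randomHeight(Random random) {
        int height = 1;
        while (height < MAX_HEIGHT && random.nextBoolean()) {
            height++;
        }
        return height;
    }

    public static void main(String[] args) {
        Random random = new Random();
        int[] histogram = new int[MAX_HEIGHT + 1];
        for (int i = 0; i < 100_000; i++) {
            histogram[randomHeight(random)]++;
        }
        for (int height = 1; height <= 5; height++) {
            System.out.println("height " + height + ": " + histogram[height]);
        }
    }
}
```

About half of all elements get height 1, a quarter height 2, and so on, which gives the expected logarithmic search cost without ever rebalancing existing towers.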
Schema-based Document factory 
• Apps provide one ThriftSchema per index and create a ThriftDocument for 
each document 
• SchemaDocumentFactory translates ThriftDocument -> Lucene Document 
using the Schema 
• Default field values 
• Extended field settings 
• Type-system on top of DocValues 
• Validation
Schema-based Document factory 
Schema 
Lucene 
Document 
SchemaDocument 
Factory 
Thrift 
Document 
• Validation 
• Fill in default values 
• Apply correct Lucene 
field settings
Schema-based Document factory 
Schema 
Lucene 
Document 
SchemaDocument 
Factory 
Thrift 
Document 
• Validation 
• Fill in default values 
• Apply correct Lucene 
field settings 
Decouples core package from 
specific product/index. Similar 
to Solr/ElasticSearch.
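The three factory steps (validation, default values, field settings) can be sketched without Thrift or Lucene. Everything here is a stand-in: plain maps replace ThriftSchema, ThriftDocument and the Lucene Document, `FieldSpec` and `build` are hypothetical names, and applying Lucene field settings is left out because the talk does not show them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Schema-driven document normalization with plain maps (hypothetical stand-in). */
class SchemaDocumentFactorySketch {

    /** Hypothetical per-field schema entry: expected type plus optional default value. */
    static final class FieldSpec {
        final Class<?> type;
        final Object defaultValue; // null means the field is required

        FieldSpec(Class<?> type, Object defaultValue) {
            this.type = type;
            this.defaultValue = defaultValue;
        }
    }

    /** Validates input against the schema and fills in defaults; returns the normalized fields. */
    static Map<String, Object> build(Map<String, FieldSpec> schema, Map<String, Object> input) {
        for (String name : input.keySet()) {
            if (!schema.containsKey(name)) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
        }
        Map<String, Object> document = new LinkedHashMap<>();
        for (Map.Entry<String, FieldSpec> entry : schema.entrySet()) {
            String name = entry.getKey();
            FieldSpec spec = entry.getValue();
            Object value = input.containsKey(name) ? input.get(name) : spec.defaultValue;
            if (value == null) {
                throw new IllegalArgumentException("missing required field: " + name);
            }
            if (!spec.type.isInstance(value)) {
                throw new IllegalArgumentException("wrong type for field: " + name);
            }
            document.put(name, value);
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, FieldSpec> schema = new LinkedHashMap<>();
        schema.put("text", new FieldSpec(String.class, null));
        schema.put("lang", new FieldSpec(String.class, "en"));
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("text", "hello world");
        System.out.println(build(schema, input));
    }
}
```

Because the factory only consults the schema, the same code can serve any index, which is the decoupling the slide describes.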
Search @twitter 
Agenda 
- Introduction 
- Search Architecture 
- Lucene Extensions 
‣ Outlook
Outlook
Outlook 
• Support for parallel (sliced) segments to support partial segment rebuilds 
and other cool posting list update patterns 
• Add remaining missing Lucene features to RT index 
• Index term statistics for ranking 
• Term vectors 
• Stored fields
Questions? 
Michael Busch 
@michibusch 
michael@twitter.com 
buschmi@apache.org
Backup Slides
Searching for top entities within Tweets 
• Task: Find the best photos in a subset of tweets 
• We could use a Lucene index, where each photo is a document 
• Problem: How to update existing documents when the same photos are 
tweeted again? 
• In-place posting list updates are hard 
• Lucene’s updateDocument() is a delete/add operation - expensive and not 
order-preserving
Searching for top entities within Tweets 
• Task: Find the best photos in a subset of tweets 
• Could we use our existing time-ordered tweet index? 
• Facets!
Searching for top entities within Tweets 
• Inverted index: Query -> Doc ids 
• Inverted index: Term id -> Term label 
• Forward index: Doc id -> Document metadata 
• Facet index: Doc id -> Term ids
Storing tweet metadata 
• Facet index: Doc id -> Term ids
Searching for top entities within Tweets 
Query -> Matching doc ids: 5, 15, 9000, 9002, 100000, 100090 
Matching doc id -> Facet index -> Term ids -> Top-k heap 
Top-k heap: 
Id      Count 
48239   8 
31241   2
Searching for top entities within Tweets 
Query -> Matching doc ids: 5, 15, 9000, 9002, 100000, 100090 
Matching doc id -> Facet index -> Term ids -> Top-k heap 
Top-k heap: 
Id      Count 
48239   15 
31241   12 
85932   8 
6748    3
Searching for top entities within Tweets 
Query -> Matching doc ids: 5, 15, 9000, 9002, 100000, 100090 
Matching doc id -> Facet index -> Term ids -> Top-k heap 
Top-k heap: 
Id      Count 
48239   15 
31241   12 
85932   8 
6748    3 
Weighted counts (from engagement features) used for relevance scoring
Searching for top entities within Tweets 
Query -> Matching doc ids: 5, 15, 9000, 9002, 100000, 100090 
Matching doc id -> Facet index -> Term ids -> Top-k heap 
Top-k heap: 
Id      Count 
48239   15 
31241   12 
85932   8 
6748    3 
All query operators can be used. E.g. find best photos in San Francisco tweeted by people I follow
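The counting step on these slides can be sketched as follows. Assumptions are labeled: the facet index is modeled as a plain `int[][]` from doc id to term ids, counts are unweighted (the talk uses weighted counts from engagement features), and the names `FacetTopK` and `topK` are hypothetical. For every matching doc id the term ids are looked up and counted, and a bounded min-heap keeps the k entities with the highest counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Counts facet term ids over matching docs and keeps the top k (hypothetical sketch). */
class FacetTopK {

    /**
     * facetIndex[docId] holds the term ids (e.g. photo ids) attached to that document.
     * Returns {termId, count} pairs, highest count first.
     */
    static List<int[]> topK(int[] matchingDocIds, int[][] facetIndex, int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int docId : matchingDocIds) {
            for (int termId : facetIndex[docId]) {
                counts.merge(termId, 1, Integer::sum);
            }
        }

        // Min-heap on count: the weakest of the current top k sits at the head.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            heap.add(new int[] { entry.getKey(), entry.getValue() });
            if (heap.size() > k) {
                heap.poll(); // evict the entity with the lowest count
            }
        }

        List<int[]> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b[1], a[1]));
        return result;
    }

    public static void main(String[] args) {
        int[][] facetIndex = { { 7, 8 }, { 7 }, { 9 }, { 7, 8 } };
        for (int[] entry : topK(new int[] { 0, 1, 3 }, facetIndex, 2)) {
            System.out.println(entry[0] + " " + entry[1]);
        }
    }
}
```

Because the matching doc ids come from an ordinary query, any query operator can restrict the set of tweets whose entities are counted, and no document has to be reindexed when an entity is tweeted again.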
Searching for top entities within Tweets 
• Inverted index: Term id -> Term label
Searching for top entities within Tweets 
Inverted index: Term id -> Term label 
Id      Count   Label 
48239   45      pic.twitter.com/jknui4w 
31241   23      pic.twitter.com/dslkfj83 
85932   15      pic.twitter.com/acm3ps 
6748    11      pic.twitter.com/948jdsd 
74294   8       pic.twitter.com/dsjkf15h 
3728    5       pic.twitter.com/irnsoa32
Summary 
• Indexing tweet entities (e.g. photos) as facets makes it possible to search and rank top entities 
using a tweet index 
• All query operators supported 
• Documents don’t need to be reindexed 
• Approach reusable for different use cases, e.g.: best vines, hashtags, 
@mentions, etc.

Search at Twitter: Presented by Michael Busch, Twitter

  • 2. Search @twitter Agenda ‣ Introduction - Search Architecture - Lucene Extensions - Outlook
  • 5. Introduction Twitter has more than 284 million monthly active users.
  • 6. Introduction 500 million tweets are sent per day.
  • 7. Introduction More than 300 billion tweets have been sent since company founding in 2006.
  • 8. Introduction Tweets-per-second record: one-second peak of 143,199 TPS.
  • 9. Introduction More than 2 billion search queries per day.
  • 10. Search @twitter Agenda - Introduction ‣ Search Architecture - Lucene Extensions - Outlook
  • 13. RT index Search Architecture RT stream Analyzer/ Partitioner RT index (Earlybird) Blender Archive index RT index Mapreduce Analyzer raw tweets Tweet archive HDFS Search requests writes searches analyzed tweets analyzed tweets raw tweets
  • 14. RT index Search Architecture Tweets Analyzer/ Partitioner RT index (Earlybird) Blender Archive index RT index queue HDFS Search requests Updates Deletes/ Engagement (e.g. retweets/favs) writes searches Mapreduce Analyzer
  • 15. RT index Search Architecture RT index (Earlybird) Social graph Social Blender Archive index RT index User search Search requests writes searches • Blender is our Thrift service aggregator • Queries multiple Earlybirds, merges results Social graph graph
  • 16. Search Architecture RT index (Earlybird) Archive index User search
  • 17. Search Architecture RT index (Earlybird) Archive index • For historic reasons, these used to be entirely different codebases, but had similar features/ technologies • Over time cross-dependencies were introduced to share code User search Lucene
  • 18. Search Architecture RT index (Earlybird) Archive index User search Lucene Extensions Lucene • New Lucene extension package • This package is truly generic and has no dependency on an actual product/index • It contains Twitter’s extensions for real-time search, a thin segment management layer and other features
  • 19. Search @twitter Agenda - Introduction - Search Architecture ‣ Lucene Extensions - Outlook
  • 22. Lucene Extension Library • Abstraction layer for Lucene index segments • Real-time writer for in-memory index segments • Schema-based Lucene document factory • Real-time faceting
  • 23. Lucene Extension Library • API layer for Lucene segments • *IndexSegmentWriter • *IndexSegmentAtomicReader • Two implementations • In-memory: RealtimeIndexSegmentWriter (and reader) • On-disk: LuceneIndexSegmentWriter (and reader)
  • 24. Lucene Extension Library • IndexSegments can be built ... • in realtime • on Mesos or Hadoop (Mapreduce) • locally on serving machines • Cluster-management code that deals with IndexSegments • Share segments across serving machines using HDFS • Can rebuild segments (e.g. to upgrade Lucene version, change data schema, etc.)
  • 25. Lucene Extension Library HDFS EEEaararlyrlylbybbirirdirdd Mesos Hadoop (MR) RT pipeline
  • 26. RealtimeIndexSegmentWriter • Modified Lucene index implementation optimized for realtime search • IndexWriter buffer is searchable (no need to flush to allow searching) • In-memory • Lock-free concurrency model for best performance
  • 27. Concurrency - Definitions • Pessimistic locking • A thread holds an exclusive lock on a resource, while an action is performed [mutual exclusion] • Usually used when conflicts are expected to be likely • Optimistic locking • Operations are tried to be performed atomically without holding a lock; conflicts can be detected; retry logic is often used in case of conflicts • Usually used when conflicts are expected to be the exception
  • 28. Concurrency - Definitions • Non-blocking algorithm Ensures, that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion. • Lock-free algorithm A non-blocking algorithm is lock-free if there is guaranteed system-wide progress. • Wait-free algorithm A non-blocking algorithm is wait-free, if there is guaranteed per-thread progress. * Source: Wikipedia
  • 29. Concurrency • Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data) • But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds! • In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication • Safe publication can be achieved in different, subtle ways. Read the great book “Java concurrency in practice” by Brian Goetz for more information!
  • 30. Java Memory Model • Program order rule Each action in a thread happens-before every action in that thread that comes later in the program order. • Volatile variable rule A write to a volatile field happens-before every subsequent read of that same field. • Transitivity If A happens-before B, and B happens-before C, then A happens-before C. * Source: Brian Goetz: Java Concurrency in Practice
  • 31. Concurrency RAM 0 int x; Cache Thread 1 Thread 2 time
  • 32. Concurrency Cache 5 RAM 0 int x; Thread 1 Thread 2 x = 5; Thread A writes x=5 to cache time
  • 33. Concurrency Cache 5 RAM 0 int x; Thread 1 Thread 2 x = 5; time while(x != 5); This condition will likely never become false!
  • 34. Concurrency [Diagram: timeline with Thread 1 and Thread 2; RAM holds int x = 0; the cache is empty.]
  • 35. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 1 writes b=1 to RAM, because b is volatile.
  • 36. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 2: int dummy = b; (read volatile b) while(x != 5);
  • 37. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 2: int dummy = b; while(x != 5); happens-before (x = 5 before b = 1) • Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order.
  • 38. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 2: int dummy = b; while(x != 5); happens-before (b = 1 before int dummy = b) • Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field.
  • 39. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 2: int dummy = b; while(x != 5); happens-before (x = 5 before while(x != 5)) • Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.
  • 40. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 2: int dummy = b; while(x != 5); This condition will be false, i.e. x==5 • Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.
  • 41. Concurrency Fields: int x; volatile int b; Thread 1: x = 5; b = 1; Thread 2: int dummy = b; while(x != 5); The volatile write and read of b act as a memory barrier • Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.
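The happens-before chain on these slides can be tried out with a small Java sketch. The class and method names are made up for illustration. One deliberate difference from the slide: the reader here spins on the volatile field b until it actually observes b == 1, so the volatile variable rule is certain to apply; it then reads the plain field x.

```java
final class VolatilePiggyback {
    static int x;           // plain field, deliberately NOT volatile
    static volatile int b;  // the single volatile field that acts as the barrier

    /** Runs one writer/reader round and returns the value of x the reader saw. */
    static int runOnce() {
        x = 0;
        b = 0;
        Thread writer = new Thread(() -> {
            x = 5;  // plain write
            b = 1;  // volatile write: everything written before it is published with it
        });
        writer.start();

        // Reader (the calling thread): spin until the volatile read observes b == 1.
        while (b != 1) {
            // busy-wait; every loop iteration performs a volatile read of b
        }
        // Program order rule + volatile variable rule + transitivity: x must be 5 here.
        int seen = x;

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("x seen by reader: " + runOnce());
    }
}
```

If the reader spun on the plain field x instead, the Java Memory Model would give no guarantee that the loop ever terminates, which is the situation on slide 33.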
  • 43. Demo
  • 45. Concurrency [Timeline] IndexWriter: write 100 docs; maxDoc = 100; write more docs. IndexReader: in IR.open(): read maxDoc; search up to maxDoc. maxDoc is volatile.
  • 46. Concurrency [Timeline] IndexWriter: write 100 docs; maxDoc = 100; write more docs. IndexReader: in IR.open(): read maxDoc; search up to maxDoc. maxDoc is volatile. happens-before • Only maxDoc is volatile. All other fields that IW writes to and IR reads from don’t need to be!
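The maxDoc pattern can be sketched in a few lines of Java. This is a toy stand-in, not Earlybird or Lucene code: AppendOnlyDocs and its Reader are hypothetical names, documents are reduced to single ints, and a single writer thread is assumed.

```java
final class AppendOnlyDocs {
    private final int[] docs;     // plain array: written by the writer, read by readers
    private volatile int maxDoc;  // the only volatile field

    AppendOnlyDocs(int capacity) {
        this.docs = new int[capacity];
    }

    /** Appends one document. Must be called from a single writer thread. */
    void add(int value) {
        int next = maxDoc;
        docs[next] = value;  // plain write into the shared array
        maxDoc = next + 1;   // volatile write: publishes the array write above
    }

    /** Opens a reader that snapshots maxDoc with one volatile read. */
    Reader open() {
        return new Reader(maxDoc);
    }

    final class Reader {
        private final int limit;  // maxDoc at open time

        private Reader(int limit) {
            this.limit = limit;
        }

        int maxDoc() {
            return limit;
        }

        /** Searches only up to the snapshot, so later appends stay invisible. */
        int count(int value) {
            int hits = 0;
            for (int i = 0; i < limit; i++) {
                if (docs[i] == value) {
                    hits++;
                }
            }
            return hits;
        }
    }
}
```

A reader opened after 100 adds keeps seeing exactly 100 documents while the writer continues; every array slot below its limit was written before the volatile write it observed, so the plain reads are safe.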
  • 47. Wait-free • Not a single exclusive lock • Writer thread can always make progress • Optimistic locking (retry-logic) in a few places for searcher thread • Retry logic very simple and guaranteed to always make progress
  • 48. In-memory Real-time Index • Highly optimized for GC - all data is stored in blocked native arrays • v1: Optimized for tweets with a term position limit of 255 • v2: Support for 32-bit positions without performance degradation • v2: Basic support for out-of-order posting list inserts
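A minimal sketch of what "blocked native arrays" and a 255 position limit could look like in Java. Everything here is an assumption for illustration: the class name, the block size, and in particular the split of one 32-bit int into a 24-bit document id and an 8-bit position (the slide only states the position limit of 255).

```java
import java.util.Arrays;

final class BlockedPostingPool {
    private static final int BLOCK_SHIFT = 10;              // assumed: 1024 ints per block
    private static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    private static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private int[][] blocks = new int[2][];  // a few large primitive arrays, no per-posting objects
    private int size;                       // number of ints written so far

    /** Appends one int and returns its address (block index and offset in one int). */
    int append(int value) {
        int block = size >>> BLOCK_SHIFT;
        if (block == blocks.length) {
            blocks = Arrays.copyOf(blocks, blocks.length * 2);  // grows the block table only
        }
        if (blocks[block] == null) {
            blocks[block] = new int[BLOCK_SIZE];
        }
        blocks[block][size & BLOCK_MASK] = value;
        return size++;
    }

    int get(int address) {
        return blocks[address >>> BLOCK_SHIFT][address & BLOCK_MASK];
    }

    /** Packs a posting into one int: upper 24 bits doc id, lower 8 bits position (assumed layout). */
    static int encode(int docId, int position) {
        if (docId < 0 || docId >= (1 << 24)) {
            throw new IllegalArgumentException("docId out of range: " + docId);
        }
        if (position < 0 || position > 255) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return (docId << 8) | position;
    }

    static int docId(int posting) {
        return posting >>> 8;
    }

    static int position(int posting) {
        return posting & 0xFF;
    }
}
```

Because postings live in a handful of large int[] blocks, the garbage collector has very few objects to trace, and existing blocks are never copied when the pool grows.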
  • 50. In-memory Real-time Index • RT term dictionary • Term lookups using a lock-free hashtable in O(1) • v2: Additional probabilistic, lock-free skip list maintains ordering on terms • Perfect skip list not an option: out-of-order inserts would require rebalancing, which is impractical with our lock-free index • In a probabilistic skip list the tower height of a new (out-of-order) item can be determined without knowing its insert position by simply rolling a die
  • 51. In-memory Real-time Index • Perfect skip list
  • 52. In-memory Real-time Index • Perfect skip list Inserting a new element in the middle of this skip list requires re-balancing the towers.
  • 53. In-memory Real-time Index • Probabilistic skip list
  • 54. In-memory Real-time Index • Probabilistic skip list Tower height determined by rolling a die BEFORE knowing the insert location; tower height never has to change for an element, simplifying memory allocation and concurrency.
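The key property of the probabilistic skip list, that the tower height is chosen before the insert position is known, can be sketched as follows. This is a plain single-threaded version over int keys with hypothetical names; the lock-free variant described on the slides is not shown.

```java
import java.util.Random;

final class ProbabilisticSkipList {
    private static final int MAX_HEIGHT = 16;

    private static final class Node {
        final int key;
        final Node[] next;  // one forward pointer per level; the length is the tower height

        Node(int key, int height) {
            this.key = key;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT);  // sentinel, key unused
    private final Random random;
    private int size;

    ProbabilisticSkipList(long seed) {
        this.random = new Random(seed);
    }

    /** "Rolls the die": each extra level is added with probability 1/2. */
    private int randomHeight() {
        int height = 1;
        while (height < MAX_HEIGHT && random.nextBoolean()) {
            height++;
        }
        return height;
    }

    void insert(int key) {
        int height = randomHeight();  // decided BEFORE the insert position is known
        Node node = new Node(key, height);
        Node current = head;
        for (int level = MAX_HEIGHT - 1; level >= 0; level--) {
            while (current.next[level] != null && current.next[level].key < key) {
                current = current.next[level];
            }
            if (level < height) {  // splice in; no other tower is ever resized
                node.next[level] = current.next[level];
                current.next[level] = node;
            }
        }
        size++;
    }

    boolean contains(int key) {
        Node current = head;
        for (int level = MAX_HEIGHT - 1; level >= 0; level--) {
            while (current.next[level] != null && current.next[level].key < key) {
                current = current.next[level];
            }
        }
        Node candidate = current.next[0];
        return candidate != null && candidate.key == key;
    }

    /** Returns all keys in ascending order by walking the bottom level. */
    int[] keys() {
        int[] out = new int[size];
        int i = 0;
        for (Node n = head.next[0]; n != null; n = n.next[0]) {
            out[i++] = n.key;
        }
        return out;
    }
}
```

Since a node's next array is allocated once at its final height, an out-of-order insert only rewrites pointers of its predecessors and never rebalances existing towers.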
  • 55. Schema-based Document factory • Apps provide one ThriftSchema per index and create a ThriftDocument for each document • SchemaDocumentFactory translates ThriftDocument -> Lucene Document using the Schema • Default field values • Extended field settings • Type-system on top of DocValues • Validation
  • 56. Schema-based Document factory [Diagram: Schema + Thrift Document → SchemaDocumentFactory → Lucene Document] • Validation • Fill in default values • Apply correct Lucene field settings
  • 57. Schema-based Document factory [Diagram: Schema + Thrift Document → SchemaDocumentFactory → Lucene Document] • Validation • Fill in default values • Apply correct Lucene field settings Decouples core package from specific product/index. Similar to Solr/Elasticsearch.
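The three factory steps (validation, default values, field settings) might look roughly like this. The sketch uses plain Java maps as stand-ins for ThriftDocument and the Lucene Document so that it stays self-contained; SchemaSketch, its Type enum, and its methods are hypothetical and do not reflect Twitter's actual classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaSketch {
    enum Type { INT, STRING }

    private static final class FieldSpec {
        final Type type;
        final Object defaultValue;  // null means the field is required

        FieldSpec(Type type, Object defaultValue) {
            this.type = type;
            this.defaultValue = defaultValue;
        }
    }

    private final Map<String, FieldSpec> fields = new LinkedHashMap<>();

    /** Declares a field; pass null as defaultValue to make it required. */
    SchemaSketch field(String name, Type type, Object defaultValue) {
        fields.put(name, new FieldSpec(type, defaultValue));
        return this;
    }

    /** Translates an input document (stand-in for ThriftDocument) into a validated output document. */
    Map<String, Object> createDocument(Map<String, Object> input) {
        // Validation: reject fields the schema does not know.
        for (String name : input.keySet()) {
            if (!fields.containsKey(name)) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
        }
        Map<String, Object> document = new LinkedHashMap<>();
        for (Map.Entry<String, FieldSpec> entry : fields.entrySet()) {
            String name = entry.getKey();
            FieldSpec spec = entry.getValue();
            Object value = input.get(name);
            if (value == null) {
                value = spec.defaultValue;  // fill in the default value
            }
            if (value == null) {
                throw new IllegalArgumentException("missing required field: " + name);
            }
            boolean typeOk = spec.type == Type.INT
                    ? value instanceof Integer
                    : value instanceof String;
            if (!typeOk) {
                throw new IllegalArgumentException("wrong type for field: " + name);
            }
            document.put(name, value);  // a real factory would apply Lucene field settings here
        }
        return document;
    }
}
```

The application code only ever touches the schema and the input document, which is the decoupling the slide points at.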
  • 58. Search @twitter Agenda - Introduction - Search Architecture - Lucene Extensions ‣ Outlook
  • 61. Outlook • Support for parallel (sliced) segments to support partial segment rebuilds and other cool posting list update patterns • Add remaining missing Lucene features to RT index • Index term statistics for ranking • Term vectors • Stored fields
  • 65. Searching for top entities within Tweets • Task: Find the best photos in a subset of tweets • We could use a Lucene index, where each photo is a document • Problem: How to update existing documents when the same photos are tweeted again? • In-place posting list updates are hard • Lucene’s updateDocument() is a delete/add operation - expensive and not order-preserving
  • 66. Searching for top entities within Tweets • Task: Find the best photos in a subset of tweets • Could we use our existing time-ordered tweet index? • Facets!
  • 67. Searching for top entities within Tweets [Diagram: Inverted index: Query → Doc ids, Term id → Term label; Forward index: Doc id → Document Metadata; Facet index: Doc id → Term ids]
  • 68. Storing tweet metadata [Diagram: Facet index: Doc id → Term ids]
  • 69. Searching for top entities within Tweets [Diagram: Query → matching doc ids (5, 15, 9000, 9002, 100000, 100090) → Facet index → Term ids → Top-k heap] Top-k heap (Id: Count): 48239: 8, 31241: 2
  • 70. Searching for top entities within Tweets [Diagram: Query → matching doc ids (5, 15, 9000, 9002, 100000, 100090) → Facet index → Term ids → Top-k heap] Top-k heap (Id: Count): 48239: 15, 31241: 12, 85932: 8, 6748: 3
  • 71. Searching for top entities within Tweets [Diagram: Query → matching doc ids (5, 15, 9000, 9002, 100000, 100090) → Facet index → Term ids → Top-k heap] Top-k heap (Id: Count): 48239: 15, 31241: 12, 85932: 8, 6748: 3 Weighted counts (from engagement features) used for relevance scoring
  • 72. Searching for top entities within Tweets [Diagram: Query → matching doc ids (5, 15, 9000, 9002, 100000, 100090) → Facet index → Term ids → Top-k heap] Top-k heap (Id: Count): 48239: 15, 31241: 12, 85932: 8, 6748: 3 All query operators can be used. E.g. find best photos in San Francisco tweeted by people I follow
  • 73. Searching for top entities within Tweets [Diagram: Inverted index: Term id → Term label]
  • 74. Searching for top entities within Tweets [Diagram: the inverted index resolves term ids to labels] Id: Count → Label: Count • 48239: 45 → pic.twitter.com/jknui4w: 45 • 31241: 23 → pic.twitter.com/dslkfj83: 23 • 85932: 15 → pic.twitter.com/acm3ps: 15 • 6748: 11 → pic.twitter.com/948jdsd: 11 • 74294: 8 → pic.twitter.com/dsjkf15h: 8 • 3728: 5 → pic.twitter.com/irnsoa32: 5
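The facet lookup path on these slides (matching doc ids → facet index → term ids → top-k heap → labels) can be sketched with ordinary Java collections. FacetTopK and its parameters are hypothetical; the facet index and label lookup are modeled as maps, and counts are unweighted here, whereas the slides mention weighted counts from engagement features.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class FacetTopK {
    /**
     * Counts the facet term ids of all matching documents and returns the k most
     * frequent ones as "label count" strings, most frequent first.
     */
    static List<String> topEntities(int[] matchingDocIds,
                                    Map<Integer, int[]> facetIndex,
                                    Map<Integer, String> termLabels,
                                    int k) {
        // Step 1: doc id -> term ids via the facet index, accumulating counts per term id.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int docId : matchingDocIds) {
            int[] termIds = facetIndex.get(docId);
            if (termIds == null) {
                continue;  // document carries no facet entries
            }
            for (int termId : termIds) {
                counts.merge(termId, 1, Integer::sum);
            }
        }

        // Step 2: keep only the k largest counts in a min-heap.
        Comparator<Map.Entry<Integer, Integer>> byCount =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<Integer, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll();  // evict the current smallest count
            }
        }

        // Step 3: resolve term id -> label and emit in descending count order.
        LinkedList<String> result = new LinkedList<>();
        while (!heap.isEmpty()) {
            Map.Entry<Integer, Integer> entry = heap.poll();
            result.addFirst(termLabels.get(entry.getKey()) + " " + entry.getValue());
        }
        return result;
    }
}
```

Because the matching doc ids can come from any query, the same counting step works for any combination of query operators, and re-tweeting a photo only adds a new document rather than updating an existing one.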
  • 75. Summary • Indexing tweet entities (e.g. photos) as facets makes it possible to search and rank top entities using a tweet index • All query operators supported • Documents don’t need to be reindexed • Approach reusable for different use cases, e.g.: best vines, hashtags, @mentions, etc.