How Kafka Powers the World's Most Popular Vector Database System with Charles Xie and Frank Liu | Current 2022

How Kafka Powers the World’s Most Popular Vector Database
Frank Liu

Speaker
Frank Liu
Director of Operations & ML Architect
frank@zilliz.com
https://linkedin.com/in/fzliu
https://twitter.com/frankzliu

CONTENTS
01 Unstructured Data and Embeddings
02 Vector Database Overview
03 Milvus Architecture
04 Kafka as a Messaging Backbone

01
Unstructured Data and Embeddings

What is Unstructured Data?
Any data that does not conform to a pre-defined data model.

Using Vectors to Represent Data
Embeddings!

Vector Database Overview
A database purpose-built to store, index, and query large quantities of
embeddings.
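
To make "store, index, and query" concrete, here is a minimal sketch of the simplest possible query: a brute-force scan, which is roughly what a Flat index does. The collection contents, the vector values, and the function names are invented for illustration and are not taken from Milvus.

```python
import math

# Toy in-memory "collection": id -> embedding.
# The vectors are made up; real embeddings come from a trained model.
collection = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, top_k=2):
    """Return the ids of the top_k stored vectors closest to the query."""
    ranked = sorted(collection.items(), key=lambda item: euclidean(query, item[1]))
    return [key for key, _ in ranked[:top_k]]


print(search([0.9, 0.1, 0.05]))  # ['cat', 'dog']
```

A scan like this costs time proportional to the number of stored vectors per query, which is why a vector database builds approximate indexes instead of comparing against everything.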

Vector Databases in Production

Milvus Features
• Supports hardware accelerators
  • SIMD support on CPUs
  • GPU support for faster querying & indexing
• Supports key database functions
  • Data partitioning and data sharding
  • Filtered queries and searches
• Multiple options for indices and similarity metrics
  • FAISS (HNSW, Flat, PQ), ANNOY, DiskANN, ScANN, etc.
  • Euclidean, dot product (cosine), boolean
• A number of SDKs
  • Python
  • Go
  • Node
  • Java
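
The slide groups dot product with cosine. One likely reading is that the two coincide once vectors are scaled to unit length. The sketch below checks that with plain Python; the helper names and sample vectors are illustrative assumptions, not part of any Milvus SDK.

```python
import math


def dot(a, b):
    """Dot (inner) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def squared_euclidean(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

# On unit vectors the dot product equals the cosine similarity, and the
# squared Euclidean distance equals 2 - 2 * cosine, so all three metrics
# rank neighbours in the same order.
print(cosine(a, b), dot(ua, ub), squared_euclidean(ua, ub))
```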

Cloud Nativity
• Kubernetes native
• Deployment through Helm
• Native S3 support
• MinIO-based design
• Azure Blob and GCS support
• Easy on-prem to cloud conversion
• Fully distributed
• Highly elastic and horizontally scalable
• Disaggregated storage and compute (shared storage)
• Separate read, write, and background (indexing) services

Access Layer
• Data-related languages (SQL context)
  • Data definition language (DDL): modify/define database schema
  • Data manipulation language (DML): store, modify, and retrieve data
  • Data control language (DCL): define user rights and permissions
• Access layer = multiple proxy nodes
• Proxy node functions
  • Manage message ingestion and routing
  • Route DDL and DCL instructions to coordinators
  • Route DML to the log for worker consumption
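
The routing rule on this slide can be sketched in a few lines. This is a toy model under stated assumptions: the `Proxy` class, the operation names, and the two in-memory lists standing in for the coordinators and the log are all hypothetical and do not mirror the Milvus code base.

```python
# Hypothetical operation names, grouped by language category.
DDL_OPS = {"create_collection", "drop_collection"}
DCL_OPS = {"grant", "revoke"}
DML_OPS = {"insert", "delete"}


class Proxy:
    """Toy proxy node: forwards each request by its category."""

    def __init__(self):
        self.coordinator_inbox = []  # stand-in for the coordinator layer
        self.log = []                # stand-in for the log broker

    def handle(self, op, payload):
        """Route one request and report where it went."""
        if op in DDL_OPS or op in DCL_OPS:
            self.coordinator_inbox.append((op, payload))
            return "coordinator"
        if op in DML_OPS:
            self.log.append((op, payload))
            return "log"
        raise ValueError(f"unknown operation: {op}")


proxy = Proxy()
print(proxy.handle("create_collection", {"name": "images"}))  # coordinator
print(proxy.handle("insert", {"id": 1, "vector": [0.1, 0.2]}))  # log
```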

Coordinator Layer
• Root coordinator node
  • Handles DDL and DCL requests
• Data coordinator node
  • Triggers background data operations (flush, compact, etc.)
  • Manages data node cluster
  • Maintains metadata of inserted data
• Query coordinator node
  • Manages query node cluster
• Index coordinator node
  • Manages index node cluster
  • Determines when indexes are built
  • Maintains index metadata

Worker Layer
• Worker overview
  • All workers are stateless
  • All DML requests are handled by workers
• Data node
  • Retrieves incremental log data from the log broker
  • Packs and stores log data into log snapshots
  • Processes mutation requests
• Query node
  • Loads indexes and data from object storage
  • Runs searches and queries
• Index node
  • Builds indexes on inserted data

Storage Layer
• Log broker - Kafka
  • Streaming data persistence
  • Execution of reliable asynchronous queries
  • Event notification
• Metadata storage - etcd
  • Service registration and health checks
  • Message consumption checkpoints
• Object storage - S3/MinIO
  • Stores snapshot files of logs
  • Stores index files for scalar and vector data
  • Stores intermediate query results

Key Takeaways
• Single coordinator instance per service type
• Coordinators manage corresponding worker node cluster
• Data is stored in Collections
• Akin to collections in MongoDB or tables in relational databases
• Disaggregation of query, indexing, and data
• Significant horizontal scalability
• Support for a wide range of application requirements
• Message streams are core to Milvus
• All data passes through message queue
• Kafka's innate cloud nativity allows Milvus to scale easily

04
Kafka as a Messaging Backbone

Milvus’ Messaging Backbone
• Log as data
• Operations are centralized around the log broker
• CRUD operations are performed by subscribing to and consuming logs
• Pub/sub scheme allows for stream & batch processing
• Decoupling of read and write components
• Coordinators manage corresponding worker node cluster
• Support for both streaming and batched execution
• Data nodes read from streams and write to binlog
• Streaming uses WAL, batching uses binlog
• All requests that change system state go through WAL
• Create collection, delete collection
• Insert, update, delete vector

Example: Vector Insert
• Loggers are organized into a hash ring
• Timestamp Oracle (TSO) ensures logger consistency
• Different channels for different requests
• Prevents request type interference
• Data nodes subscribe to specific channels
• Inserts hashed across multiple channels (+efficiency)
• Data nodes can be freely expanded to increase throughput
• Convert row-based WAL to column-based binlogs
• Kafka’s cloud nativity powers Milvus’ scalability
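
Two steps from this slide lend themselves to a short sketch: hashing inserts across channels, and converting row-based log entries into a column-based layout. The code below is a simplified illustration; the CRC32 hash, the field names, and the function names are assumptions, and Milvus' actual hashing scheme and binlog format are not shown in the slides.

```python
import zlib


def channel_for(primary_key, num_channels):
    """Pick a channel for a key using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % num_channels


def shard(rows, num_channels):
    """Distribute rows across channels by hashing their primary key."""
    channels = [[] for _ in range(num_channels)]
    for row in rows:
        channels[channel_for(row["id"], num_channels)].append(row)
    return channels


def rows_to_columns(rows):
    """Turn a list of row dicts into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


rows = [
    {"id": 1, "vector": [0.1, 0.2]},
    {"id": 2, "vector": [0.3, 0.4]},
    {"id": 3, "vector": [0.5, 0.6]},
]

channels = shard(rows, num_channels=2)
print([len(c) for c in channels])
print(rows_to_columns(rows))
```

Since the hash depends only on the key, any node can compute the target channel without coordination, and a data node subscribed to a channel sees every write for the keys that hash to it.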

THANK YOU FOR LISTENING
https://github.com/milvus-io/milvus
https://zilliz.com
