How Kafka Powers the World’s Most Popular
Vector Database
Frank Liu
Speaker
Frank Liu
Director of Operations & ML Architect
frank@zilliz.com
https://linkedin.com/in/fzliu
https://twitter.com/frankzliu
CONTENTS
01 Unstructured Data and Embeddings
02 Vector Database Overview
03 Milvus Architecture
04 Kafka as a Messaging Backbone
01
Unstructured Data and Embeddings
Vector Database Overview
What is Unstructured Data?
Any data that does not conform to a predefined data model.
The Evolution of Data
Using Vectors to Represent Data
Embeddings!
02
Vector Database Overview
Vector Database Overview
A database purpose-built to store, index, and query large quantities of
embeddings.
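As a rough illustration of "store, index, and query", here is a minimal sketch of a brute-force ("flat") search over a handful of toy embeddings. The `FlatIndex` class, its method names, and the three-dimensional vectors are hypothetical and chosen only for illustration; this is not Milvus code, and a production vector database would use approximate indexes rather than a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


class FlatIndex:
    """Hypothetical in-memory store that scans every vector on each query."""

    def __init__(self) -> None:
        self._vectors: Dict[str, Tuple[float, ...]] = {}

    def insert(self, key: str, vector: Sequence[float]) -> None:
        # "Store": keep the embedding under its key.
        self._vectors[key] = tuple(vector)

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]:
        # "Query": Euclidean distance to every stored vector, smallest first.
        q = tuple(query)
        scored = [(key, math.dist(q, vec)) for key, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


index = FlatIndex()
index.insert("cat", [0.9, 0.1, 0.0])
index.insert("kitten", [0.8, 0.2, 0.1])
index.insert("truck", [0.0, 0.1, 0.9])

nearest = index.search([0.9, 0.1, 0.05], k=2)
print([key for key, _ in nearest])  # ['cat', 'kitten']
```

A full scan costs time proportional to the number of stored vectors, which is the motivation for the dedicated index types listed on the following slides.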
Vector Databases in Production
Milvus Features
• Supports hardware accelerators
• SIMD support on CPUs
• GPU support for faster querying & indexing
• Supports key database functions
• Data partitioning and data sharding
• Filtered queries and searches
• Multiple options for indices and similarity metrics
• FAISS (HNSW, Flat, PQ), ANNOY, DiskANN, ScANN, etc…
• Euclidean, dot product (cosine), boolean
• A number of SDKs
• Python
• Go
• Node
• Java
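The similarity-metrics bullet above pairs dot product with cosine. A small sketch, using made-up two-dimensional vectors, shows why the two are often grouped: once vectors are normalized to unit length, their dot product equals their cosine similarity. The helper names here are illustrative and not part of any Milvus SDK.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a: Sequence[float]) -> List[float]:
    length = norm(a)
    return [x / length for x in a]


a = [3.0, 4.0]
b = [4.0, 3.0]

print(euclidean(a, b))                 # distance: smaller means more similar
print(dot(a, b))                       # raw dot product depends on vector length
print(cosine(a, b))                    # depends only on the angle between a and b
print(dot(normalize(a), normalize(b))) # matches cosine after normalization
```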
Cloud Nativity
• Kubernetes native
• Deployment through Helm
• Native S3 support
• MinIO-based design
• Azure Blob and GCS support
• Easy on-prem to cloud conversion
• Fully distributed
• Highly elastic and horizontally scalable
• Disaggregated storage and compute (shared storage)
• Separate read, write, and background (indexing) services
Vector Database Ecosystem
03
Milvus Architecture
Milvus Architecture
Access Layer
• Data-related languages (SQL context)
• Data definition language (DDL): modify/define database schema
• Data manipulation language (DML): store, modify, and retrieve data
• Data control language (DCL): define user rights and permissions
• Access layer = multiple proxy nodes
• Proxy node functions
• Manage message ingestion and routing
• Routes DDL and DCL requests to coordinators
• Routes DML requests to the log for worker consumption
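The routing rule on this slide can be sketched as a toy dispatcher. Everything below is hypothetical: the operation names, the `Proxy` class, and the two in-memory lists that stand in for the coordinators and the log are assumptions made for illustration, not the actual Milvus proxy.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical operation names; real Milvus request types differ.
REQUEST_KIND: Dict[str, str] = {
    "create_collection": "DDL",
    "drop_collection": "DDL",
    "grant_privilege": "DCL",
    "insert": "DML",
    "delete": "DML",
}


@dataclass
class Proxy:
    coordinator_inbox: List[dict] = field(default_factory=list)  # stands in for coordinators
    log: List[dict] = field(default_factory=list)                # stands in for the log broker

    def route(self, request: dict) -> str:
        kind = REQUEST_KIND[request["op"]]
        if kind in ("DDL", "DCL"):
            # Schema and permission changes go to a coordinator.
            self.coordinator_inbox.append(request)
            return "coordinator"
        # Data changes are appended to the log for workers to consume.
        self.log.append(request)
        return "log"


proxy = Proxy()
destinations = [
    proxy.route({"op": "create_collection", "name": "docs"}),
    proxy.route({"op": "grant_privilege", "user": "alice"}),
    proxy.route({"op": "insert", "id": 1, "vector": [0.1, 0.2]}),
]
print(destinations)  # ['coordinator', 'coordinator', 'log']
```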
Coordinator Layer
• Root coordinator node
• Handles DDL and DCL requests
• Data coordinator node
• Triggers background data operations (flush, compact, etc.)
• Manages data node cluster
• Maintains metadata of inserted data
• Query coordinator node
• Manages query node cluster
• Index coordinator node
• Manages index node cluster
• Determines when indexes are built
• Maintains index metadata
Worker Layer
• Worker overview
• All workers are stateless
• All DML requests are handled by workers
• Data node
• Retrieves incremental log data from the log broker
• Packs and stores log data into log snapshots
• Processes mutation requests
• Query node
• Loads indexes and data from object storage
• Runs searches and queries
• Index node
• Builds indexes on inserted data
Storage Layer
• Log broker - Kafka
• Streaming data persistence
• Execution of reliable asynchronous queries
• Event notification
• Metadata storage - etcd
• Service registration and health checks
• Message consumption checkpoints
• Object storage - S3/MinIO
• Stores snapshot files of logs
• Stores index files for scalar and vector data
• Stores intermediate query results
Key Takeaways
• Single coordinator instance per service type
• Coordinators manage corresponding worker node cluster
• Data is stored in Collections
• Akin to collections in MongoDB or tables in relational databases
• Disaggregation of query, indexing, and data
• Significant horizontal scalability
• Support for a wide range of application requirements
• Message streams are core to Milvus
• All data passes through message queue
• Kafka’s innate cloud nativity allows Milvus to scale easily
04
Kafka as a Messaging Backbone
Milvus’ Messaging Backbone
• Log as data
• Operations are centralized around the log broker
• CRUD operations are performed by subscribing to and consuming logs
• Pub/sub scheme allows for stream & batch processing
• Decoupling of read and write components
• Coordinators manage corresponding worker node cluster
• Support for both streaming and batched execution
• Data nodes read from streams and write to binlog
• Streaming uses WAL, batching uses binlog
• All requests that change system state go through WAL
• Create collection, delete collection
• Insert, update, delete vector
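The "log as data" idea above can be sketched with an in-memory append-only log and two subscribers. The `Log` and `Subscriber` classes are hypothetical stand-ins (a real deployment would use Kafka topics and consumer offsets); the point of the sketch is that a subscriber polling after every write (streaming) and one polling once at the end (batch) rebuild the same state from the same log.

```python
from typing import Dict, List


class Log:
    """Hypothetical append-only log standing in for a Kafka topic."""

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def append(self, entry: dict) -> int:
        self._entries.append(entry)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset: int) -> List[dict]:
        return self._entries[offset:]


class Subscriber:
    """Rebuilds its own view of the data by replaying log entries."""

    def __init__(self, log: Log) -> None:
        self._log = log
        self._offset = 0  # next offset to consume
        self.state: Dict[int, List[float]] = {}

    def poll(self) -> int:
        entries = self._log.read_from(self._offset)
        for entry in entries:
            if entry["op"] == "insert":
                self.state[entry["id"]] = entry["vector"]
            elif entry["op"] == "delete":
                self.state.pop(entry["id"], None)
        self._offset += len(entries)
        return len(entries)


wal = Log()
streaming = Subscriber(wal)
batch = Subscriber(wal)

for entry in [
    {"op": "insert", "id": 1, "vector": [0.1, 0.2]},
    {"op": "insert", "id": 2, "vector": [0.3, 0.4]},
    {"op": "delete", "id": 1},
]:
    wal.append(entry)
    streaming.poll()  # consume each entry as it arrives

batch.poll()  # consume the whole log in one pass
print(streaming.state == batch.state)
```

Because writers only append and readers only track an offset, the two sides never call each other directly, which is one way to read the slide's point about decoupling read and write components.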
Example: Vector Insert
• Loggers are organized into a hash ring
• Time Stamp Oracle (TSO) ensures logger consistency
• Different channels for different requests
• Prevents request type interference
• Data nodes subscribe to specific channels
• Inserts hashed across multiple channels (+efficiency)
• Data nodes can be freely expanded to increase throughput
• Convert row-based WAL to column-based binlogs
• Kafka’s cloud nativity powers Milvus’ scalability
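The insert path described above can be approximated in a few lines. This is a simplified sketch under stated assumptions: it uses plain modulo hashing rather than a hash ring, a local counter in place of a real Time Stamp Oracle, Python lists in place of Kafka channels, and a dictionary of columns in place of binlog files. All names (`NUM_CHANNELS`, `channel_for`, `publish`, `rows_to_columns`) are hypothetical.

```python
import itertools
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_CHANNELS = 2  # hypothetical channel count

_tso = itertools.count(1)  # toy stand-in for a timestamp oracle


def channel_for(primary_key: int) -> int:
    # Stable hash, so the same key always lands on the same channel.
    return zlib.crc32(str(primary_key).encode()) % NUM_CHANNELS


def publish(rows: List[dict], channels: Dict[int, List[dict]]) -> None:
    # Stamp each row with a monotonically increasing timestamp,
    # then append it to the channel chosen by its primary key.
    for row in rows:
        stamped = dict(row, ts=next(_tso))
        channels[channel_for(row["id"])].append(stamped)


def rows_to_columns(rows: List[dict]) -> Dict[str, list]:
    # Pivot row-based log entries into one list per field.
    columns: Dict[str, list] = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


channels: Dict[int, List[dict]] = defaultdict(list)
rows = [{"id": i, "vector": [i * 0.1, i * 0.2]} for i in range(6)]
publish(rows, channels)

# Each "data node" converts the rows from its channel into columns.
binlogs = {ch: rows_to_columns(entries) for ch, entries in channels.items()}
for ch, columns in sorted(binlogs.items()):
    print(ch, columns["id"], columns["ts"])
```

Since each channel is consumed independently, adding channels and data nodes spreads the insert load, which matches the slide's claim that data nodes can be expanded to increase throughput.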
Kafka Powers Milvus
THANK YOU FOR LISTENING
https://github.com/milvus-io/milvus
https://zilliz.com

How Kafka Powers the World's Most Popular Vector Database System with Charles Xie and Frank Liu | Current 2022
