1
Metrics Are Not Enough
Gwen Shapira, Product Manager
@gwenshap
Monitoring Apache Kafka and Streaming Applications
2
Monitoring Distributed Systems is hard
“Google SRE team with 10–12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.”
https://www.oreilly.com/ideas/monitoring-distributed-systems
3
Apache Kafka is a distributed system and has many components
4
Many Moving Parts to Watch
• Producers
• Consumers
• Consumer Groups
• Brokers
• Controller
• Zookeeper
• Topics
• Partitions
• Messages
• …
5
And many metrics to monitor
• Broker throughput
• Topic throughput
• Disk utilization
• Unclean leader elections
• Network pool usage
• Request pool usage
• Request latencies – 30 request types, 5 phases each
• Topic partition status counts: online, under-replicated, offline
• Log flush rates
• ZK disconnects
• Garbage collection pauses
• Message delivery
• Consumer groups reading from topics
• …
6
Every Service that uses Kafka is a Distributed System
(Diagram: Orders Service, Stock Service, Fulfilment Service, Fraud Detection Service, and Mobile App, all connected to Kafka)
7
It is all CRITICAL to your business
• Real-time applications mean very little room for errors
• Is Kafka available and performing well? You need to know before your users do.
• You must detect and act on small problems before they escalate
• The business cares a lot about accuracy and SLAs
• It is 8:05am. Does the dashboard reflect the status of the system up to 8am?
• Continuously improve performance
• Monitor Kafka cluster performance
• Identify and act on leading indicators of future problems
• Quick triage – can you identify likely causes of a problem quickly and effectively?
8
So you may need a bit of help
• Operators must have visibility into the health of the Kafka cluster
• The business must have visibility into completeness and latency of message delivery
• Everyone needs to focus on the most meaningful metrics
9
Types of monitoring
• Tailing logs
• OS metrics
• Kafka / Client metrics
• Tracing applications
• Event level sampling
• APM – Application performance from user perspective
• …
11
Monitor System Health of Your Cluster
12
The basics
• Whatever else you do: Check that the broker process is running
  • External agent
  • Or alert on stale metrics
• Don’t alert on everything. Fewer, high-level alerts are better.
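The "alert on stale metrics" option can be sketched as a freshness check: if a broker has not reported a metric sample recently, treat it as possibly down. This is a minimal illustration, not a feature of any particular tool; the broker names, timestamps, and the `max_age_seconds` threshold are all hypothetical.

```python
import time
from typing import Dict, List, Optional


def find_stale_brokers(
    last_report: Dict[str, float],
    max_age_seconds: float = 60.0,
    now: Optional[float] = None,
) -> List[str]:
    """Return brokers whose newest metric sample is older than max_age_seconds."""
    if now is None:
        now = time.time()
    return sorted(
        broker for broker, ts in last_report.items() if now - ts > max_age_seconds
    )


# Hypothetical timestamps (in seconds) of the last metric sample seen per broker.
last_seen = {"broker-1": 1000.0, "broker-2": 940.0, "broker-3": 700.0}

# With a 90-second threshold, only brokers silent for more than 90 seconds are flagged.
stale = find_stale_brokers(last_seen, max_age_seconds=90.0, now=1010.0)
print(stale)
```

A single "metrics are stale" alert of this kind covers a crashed process, a hung JVM, and a broken metrics pipeline alike, which fits the advice to prefer fewer, high-level alerts.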
13
First Things First
14
Under-replicated partitions
• If you can monitor just one thing…
• Is it a specific broker?
• Cluster-wide:
  • Out of resources
  • Imbalance
• Broker:
  • Hardware
  • Noisy neighbor
  • Configuration
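One way to answer "is it a specific broker?" is to count which brokers are missing from the in-sync replica (ISR) sets of the under-replicated partitions. The sketch below assumes you already have replica and ISR lists per partition (for example from a topic description); the `PartitionState` type and the `dominance` threshold of 0.8 are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    replicas: List[int]  # broker ids assigned to the partition
    isr: List[int]       # broker ids currently in sync


def triage_under_replication(
    partitions: Iterable[PartitionState], dominance: float = 0.8
) -> str:
    """Say whether out-of-sync replicas concentrate on a single broker."""
    missing: Counter = Counter()
    for p in partitions:
        for broker in p.replicas:
            if broker not in p.isr:
                missing[broker] += 1
    total = sum(missing.values())
    if total == 0:
        return "healthy"
    broker, count = missing.most_common(1)[0]
    if count / total >= dominance:
        return f"broker-specific: broker {broker}"
    return "cluster-wide"


# Hypothetical cluster state: broker 3 has dropped out of every ISR.
one_bad_broker = [
    PartitionState("orders", 0, [1, 2, 3], [1, 2]),
    PartitionState("orders", 1, [2, 3, 1], [2, 1]),
    PartitionState("stock", 0, [3, 1, 2], [1, 2]),
]
print(triage_under_replication(one_bad_broker))
```

A "broker-specific" result points toward the hardware, noisy-neighbor, or configuration causes listed above; a "cluster-wide" result points toward resource exhaustion or imbalance.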
15
Drill Down into Broker and Topic: Do we see a problem right here?
16
Check partition placement - is the issue specific to one broker?
17
Don’t watch the dashboard
• Control Center detects anomalous events in monitoring data
• Users can define triggers
• Control Center performs customizable actions when triggers occur
• When troubleshooting Kafka issues, users can view previous alerts and historical message delivery data at the time the alert occurred
18
Capacity Planning – Be Proactive
• Capacity planning ensures that your cluster can continue to meet business demands
• Control Center provides indicators if a cluster may need more brokers
• Key metrics that indicate a cluster is near capacity:
  • CPU
  • Network and thread pool usage
  • Request latencies
  • Network utilization – throughput, per broker and per cluster
  • Disk utilization – disk space used by all log segments, per broker
19
Multi-Cluster Deployments
• Monitor all clusters in one place
20
Monitor End to End Message Delivery
21
Are You Meeting SLAs?
• Stream monitoring helps you determine if all messages are delivered end-to-end in a timely manner
• This is important for several reasons:
  • Ensure producers and consumers are not losing messages
  • Check if consumers are consuming more than expected
  • Verify low latency for real-time applications
  • Identify slow consumers
22
How to monitor?
The infamous LinkedIn “Audit”:
• Count messages when they are produced
• Count messages when they are consumed
• Check timestamps when they are consumed
• Compare the results
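The four audit steps above can be sketched as follows. This is only an illustration of the counting idea, not LinkedIn's or Confluent's implementation; the 60-second bucket size and the input shapes (plain timestamps in seconds) are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

BUCKET_SECONDS = 60  # assumed audit window size


def bucket_of(produce_ts: float) -> int:
    """Map a produce timestamp to the start of its audit window."""
    return int(produce_ts // BUCKET_SECONDS) * BUCKET_SECONDS


def audit(
    produced: Iterable[float],
    consumed: Iterable[Tuple[float, float]],
) -> Dict[int, dict]:
    """Compare per-window counts.

    produced: produce timestamps, one per message.
    consumed: (produce_ts, consume_ts) pairs, one per consumed message.
    """
    produced_counts = Counter(bucket_of(ts) for ts in produced)
    consumed_counts: Counter = Counter()
    max_latency: Dict[int, float] = {}
    for produce_ts, consume_ts in consumed:
        b = bucket_of(produce_ts)
        consumed_counts[b] += 1
        max_latency[b] = max(max_latency.get(b, 0.0), consume_ts - produce_ts)
    report = {}
    for b in sorted(set(produced_counts) | set(consumed_counts)):
        report[b] = {
            "produced": produced_counts[b],
            "consumed": consumed_counts[b],
            "max_latency_s": max_latency.get(b, 0.0),
        }
    return report


# Hypothetical data: four messages produced, three consumed.
produced = [0.0, 10.0, 61.0, 65.0]
consumed = [(0.0, 0.5), (10.0, 12.0), (61.0, 61.2)]
report = audit(produced, consumed)
for window, row in report.items():
    print(window, row)
```

Bucketing by the produce timestamp (rather than the consume time) keeps both counts in the same window, so a late consumer shows up as latency instead of as a false mismatch.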
23
Message delivery metrics
Streaming message delivery metrics are available:
• Aggregate
• Per-consumer group
• Per-topic
24
Under Consumption
• Reasons for under consumption:
  • Producers not handling errors and retries correctly
  • Misbehaving consumers; perhaps the consumer did not follow the shutdown sequence
  • Real-time apps intentionally skipping messages
• Red bars indicate some messages were not consumed
• Herringbone pattern can indicate an error in measurement
  • Usually improper shutdown of a client
25
Over Consumption
• Reasons for over consumption:
  • Consumers may be processing a set of messages more than once, which may affect their applications
• Consumption bars are higher than the expected consumption lines
• Latency may be higher
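Given produced and consumed counts per time window, under and over consumption reduce to a simple comparison. The sketch below assumes hypothetical window labels and counts; it is not how Control Center renders its charts.

```python
from typing import Dict, Tuple


def classify_windows(windows: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Map each problem window to a finding; windows that match are omitted.

    windows: window label -> (produced count, consumed count).
    """
    findings = {}
    for window, (produced, consumed) in windows.items():
        if consumed < produced:
            findings[window] = f"under-consumption: {produced - consumed} missing"
        elif consumed > produced:
            findings[window] = f"over-consumption: {consumed - produced} extra"
    return findings


# Hypothetical per-minute counts.
windows = {"08:00": (1000, 1000), "08:01": (1000, 940), "08:02": (1000, 1015)}
findings = classify_windows(windows)
for window, finding in findings.items():
    print(window, finding)
```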
26
Slow Consumers
• Identify consumers and consumer groups that are not keeping up with data production
• Use the per-consumer and per-consumer group metrics
• Compare a slow, lagging consumer (left) to a good consumer (right)
• The slow consumer (left) is processing all the messages, but with high latency
• Slow consumers may also process fewer messages in a given time window, so monitor "Expected consumption" (the top line)
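A complementary signal for slow consumers, different from the latency charts described above, is offset lag: the log end offset minus the group's committed offset, per partition. The sketch assumes both sets of offsets have already been fetched into plain dictionaries; treating a missing committed offset as 0 is a simplifying assumption.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Lag per partition: messages written but not yet committed by the group."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from offset 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


# Hypothetical offsets for one consumer group reading the "orders" topic.
end_offsets = {("orders", 0): 1500, ("orders", 1): 1480}
committed = {("orders", 0): 1495, ("orders", 1): 900}

lag = consumer_lag(end_offsets, committed)
print(lag, "total:", sum(lag.values()))
```

Lag that keeps growing over successive samples suggests the consumer is not keeping up with production; lag that is large but stable suggests high latency with adequate throughput.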
27
Optimize Performance
28
Identify Performance Bottlenecks
• Real-time applications require high throughput or low latency
• Need to baseline where you are
• Monitor for changes to get ahead of the problem
• You may need to identify performance bottlenecks
• Break down the times for the end-to-end dataflow to see where streams spend the most processing time
• The key metrics to look at include:
  • Request latencies
  • Network pool usage
  • Request pool usage
29
Produce and Fetch Request Latencies
Break down produce and fetch latencies across the entire request lifecycle
Request latency values can be shown at the median, 95th, 99th, or 99.9th percentile
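To make the percentile views concrete, here is a minimal nearest-rank percentile over raw latency samples. Metrics libraries typically use sampled or approximate estimators instead, so treat this as an illustration of what the numbers mean; the sample data is made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies: 1 ms, 2 ms, ..., 1000 ms.
latencies_ms = list(range(1, 1001))

for pct in (50, 95, 99, 99.9):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

The high percentiles matter because a healthy median can hide a slow tail that real-time applications still experience.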
30
Request Latencies Explained (1)
• Total request latency (center)
  • Total time of an entire request lifecycle, from the broker point of view
• Request queue
  • The time the request is in the request queue waiting for an IO thread
  • A high value can indicate that there are not enough IO threads or that the CPU is a bottleneck
  • Also check: What are those IO threads doing?
• Request local
  • The time the request is being processed locally by the leader
  • A high value can imply a slow disk, so monitor broker disk IO
31
Request Latencies Explained (2)
• Response remote
  • The time the request is waiting on other brokers
  • Higher times are expected on high-reliability or high-throughput systems
  • A high value can indicate a slow network connection, or that the consumer has caught up to the end of the log
• Response queue
  • The time the request is in the response queue waiting for a network thread
  • A high value can imply there are not enough network threads
• Response send
  • The time the request is being sent back to the consumer
  • A high value can imply the CPU or network is a bottleneck
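The five phases above can be turned into a small triage helper: find the phase that dominates total latency and look up the likely cause from these two slides. The phase keys and sample numbers below are hypothetical; on a broker the values would come from the per-request timing metrics.

```python
from typing import Dict, Tuple

# Likely causes per phase, paraphrased from the explanations above.
HINTS = {
    "request_queue": "not enough IO threads, or CPU is a bottleneck",
    "request_local": "possibly a slow disk; check broker disk IO",
    "response_remote": "waiting on other brokers; slow network, or the consumer has caught up to the end of the log",
    "response_queue": "not enough network threads",
    "response_send": "CPU or network is a bottleneck",
}


def dominant_phase(phase_ms: Dict[str, float]) -> Tuple[str, float, str]:
    """Return (phase, share of total time, hint) for the slowest phase."""
    unknown = set(phase_ms) - set(HINTS)
    if unknown:
        raise ValueError(f"unknown phases: {sorted(unknown)}")
    phase = max(phase_ms, key=lambda name: phase_ms[name])
    total = sum(phase_ms.values())
    share = phase_ms[phase] / total if total else 0.0
    return phase, share, HINTS[phase]


# Hypothetical 99th-percentile timings for produce requests, in milliseconds.
produce_p99 = {
    "request_queue": 2.0,
    "request_local": 40.0,
    "response_remote": 5.0,
    "response_queue": 1.0,
    "response_send": 2.0,
}
phase, share, hint = dominant_phase(produce_p99)
print(f"{phase} takes {share:.0%} of the request time: {hint}")
```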
32
Network and Request Handler Threads
• Network pool usage
  • Average network pool capacity usage across all brokers, i.e. the fraction of time the network processor threads are not idle
  • If network pool usage is above 70%, isolate the bottleneck with the request latency breakdown
  • Consider increasing the broker configuration parameter num.network.threads, especially if the Response queue metric is high and you have spare resources
• Request pool usage
  • Average request handler capacity usage across all brokers, i.e. the fraction of time the request handler threads are not idle
  • If request pool usage is above 70%, isolate the bottleneck with the request latency breakdown
  • Consider increasing the broker configuration parameter num.io.threads, especially if the Request queue metric is high
  • Why are all your handlers busy? Check GC, access patterns and disk IO
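Since pool usage is defined as the fraction of time threads are not idle, it can be derived from idle fractions such as those reported by the broker metrics NetworkProcessorAvgIdlePercent and RequestHandlerAvgIdlePercent (values between 0 and 1). The sketch below applies the 70% rule from this slide; the function names and sample values are illustrative assumptions.

```python
from typing import List

USAGE_THRESHOLD = 0.70  # the 70% rule of thumb from the slide


def pool_usage(avg_idle_fraction: float) -> float:
    """Usage is the fraction of time the threads are not idle."""
    if not 0.0 <= avg_idle_fraction <= 1.0:
        raise ValueError("idle fraction must be between 0 and 1")
    return 1.0 - avg_idle_fraction


def advise(network_idle: float, handler_idle: float) -> List[str]:
    """Return follow-up suggestions for pools running above the threshold."""
    advice = []
    if pool_usage(network_idle) > USAGE_THRESHOLD:
        advice.append(
            "network pool busy: check the latency breakdown; "
            "consider raising num.network.threads"
        )
    if pool_usage(handler_idle) > USAGE_THRESHOLD:
        advice.append(
            "request pool busy: check the latency breakdown; "
            "consider raising num.io.threads"
        )
    return advice


# Hypothetical readings: network threads 55% idle, request handlers 20% idle.
suggestions = advise(network_idle=0.55, handler_idle=0.20)
for line in suggestions:
    print(line)
```

Adding threads only helps when the threads are the constraint, which is why the slide pairs each setting with the matching queue metric and asks what the busy handlers are actually doing.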
33
Summary
34
A few things to remember…
• Monitor Kafka
• Work with your developers to monitor critical applications end-to-end
• More data is better: Metrics + logs + OS + APM + …
• But fewer alerts are better
• Alert on what’s important – Under-Replicated Partitions is a good start
• DON’T JUST FIDDLE WITH STUFF
• AND DON’T RESTART KAFKA FOR LOLS
• If you don’t know what you are doing, it is ok. There’s support (and Cloud) for that.
35
And as you start your Production Kafka Journey…
Plan
Validate
Deploy
Observe
Analyze
36
Thank You!
