Identifying (and fixing)
oslo.messaging &
RabbitMQ issues
Michael Klishin, Pivotal
Dmitry Mescheryakov, Mirantis
What is oslo.messaging?
● Library for
○ building RPC clients/servers
○ emitting/handling notifications
● Supports several backends:
○ RabbitMQ
■ based on Kombu - the oldest and best known (and the one this talk covers)
■ based on Pika - recent addition
○ AMQP 1.0
Spawning a VM in Nova
[Diagram: Client, nova-api (x3), nova-conductor (x2), nova-scheduler (x3), nova-compute (x4); connections labeled HTTP and RPC]
Examples
Internal:
● nova-compute sends a report to nova-conductor every minute
● nova-conductor sends a command to spawn a VM to nova-compute
● neutron-l3-agent requests router list from neutron-server
● …
External:
● Every OpenStack service sends notifications to Ceilometer
Where is RabbitMQ in this picture?
[Diagram: nova-conductor and nova-compute communicate over RPC through RabbitMQ; the request goes through the queue compute.node-1.domain.tld and the response through the queue reply_b6686f7be58b4773a2e0f5475368d19a]
Spotting oslo.messaging logs
2016-04-15 11:16:57.239 16181 DEBUG nova.service [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/dist-packages/nova/service.py:218
2016-04-15 11:16:57.258 16181 DEBUG oslo.messaging._drivers.pool [req-d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Pool creating new connection create /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/pool.py:109
My favorite oslo.messaging exception
...
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 9e4a677887134a0cbc134649cd46d1ce
oslo.messaging operations
● Cast - fire an RPC request and forget about it
● Notify - the same, only the message format is different
● Call - send an RPC request and receive a reply
Call throws a MessagingTimeout exception when a reply isn’t received within a certain amount of time
Making a Call
1. Client -> request -> RabbitMQ
2. RabbitMQ -> request -> Server
3. Server processes the request and produces the response
4. Server -> response -> RabbitMQ
5. RabbitMQ -> response -> Client
If the process gets stuck at any step from 2 to 5, the client gets a MessagingTimeout exception.
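The five steps above can be sketched with two in-process queues standing in for RabbitMQ. This is only an illustration of the request/reply pattern and its timeout, not oslo.messaging's implementation: the `MessagingTimeout` class, `call`, and `serve_one` below are local stand-ins defined for this example.

```python
import queue
import threading
import uuid


class MessagingTimeout(Exception):
    """Local stand-in for the oslo.messaging exception of the same name."""


def call(request_q, reply_q, payload, timeout):
    """Send a request, then wait up to `timeout` seconds for the reply."""
    msg_id = uuid.uuid4().hex
    # Step 1: the client hands the request to the "broker".
    request_q.put({"msg_id": msg_id, "payload": payload})
    try:
        # Step 5: the client waits for the response to come back.
        reply = reply_q.get(timeout=timeout)
    except queue.Empty:
        raise MessagingTimeout(
            "Timed out waiting for a reply to message ID %s" % msg_id
        ) from None
    return reply["result"]


def serve_one(request_q, reply_q):
    """Steps 2-4: receive one request, process it, send the response."""
    request = request_q.get()
    result = request["payload"].upper()  # hypothetical "processing"
    reply_q.put({"msg_id": request["msg_id"], "result": result})


if __name__ == "__main__":
    request_q, reply_q = queue.Queue(), queue.Queue()

    # A server is consuming: the call completes.
    server = threading.Thread(target=serve_one, args=(request_q, reply_q))
    server.start()
    print(call(request_q, reply_q, "ping", timeout=1.0))
    server.join()

    # Nobody consumes the request: the client only sees a timeout.
    try:
        call(request_q, reply_q, "ping", timeout=0.1)
    except MessagingTimeout as exc:
        print(exc)
```

Note that the client cannot tell from the exception alone which of steps 2 to 5 failed, which is why the following slides turn to logs and broker state.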
Debug shows the truth
L3 Agent log
CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron topic: q-reports-plugin
received reply msg_id: ae63b165611f439098f1461f906270de
Neutron Server
received message msg_id: ae63b165611f439098f1461f906270de reply to: reply_df2405440ffb40969a2f52c769f72e30
REPLY msg_id: ae63b165611f439098f1461f906270de reply queue: reply_df2405440ffb40969a2f52c769f72e30
* Examples from Mitaka
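With debug logs like these, finding calls that never got a reply can be automated. The sketch below assumes the client log contains the `CALL msg_id: ...` and `received reply msg_id: ...` fragments exactly as shown above; other releases may word these lines differently, and the function name is made up for this example.

```python
import re

# Patterns modeled on the Mitaka log fragments shown on the slide.
CALL_RE = re.compile(r"CALL msg_id: (?P<id>[0-9a-f]{32})")
REPLY_RE = re.compile(r"received reply msg_id: (?P<id>[0-9a-f]{32})")


def unanswered_calls(log_lines):
    """Return msg_ids that have a CALL line but no 'received reply' line."""
    pending = []
    for line in log_lines:
        match = CALL_RE.search(line)
        if match:
            pending.append(match.group("id"))
            continue
        match = REPLY_RE.search(line)
        if match and match.group("id") in pending:
            pending.remove(match.group("id"))
    return pending
```

Any msg_id this returns can then be searched for in the server's log to see whether the request ever arrived there.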
Enabling the debug
[DEFAULT]
debug=true
default_log_levels=...,oslo.messaging=DEBUG,...
If you don’t have debug enabled
Examine the stack trace
Find which operation failed
Guess the destination service
Try to find correlating log entries around the time the request was made
File "/opt/stack/neutron/neutron/agent/dhcp/agent.py", line 571, in _report_state
self.state_rpc.report_state(ctx, self.agent_state, self.use_call)
File "/opt/stack/neutron/neutron/agent/rpc.py", line 86, in report_state
return method(context, 'report_state', **kwargs)
File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
Diagnosing issues through RabbitMQ
● # rabbitmqctl list_queues consumers name
0 consumers indicates that nobody is listening on the queue
● # rabbitmqctl list_queues messages consumers name
If a queue has consumers but messages keep accumulating in it, the corresponding service cannot process messages in time, is stuck in a deadlock, or the cluster is partitioned
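Both checks can be applied to captured command output in one pass. The sketch below assumes the output of `rabbitmqctl list_queues messages consumers name` is one tab-separated row per queue, with any banner or header lines not matching that shape; the backlog threshold of 100 messages is an arbitrary value chosen for illustration.

```python
def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues messages consumers name` output.

    Returns a list of (messages, consumers, name) tuples. Lines that are
    not three tab-separated fields starting with two integers (banners,
    headers) are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        if not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows


def suspicious_queues(rows, backlog_threshold=100):
    """Split queues into 'nobody listening' and 'consumers falling behind'."""
    no_consumers = [name for messages, consumers, name in rows
                    if consumers == 0]
    backlog = [name for messages, consumers, name in rows
               if consumers > 0 and messages >= backlog_threshold]
    return no_consumers, backlog
```

A queue in the first list points at a service that is down or disconnected; a queue in the second points at a slow or stuck consumer, or at a partition.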
Checking RabbitMQ cluster for integrity
# rabbitmqctl cluster_status
Check that its output contains all the nodes in the cluster. You might find that your cluster is partitioned.
Partitioning is one likely reason for some messages to get stuck in queues.
How to fix such issues
For RabbitMQ issues including partitioning, see RabbitMQ docs
Restart of the affected services helps in most cases
Force close connections using `rabbitmqctl` or HTTP API
Never set amqp_auto_delete = true
Use a queue expiration policy instead, with a TTL of at least 1 minute
Starting with Mitaka, all queues that used to be auto-delete by default have been replaced with expiring ones
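As a rough sketch of what such an expiration policy could look like, the helper below builds a `rabbitmqctl set_policy` command line that sets the `expires` queue argument (in milliseconds) and refuses TTLs under one minute. The policy name and the `^reply_` pattern are hypothetical choices for this example (the pattern is modeled on the reply queue names shown earlier), and the helper only builds the argument list without running anything.

```python
import json

MIN_TTL_MS = 60 * 1000  # "at least 1 minute", expressed in milliseconds


def expiry_policy_command(name, pattern, ttl_ms):
    """Build an argv list for `rabbitmqctl set_policy` with a queue TTL.

    `name` and `pattern` are chosen by the operator; nothing is executed.
    """
    if ttl_ms < MIN_TTL_MS:
        raise ValueError("queue TTL should be at least %d ms" % MIN_TTL_MS)
    definition = json.dumps({"expires": ttl_ms})
    return ["rabbitmqctl", "set_policy", "--apply-to", "queues",
            name, pattern, definition]


if __name__ == "__main__":
    # Hypothetical policy: expire unused reply queues after 10 minutes.
    print(" ".join(expiry_policy_command("expire-replies", "^reply_", 600000)))
```

Unlike auto-delete, an expiring queue survives a short consumer disconnect and is only removed after it has been unused for the whole TTL.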
Why not amqp_auto_delete?
[Diagram: nova-conductor, RabbitMQ, nova-compute; a message in the queue compute.node-1.domain.tld, which has auto-delete = true; a network hiccup]
Queue mirroring is quite expensive
Our testing shows a 2x drop in throughput on a 3-node cluster with the ‘ha-mode: all’ policy compared with non-mirrored queues.
RPC can live without it
But notifications might be too important (if used for billing)
In the latter case, enable mirroring for notification queues only (example in Fuel)
Use different backends for RPC and Notifications
Different drivers
Same driver. For example:
RPC messages go through one RabbitMQ cluster
Notification messages go through another RabbitMQ cluster
Implementation (non-documented)
* Available starting from Mitaka
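A service configuration for the "same driver, two clusters" case might look like the text embedded in the sketch below. The slides do not show the option names; this assumes the `transport_url` option in `[DEFAULT]` for RPC and in `[oslo_messaging_notifications]` for notifications, and that notifications fall back to the RPC URL when the second one is unset. Host names and credentials are placeholders. The Python only reads the file the way an operator's sanity-check script might; it is not how oslo.messaging itself loads configuration.

```python
import configparser

# Hypothetical service configuration: two separate RabbitMQ clusters.
EXAMPLE_CONF = """
[DEFAULT]
transport_url = rabbit://user:pass@rpc-rabbit.example.org:5672/

[oslo_messaging_notifications]
transport_url = rabbit://user:pass@notif-rabbit.example.org:5672/
"""


def transport_urls(conf_text):
    """Return (rpc_url, notification_url) found in an INI-style config."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    rpc_url = parser["DEFAULT"]["transport_url"]
    # Assumed fallback: notifications reuse the RPC transport if unset.
    notification_url = parser.get(
        "oslo_messaging_notifications", "transport_url", fallback=rpc_url)
    return rpc_url, notification_url


if __name__ == "__main__":
    rpc_url, notification_url = transport_urls(EXAMPLE_CONF)
    print("RPC:          ", rpc_url)
    print("Notifications:", notification_url)
```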
Troubleshooting common oslo.messaging and RabbitMQ issues
Part 2
Erlang VM process disappears
Syslog, kern.log, /var/log/messages: grep for “killed process”
“Cannot allocate 1117203264527168 bytes of memory (of type …)” — move to
Erlang 17.5 or 18.3
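The log search can be scripted when several hosts need checking. This is a small sketch, assuming the kernel's OOM killer message contains the words "Killed process" followed by the process name, and that the Erlang VM shows up with a name containing `beam` (for example `beam.smp`); the sample line in the usage part is made up for illustration.

```python
import re

KILLED_RE = re.compile(r"killed process", re.IGNORECASE)


def find_oom_kills(lines, process_hint="beam"):
    """Return log lines that report a killed process matching the hint."""
    return [line.rstrip("\n") for line in lines
            if KILLED_RE.search(line) and process_hint in line]


if __name__ == "__main__":
    # Hypothetical kern.log excerpt.
    sample = [
        "kernel: Out of memory: Killed process 1234 (beam.smp) total-vm:...\n",
        "kernel: eth0: link up\n",
    ]
    for hit in find_oom_kills(sample):
        print(hit)
```

In practice the input would be an open file such as kern.log or /var/log/messages rather than an in-memory list.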
Stats DB overload
Connections, channels, queues, and nodes emit stats on a timer
With a lot of those the stats DB collector can fall behind
`rabbitmqctl status` reports most RAM used by `mgmt_db`
You can reset it: `rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'`
Resetting is a safe thing to do but may confuse your monitoring tools
A new, better parallelized event collector is coming in RabbitMQ 3.6.2
RAM usage
`rabbitmqctl status`
`rabbitmqctl list_queues name messages memory consumers`
rabbitmq_top
`rabbitmqctl list_connections | wc -l`
`rabbitmqctl list_channels | wc -l`
Reduce TCP buffer size: RabbitMQ Networking guide
To enforce a per-connection channel limit, use `rabbit.channel_max`.
Unresponsive nodes
`rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'`
Pivotal & Erlang Solutions contributed a few Mnesia deadlock fixes in
Erlang/OTP 18.3.1 and 19.0
TCP connections are rejected
Ensure traffic on RabbitMQ ports is accepted by firewall
Ensure RabbitMQ listens on correct network interfaces
Check open file handles limit (defaults on Linux are completely inadequate)
TCP connection backlog size: rabbit.tcp_listen_options.backlog, net.core.somaxconn
Consult RabbitMQ logs for authentication and authorization errors
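The first two checks (firewall and listening interface) can be probed from a client host with a plain TCP connection attempt. The sketch below only tells whether the TCP handshake succeeds; it says nothing about authentication or authorization, which still requires the RabbitMQ logs. The host name in the usage part is a placeholder, and 5672 is the default AMQP port.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Hypothetical broker address; 5672 is the default AMQP port.
    print(can_connect("rabbit.example.org", 5672))
```

Running it against each interface address the services are configured to use helps separate "RabbitMQ listens on the wrong interface" from "a firewall drops the traffic".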
TLS connections fail
Deserves a talk of its own
See log files
`openssl s_client` (`man 1 s_client`)
`openssl s_server` (`man 1 s_server`)
Ensure peer CA certificate is trusted and verification depth is sufficient
Troubleshooting TLS on rabbitmq.com
Run Erlang 17.5 or 18.3.1
Message payload inspection
Message tracing: `rabbitmqctl trace_on -p my-vhost`, amq.rabbitmq.trace
rabbitmq_tracing
Tracing puts *very* high load on the system
Wireshark (tcpdump, …)
Higher than expected latency
Wireshark (tcpdump, …)
strace, DTrace, …
Erlang VM scheduler-to-core binding (pinning)
General remarks
Guessing is not effective (or efficient)
Use tools to gather more data
Always consult log files
Ask on rabbitmq-users
Thank you
@michaelklishin
rabbitmq-users

Troubleshooting common oslo.messaging and RabbitMQ issues

  • 1. Identifying (and fixing) oslo.messaging & RabbitMQ issues Michael Klishin, Pivotal Dmitry Mescheryakov, Mirantis
  • 2. What is oslo.messaging? ● Library for ○ building RPC clients/servers ○ emitting/handling notifications
  • 3. What is oslo.messaging? ● Library for ○ building RPC clients/servers ○ emitting/handling notifications ● Supports several backends: ○ RabbitMQ ■ based on Kombu - the oldest and most well known (and we will speak about it) ■ based on Pika - recent addition ○ AMQP 1.0
  • 4. What is oslo.messaging? ● Library for ○ building RPC clients/servers ○ emitting/handling notifications ● Supports several backends: ○ RabbitMQ ■ based on Kombu - the oldest and most well known (and we will speak about it) ■ based on Pika - recent addition ○ AMQP 1.0
  • 5. What is oslo.messaging? ● Library for ○ building RPC clients/servers ○ emitting/handling notifications ● Supports several backends: ○ RabbitMQ ■ based on Kombu - the oldest and most well known (and we will speak about it) ■ based on Pika - recent addition ○ AMQP 1.0
  • 6. Spawning a VM in Nova nova-api nova-api nova-api nova- conductor nova- conductor nova- scheduler nova- scheduler nova- scheduler nova- compute nova- compute nova- compute nova- compute Client HTTP RPC
  • 7. Examples Internal: ● nova-compute sends a report to nova-conductor every minute ● nova-conductor sends a command to spawn a VM to nova-compute ● neutron-l3-agent requests router list from neutron-server ● …
  • 8. Examples Internal: ● nova-compute sends a report to nova-conductor every minute ● nova-conductor sends a command to spawn a VM to nova-compute ● neutron-l3-agent requests router list from neutron-server ● … External: ● Every OpenStack service sends notifications to Ceilometer
  • 9. Where is RabbitMQ in this picture? nova- conductor nova- compute RabbitMQ compute.node-1.domain.tld reply_b6686f7be58b4773a2e0f5475368d19a request response RPC
  • 11. Spotting oslo.messaging logs 2016-04-15 11:16:57.239 16181 DEBUG nova.service [req-d83ae554-7ef5-4299- 82ce-3f70b00b6490 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/dist-packages/nova/service.py:218 2016-04-15 11:16:57.258 16181 DEBUG oslo.messaging._drivers.pool [req- d83ae554-7ef5-4299-82ce-3f70b00b6490 - - - - -] Pool creating new connection create /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/pool.py:109
  • 12. ... File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send result = self._waiter.wait(msg_id, timeout) File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait message = self.waiters.get(msg_id, timeout=timeout) File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get 'to message ID %s' % msg_id) MessagingTimeout: Timed out waiting for a reply to message ID 9e4a677887134a0cbc134649cd46d1ce My favorite oslo.messaging exception
  • 13. oslo.messaging operations ● Cast - fire RPC request and forget about it ● Notify - the same, only format is different ● Call - send RPC request and receive reply Call throws a MessagingTimeout exception when a reply isn’t received in a certain amount of time
  • 14. Making a Call 1. Client -> request -> RabbitMQ 2. RabbitMQ -> request -> Server 3. Server processes the request and produces the response 4. Server -> response -> RabbitMQ 5. RabbitMQ -> response -> Client If the process gets stuck on any step from 2 to 5, client gets a MessagingTimeout exception.
  • 15. Debug shows the truth L3 Agent log CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron topic: q-reports-plugin received reply msg_id: ae63b165611f439098f1461f906270de * Examples from Mitaka
  • 16. Debug shows the truth L3 Agent log CALL msg_id: ae63b165611f439098f1461f906270de exchange: neutron topic: q-reports-plugin received reply msg_id: ae63b165611f439098f1461f906270de Neutron Server received message msg_id: ae63b165611f439098f1461f906270de reply to: reply_df2405440ffb40969a2f52c769f72e30 REPLY msg_id: ae63b165611f439098f1461f906270de reply queue: reply_df2405440ffb40969a2f52c769f72e30 * Examples from Mitaka
  • 19. If you don’t have debug enabled Examine the stack trace Find which operation failed Guess the destination service Try to find correlating log entries around the time the request was made
  • 20. If you don’t have debug enabled Examine the stack trace Find which operation failed Guess the destination service Try to find correlating log entries around the time the request was made File "/opt/stack/neutron/neutron/agent/dhcp/agent.py", line 571, in _report_state self.state_rpc.report_state(ctx, self.agent_state, self.use_call) File "/opt/stack/neutron/neutron/agent/rpc.py", line 86, in report_state return method(context, 'report_state', **kwargs) File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
  • 21. Diagnosing issues through RabbitMQ ● # rabbitmqctl list_queues consumers name 0 consumers indicate that nobody listens to the queue ● # rabbitmqctl list_queues messages consumers name If a queue has consumers, but also messages are accumulating there. It means that the corresponding service can not process messages in time or got stuck in a deadlock or cluster is partitioned
  • 22. Checking RabbitMQ cluster for integrity # rabbitmqctl cluster_status Check that its output contains all the nodes in the cluster. You might find that your cluster is partitioned. Partitioning is a good reason for some messages to get stuck in queues.
  • 23. How to fix such issues For RabbitMQ issues including partitioning, see RabbitMQ docs Restart of the affected services helps in most cases
  • 24. How to fix such issues For RabbitMQ issues including partitioning, see RabbitMQ docs Restart of the affected services helps in most cases Force close connections using `rabbitmqctl` or HTTP API
  • 25. Never set amqp_auto_delete = true Use a queue expiration policy instead, with a TTL of at least 1 minute Starting from Mitaka all by default auto-delete queues were replaced with expiring ones
  • 27. Queue mirroring is quite expensive Out testing shows 2x drop in throughput on 3-node cluster with ‘ha-mode: all’ policy comparing with non-mirrored queues. RPC can live without it But notifications might be too important (if used for billing) In later case enable mirroring for notification queues only (example in Fuel)
  • 28. Use different backends for RPC and Notifications Different drivers * Available starting from Mitaka
  • 29. Use different backends for RPC and Notifications Different drivers Same driver. For example: RPC messages go through one RabbitMQ cluster Notification messages go through another RabbitMQ cluster * Available starting from Mitaka
  • 30. Use different backends for RPC and Notifications Different drivers Same driver. For example: RPC messages go through one RabbitMQ cluster Notification messages go through another RabbitMQ cluster Implementation (non-documented) * Available starting from Mitaka
  • 34. Erlang VM process disappears
  • 35. Erlang VM process disappears Syslog, kern.log, /var/log/messages: grep for “killed process”
  • 36. Erlang VM process disappears Syslog, kern.log, /var/log/messages: grep for “killed process” “Cannot allocate 1117203264527168 bytes of memory (of type …)” — move to Erlang 17.5 or 18.3
  • 39. RAM usage `rabbitmqctl status` `rabbitmqctl list_queues name messages memory consumers`
  • 41. Stats DB overload Connections, channels, queues, and nodes emit stats on a timer
  • 42. Stats DB overload Connections, channels, queues, and nodes emit stats on a timer With a lot of those the stats DB collector can fall behind
  • 43. Stats DB overload Connections, channels, queues, and nodes emit stats on a timer With a lot of those the stats DB collector can fall behind `rabbitmqctl status` reports most RAM used by `mgmt_db`
  • 44. Stats DB overload Connections, channels, queues, and nodes emit stats on a timer With a lot of those the stats DB collector can fall behind `rabbitmqctl status` reports most RAM used by `mgmt_db` You can reset it: `rabbitmqctl eval ‘exit(erlang:whereis(rabbit_mgmt_db), please_terminate).’`
  • 45. Stats DB overload Connections, channels, queues, and nodes emit stats on a timer With a lot of those the stats DB collector can fall behind `rabbitmqctl status` reports most RAM used by `mgmt_db` You can reset it: `rabbitmqctl eval ‘exit(erlang:whereis(rabbit_mgmt_db), please_terminate).’` Resetting is a safe thing to do but may confuse your monitoring tools
  • 46. Stats DB overload Connections, channels, queues, and nodes emit stats on a timer With a lot of those the stats DB collector can fall behind `rabbitmqctl status` reports most RAM used by `mgmt_db` You can reset it: `rabbitmqctl eval ‘exit(erlang:whereis(rabbit_mgmt_db), please_terminate).’` Resetting is a safe thing to do but may confuse your monitoring tools New better parallelized event collector coming in RabbitMQ 3.6.2
  • 51. RAM usage: `rabbitmqctl status`; `rabbitmqctl list_queues name messages memory consumers`; rabbitmq_top; `rabbitmqctl list_connections | wc -l`; `rabbitmqctl list_channels | wc -l`. Reduce TCP buffer size (see the RabbitMQ Networking guide). To enforce a per-connection channel limit, use `rabbit.channel_max`.
  • 54. Unresponsive nodes: `rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'` Pivotal and Erlang Solutions contributed a few Mnesia deadlock fixes in Erlang/OTP 18.3.1 and 19.0.
  • 60. TCP connections are rejected: Ensure traffic on RabbitMQ ports is accepted by the firewall. Ensure RabbitMQ listens on the correct network interfaces. Check the open file handles limit (defaults on Linux are completely inadequate). Check the TCP connection backlog size: `rabbit.tcp_listen_options.backlog`, `net.core.somaxconn`. Consult RabbitMQ logs for authentication and authorization errors.
  • 68. TLS connections fail: Deserves a talk of its own. See log files. Use `openssl s_client` (`man 1 s_client`) and `openssl s_server` (`man 1 s_server`). Ensure the peer CA certificate is trusted and verification depth is sufficient. See Troubleshooting TLS on rabbitmq.com. Run Erlang 17.5 or 18.3.1.
  • 73. Message payload inspection: Message tracing: `rabbitmqctl trace_on -p my-vhost`, amq.rabbitmq.trace; rabbitmq_tracing. Tracing puts *very* high load on the system. Wireshark (tcpdump, …).
  • 77. Higher than expected latency: Wireshark (tcpdump, …); strace, DTrace, …; Erlang VM scheduler-to-core binding (pinning).
  • 82. General remarks: Guessing is not effective (or efficient). Use tools to gather more data. Always consult log files. Ask on rabbitmq-users.
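
In the spirit of "use tools to gather more data", the RAM usage check `rabbitmqctl list_queues name messages memory consumers` can be post-processed with a few lines of Python. This is only a minimal sketch, not part of the talk. It assumes the command prints one queue per line with four tab-separated columns in the requested order, and that any banner or header lines can be skipped. The helper names (`parse_list_queues`, `top_by_memory`, `unconsumed_backlogs`), the `notifications.info` queue, and all numbers in `SAMPLE` are illustrative assumptions.

```python
from typing import List, NamedTuple


class QueueStats(NamedTuple):
    name: str
    messages: int
    memory: int      # bytes, as reported by rabbitmqctl
    consumers: int


def parse_list_queues(output: str) -> List[QueueStats]:
    """Parse output of `rabbitmqctl list_queues name messages memory consumers`.

    Assumption: four tab-separated columns per queue, in the order requested.
    Lines that do not fit (banners, header rows) are skipped.
    """
    stats = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue
        name, messages, memory, consumers = fields
        try:
            stats.append(
                QueueStats(name, int(messages), int(memory), int(consumers))
            )
        except ValueError:
            continue  # non-numeric columns: header row or unexpected content
    return stats


def top_by_memory(stats: List[QueueStats], limit: int = 5) -> List[QueueStats]:
    """Return the queues that use the most RAM, largest first."""
    return sorted(stats, key=lambda q: q.memory, reverse=True)[:limit]


def unconsumed_backlogs(stats: List[QueueStats]) -> List[QueueStats]:
    """Return queues that hold messages but have no consumers."""
    return [q for q in stats if q.messages > 0 and q.consumers == 0]


# Illustrative input only; the numbers are made up.
SAMPLE = (
    "Listing queues ...\n"
    "compute.node-1.domain.tld\t0\t14232\t1\n"
    "reply_b6686f7be58b4773a2e0f5475368d19a\t120\t901344\t0\n"
    "notifications.info\t5000\t7340032\t0\n"
)

if __name__ == "__main__":
    queues = parse_list_queues(SAMPLE)
    for q in top_by_memory(queues, limit=3):
        print(q.name, q.memory, "bytes")
    for q in unconsumed_backlogs(queues):
        print("no consumers:", q.name, q.messages, "messages")
```

On a real node the `SAMPLE` string would be replaced with the captured output of the command. Queues with a growing backlog and no consumers are usually the first thing to look at.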

Editor's Notes

  • #16: Casts don’t have message id, but are distinguished by a unique_id
  • #17: Casts don’t have message id, but are distinguished by a unique_id
  • #23: Depends on which partitions the sender and the listener are connected to.