How to monitor your micro-service with Prometheus?

How to design the metrics?
WOJCIECH BARCZYŃSKI - SMACC.IO | 2 OCTOBER 2018

ABOUT ME
Lead Software Developer - SMACC (FinTech/AI)
Before:
System Engineer and Developer - Lyke
Before:
1000+ nodes, 20 data centers with Openstack
Point of view:
Startups, fast-moving environment

OBSERVABILITY
Monitoring
Logging
Tracing

OBSERVABILITY
Go for Industrial Programming by Peter Bourgon

NOT A SILVER-BULLET
but:
Easy to setup
Immediate value
Surprisingly: the last one implemented

CENTRALIZED LOGGING
Usually much too late
Post-mortem
Hard to find the needle
Like debugging vs. testing

MONITORING
Numbers
Trends
Dependencies
+ Actions

METRIC
Name: traefik_requests_total
Labels: code="200", method="GET"
Value: 3001
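As a rough illustration of how name, labels, and value fit together, the sketch below renders one sample as a line of the Prometheus text format. The helper name `format_sample` is hypothetical and not part of any client library.

```python
def format_sample(name, labels, value):
    """Render one sample as `name{label="value",...} value`."""
    # Sort the labels so the output is deterministic.
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    if label_str:
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


line = format_sample("traefik_requests_total", {"code": "200", "method": "GET"}, 3001)
print(line)  # traefik_requests_total{code="200",method="GET"} 3001
```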

MONITORING
Example from couchbase blog

HOW TO FIND THE RIGHT METRIC?
USE
RED

USE
Utilization: the average time that the resource was
busy servicing work
Saturation: extra work which it can't service, often
queued
Errors: the count of error events
Documented and promoted by Brendan Gregg

USE
Utilization: as a percent over a time interval: "one
disk is running at 90% utilization".
Saturation:
Errors:

USE
Utilization:
Saturation: as a queue length. eg, "the CPUs have
an average run queue length of four".
Errors:

USE
Utilization:
Saturation:
Errors: scalar counts, e.g., "this network interface
drops packets".

USE
traditionally more instance-oriented
still useful in the microservices world
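The three USE numbers can be sketched for a single resource as follows. This is a minimal illustration with made-up input values; the function name `use_metrics` and its parameters are assumptions, not an existing API.

```python
def use_metrics(busy_seconds, interval_seconds, queue_lengths, error_count):
    """Summarize one resource over one time interval in USE terms."""
    return {
        # Utilization: share of the interval the resource was busy, in percent.
        "utilization_pct": 100.0 * busy_seconds / interval_seconds,
        # Saturation: average length of the queue of waiting work.
        "saturation_avg_queue": sum(queue_lengths) / len(queue_lengths),
        # Errors: a plain scalar count of error events.
        "errors": error_count,
    }


# Hypothetical disk: busy 54 s out of 60 s, four queue-length samples, two errors.
disk = use_metrics(busy_seconds=54, interval_seconds=60,
                   queue_lengths=[3, 5, 4, 4], error_count=2)
print(disk)
```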

RED
Rate How busy is your service?
Error Errors
Duration What is the latency of my service?
Tom Wilkie's guideline for instrumenting applications

RED
Rate - how many requests per second are handled
Error
Duration (distribution)

RED
Rate
Error - how many requests per second failed
Duration

RED
Rate
Error
Duration - how long the requests took
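To make the three RED numbers concrete, here is a small sketch that derives them from a list of handled requests. The record layout, the choice of the 90th percentile, and treating status codes of 500 and above as failures are assumptions made for the example.

```python
def red_metrics(requests, window_seconds):
    """requests: non-empty list of (status_code, duration_seconds) seen in the window."""
    durations = sorted(duration for _, duration in requests)
    failed = sum(1 for code, _ in requests if code >= 500)
    # Nearest-rank 90th percentile: ceil(0.9 * n), computed with integers.
    rank = (9 * len(durations) + 9) // 10
    return {
        "rate_per_second": len(requests) / window_seconds,
        "errors_per_second": failed / window_seconds,
        "duration_p90_seconds": durations[rank - 1],
    }


# Ten hypothetical requests observed during a 5-second window.
observed = [
    (200, 0.1), (200, 0.2), (200, 0.3), (500, 0.4), (200, 0.5),
    (200, 0.6), (200, 0.7), (503, 0.8), (200, 0.9), (200, 1.0),
]
red = red_metrics(observed, window_seconds=5)
print(red)
```

Reporting a percentile rather than the mean follows the "Duration (distribution)" hint: a few slow requests are visible in the p90 but can disappear in an average.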

RED
Follow Four Golden Signals by Google SREs [1]
Focus on what matters for end-users
[1] Latency, Traffic, Errors, Saturation (src)

RED
Not recommended for:
batch-oriented
streaming services

WHAT IS PROMETHEUS?
Aggregation of time-series data
Not an event-based system

PROMETHEUS STACK
Prometheus - collect
Alertmanager - alerts
Grafana - visualize

PROMETHEUS
Wide support for languages
Metrics collected over HTTP
Pull model (see scrape time), push-mode possible
integration with k8s
PromQL
metrics/

METRICS IN PLAIN TEXT
# HELP order_mgmt_audit_duration_seconds Multiprocess metric
# TYPE order_mgmt_audit_duration_seconds summary
order_mgmt_audit_duration_seconds_count{status_code="200"} 41.
order_mgmt_audit_duration_seconds_sum{status_code="200"} 27.44
order_mgmt_audit_duration_seconds_count{status_code="500"} 1.0
order_mgmt_audit_duration_seconds_sum{status_code="500"} 0.716
# HELP order_mgmt_duration_seconds Multiprocess metric
# TYPE order_mgmt_duration_seconds summary
order_mgmt_duration_seconds_count{method="GET",path="/complex"
order_mgmt_duration_seconds_sum{method="GET",path="/complex",s
order_mgmt_duration_seconds_count{method="GET",path="/",status
order_mgmt_duration_seconds_sum{method="GET",path="/",status_c
order_mgmt_duration_seconds_count{method="GET",path="/complex"
order_mgmt_duration_seconds_sum{method="GET",path="/complex",s

METRICS IN PLAIN TEXT
# HELP go_gc_duration_seconds A summary of the GC invocation d
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 9.01e-05
go_gc_duration_seconds{quantile="0.25"} 0.000141101
go_gc_duration_seconds{quantile="1"} 0.006099658
go_gc_duration_seconds_sum 18.749046756
go_gc_duration_seconds_count 89273
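The plain-text format is simple enough to read with a few lines of code. The sketch below is a simplified reader, assuming no escaped quotes in label values and no timestamps; `parse_samples` is a hypothetical helper, and for real use the parser that ships with a client library is the safer choice.

```python
import re

# name, optional {labels}, then a single value at the end of the line
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value); comment and unparsable lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


text = """\
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 9.01e-05
go_gc_duration_seconds{quantile="0.25"} 0.000141101
go_gc_duration_seconds{quantile="1"} 0.006099658
go_gc_duration_seconds_sum 18.749046756
go_gc_duration_seconds_count 89273
"""
samples = parse_samples(text)
values = {name: value for name, labels, value in samples if not labels}
# Average GC pause = _sum / _count
average_pause = values["go_gc_duration_seconds_sum"] / values["go_gc_duration_seconds_count"]
print(average_pause)
```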

EXPORTERS
Mongodb
Mysql
Postgresql
rabbitmq
...
also Blackbox exporter
examples: memcached, psql

CLOUD-NATIVE PROJECTS INTEGRATION
[Diagram: requests from the internet (api.domain.com, domain.com/web, backoffice.domain.com) are routed to the API, WEB, ADMIN, DATA and BACKOFFICE 1-3 services on the private network; the proxy listens to the orchestrator API (Docker, Swarm, Mesos...)]
- --web.metrics.prometheus

PROMETHEUS PromQL
working with histograms:
histogram_quantile(0.9,
rate(http_req_duration_seconds_bucket[10m]
rates:
rate(http_requests_total{job="api-server"}[5
irate(http_requests_total{job="api-server"}
more complex:
predict_linear()
holt_winters()
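What a per-second rate over a counter means can be sketched in a few lines. This is a simplified model, not the actual PromQL implementation: the real rate() also extrapolates to the edges of the time window. The function name `simple_rate` and the sample values are made up for illustration.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first,
    with strictly increasing timestamps."""
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so it was reset (e.g. the process restarted):
            # count the new value as the increase since the reset.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Scrapes every 15 s; the counter is reset between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(scrapes))  # 2.0
```

The reset handling is the reason counters only ever go up: a drop can be interpreted unambiguously as a restart.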

PROMETHEUS PromQL
Alerting:
ALERT ProductionAppServiceInstanceDown
IF up { environment = "production", app =~ ".+"} == 0
FOR 4m
ANNOTATIONS {
summary = "Instance of {{$labels.app}} is down",
description = " Instance {{$labels.instance}} of app
}
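The effect of the FOR clause can be sketched as a small state machine: the alert stays pending until the condition has held for the whole duration, and only then fires. This is a simplified model evaluated once per sample; the function name and the sample data are assumptions for illustration.

```python
def alert_states(up_samples, for_seconds):
    """up_samples: list of (timestamp_seconds, up_value), oldest first.
    Returns one of "inactive", "pending" or "firing" per sample."""
    states = []
    down_since = None
    for timestamp, up in up_samples:
        if up == 0:
            if down_since is None:
                down_since = timestamp  # condition became true at this sample
            held_for = timestamp - down_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            down_since = None  # condition no longer holds, start over
            states.append("inactive")
    return states


# One sample per minute; the instance is down from t=60 to t=300.
samples = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 0), (360, 1)]
states = alert_states(samples, for_seconds=4 * 60)
print(states)
```

A short blip that recovers within four minutes never leaves the pending state, which is what keeps such a rule from paging on every restart.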

METRICS
Counter - just up
Gauge - up/down
Histogram
Summary

HISTOGRAM
traefik_duration_seconds_bucket
{method="GET",code="200"}
{le="0.1"} 2229
{le="0.3"} 107
{le="1.2"} 100
{le="5"} 4
{le="+Inf"} 2
_sum
_count 2342
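In the real exposition format the `le` buckets are cumulative: each bucket counts every observation less than or equal to its upper bound. The sketch below shows how a quantile can be estimated from such buckets by linear interpolation, which is the idea behind histogram_quantile(). It is a simplified model with hypothetical bucket values, not the actual PromQL code.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into +Inf: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Hypothetical cumulative bucket counts for a request duration histogram.
buckets = [(0.1, 50), (0.3, 80), (1.2, 95), (5.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, buckets)
p90 = histogram_quantile(0.9, buckets)
print(p50, p90)
```

The estimate is only as precise as the bucket layout: here the 90th percentile lands somewhere between 0.3 and 1.2 seconds, and the interpolation picks a point inside that range.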

SUMMARY
http_request_duration_seconds
{quantile="0.5"} 4
{quantile="0.9"} 5
http_request_duration_seconds_sum 9
http_request_duration_seconds_count 3

HISTOGRAM / SUMMARY:
Latency of services
Request or response size
Histograms recommended

RED
Metric + PromQL:
sum(irate(order_mgmt_duration_seconds_count
{job=~".*"}[1m])) by (status_code)

METRIC AND LABEL NAMING
Best practices on metric names:
service name is your prefix, e.g. user_
state the base unit, e.g. _seconds and _bytes

PYTHON CLIENT
client_python
Counter
Gauge
Summary
Histogram
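A minimal sketch of the four metric types with client_python, assuming the `prometheus_client` package is installed. The metric names with the `demo_` prefix and the `handle_request` function are made up for this example and are not taken from the demo service.

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, Summary, generate_latest,
)

# A dedicated registry keeps the example isolated from the global default one.
registry = CollectorRegistry()

REQUESTS = Counter("demo_requests_total", "Handled requests",
                   ["method", "status_code"], registry=registry)
IN_PROGRESS = Gauge("demo_in_progress_requests", "Requests being handled",
                    registry=registry)
REQUEST_DURATION = Histogram("demo_request_duration_seconds", "Request duration",
                             ["path"], registry=registry)
AUDIT_DURATION = Summary("demo_audit_duration_seconds", "Audit call duration",
                         registry=registry)


def handle_request(path, duration_seconds, status_code="200"):
    """Record one (already measured) request in all four metrics."""
    IN_PROGRESS.inc()           # gauge goes up while the request is handled
    try:
        REQUEST_DURATION.labels(path=path).observe(duration_seconds)
        AUDIT_DURATION.observe(duration_seconds / 10)
        REQUESTS.labels(method="GET", status_code=status_code).inc()
    finally:
        IN_PROGRESS.dec()       # and down again when it is done


handle_request("/complex", 0.25)
handle_request("/complex", 1.5, status_code="500")

# The same text a /metrics endpoint would serve for this registry.
print(generate_latest(registry).decode())
```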

DEMO: SIMPLE REST SERVICE
 -----------         ---------------
|    App    | ----->| Audit Service |
| OrderMgmt |       |               |
 -----------         ---------------
      |
      |              ---------------
      ------------->|   Database    |
                     ---------------

DEMO:
- service
- prometheus
- grafana
- alertmanager
http://127.0.0.1:8080
http://127.0.0.1:8080/metrics/

DEMO
☁ src ⚡ make docker_run
☁ src ⚡ docker ps
CONTAINER ID IMAGE PORTS
5f824d1bc789 grafana/grafana:5.2.2 0.0.0.0:3000->3
d681a414a8b6 prom/prometheus:v2.1.0 0.0.0.0:9090->9
ea0d9233e159 prom/alertmanager:v0.15.1 0.0.0.0:9093->9

DEMO: GENERATE CALLS
With error injection
☁ src ⚡ make srv_wrk_random

KILL THE SERVICE
☁ src ⚡ docker stop pycode-prom-flask_order-manager_1

DEMO: PYTHON CODE
Metric Definition
Metric Collection

DEMO: SIMULATING CALLS
make docker_build
make docker_run

curl 127.0.0.1:8080/hello
curl 127.0.0.1:8080/world
curl 127.0.0.1:8080/complex

# quote URLs with query strings so the shell does not interpret ? and &
curl "127.0.0.1:8080/complex?is_srv_error=True"
curl "127.0.0.1:8080/complex?is_db_error=True"
curl "127.0.0.1:8080/complex?db_sleep=3&srv_sleep=2"
# load generator
make srv_wrk_random

DEMO: PROM STACK
Prometheus dashboard and config
AlertManager dashboard and config
Simulate the successful and failed calls
Simple Queries for rate

PromQL
sum(irate(order_mgmt_duration_seconds_count{job=~".*"}[1m]))
by (status_code)

PromQL
order_mgmt_duration_seconds_sum{job=~".*"} or
order_mgmt_database_duration_seconds_sum{job=~".*"} or
order_mgmt_audit_duration_seconds_sum{job=~".*"}

BEST PRACTISES
Py: higher load requires multiprocessing
Start simple (up/down), later add more complex
rules
Summing or averaging Summary quantiles leads to incorrect
results, see the Prometheus docs
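Why aggregating precomputed quantiles goes wrong can be shown with two made-up instances of the same service. The latency values and the nearest-rank `p90` helper are assumptions for illustration.

```python
def p90(values):
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) with integers
    return ordered[rank - 1]


# Hypothetical request latencies in seconds.
instance_a = [0.1] * 90 + [0.2] * 10   # busy and fast: 100 requests
instance_b = [1.0] * 5 + [3.0] * 5     # quiet and slow: 10 requests

averaged = (p90(instance_a) + p90(instance_b)) / 2   # what averaging quantiles gives
actual = p90(instance_a + instance_b)                # p90 over all requests

print(averaged, actual)
```

The averaged value is dominated by the quiet instance even though it served a small share of the traffic. Histogram buckets do not have this problem, because bucket counts from several instances can be summed before the quantile is estimated.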

SUMMARY
Monitoring saves your time
Checking logs in Kibana vs. dashboards in Grafana is like
debugging vs. having tests
Logging -> high TCO

SUMMARY
Testing
Testing in Production
Smoke tests / Acceptance Tests
Monitoring Simple
(up/down + KPI)
Monitoring
Explorations / Logs

PROMETHEUS - LABELS IN ALERT RULES
The labels are propagated to alert rules:
see ../src/prometheus/etc/alert.rules
ALERT ProductionAppServiceInstanceDown
IF up { environment = "production", app =~ ".+"} == 0
FOR 4m
ANNOTATIONS {
summary = "Instance of {{$labels.app}} is down",
description = " Instance {{$labels.instance}} of app
}

ALERTMANAGER - LABELS IN ALERTMANAGER
Call somebody if the label is severity=page:
see ../src/alertmanager/*.conf
---
group_by: [cluster]
# If an alert isn't caught by a route, send it to the pager.
receiver: team-pager
routes:
  - match:
      severity: page
    receiver: team-pager
receivers:
  - name: team-pager
    opsgenie_configs:
      - api_key: $API_KEY
        teams: example_team

PROMETHEUS - PUSH MODEL
Good for short-lived jobs in your cluster.
See: https://prometheus.io/docs/instrumenting/pushing/

DESIGNING METRIC NAMES
Which one is better?
request_duration{app="my_app"}
my_app_request_duration
see the documentation on best practices for metric naming and instrumentation

DESIGNING METRIC NAMES
Which one is better?
order_mgmt_db_duration_seconds_sum
order_mgmt_duration_seconds_sum{dep_name='db

PROMETHEUS + K8S = <3
LABELS ARE PROPAGATED FROM K8S TO
PROMETHEUS

INTEGRATION WITH PROMETHEUS
cat memcached-0-service.yaml
https://github.com/skarab7/kubernetes-memcached
---
apiVersion: v1
kind: Service
metadata:
  name: memcached-0
  labels:
    app: memcached
    kubernetes.io/name: "memcached"
    role: shard-0
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/scheme: "http"
    prometheus.io/path: "metrics"
    prometheus.io/port: "9150"
spec:
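The prometheus.io/* annotations are a convention: Prometheus does not read them by itself, the relabeling rules in its Kubernetes scrape configuration turn them into a scrape target. The sketch below only mimics that mapping to show what the annotations mean; the function name, the default values, and the address are assumptions for illustration.

```python
def scrape_url(address, annotations):
    """Build the URL that would be scraped, or None if scraping is not requested."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Defaults below are assumptions for this sketch.
    scheme = annotations.get("prometheus.io/scheme", "http")
    path = annotations.get("prometheus.io/path", "/metrics")
    port = annotations.get("prometheus.io/port")
    host = f"{address}:{port}" if port else address
    return f"{scheme}://{host}/{path.lstrip('/')}"


annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/scheme": "http",
    "prometheus.io/path": "metrics",
    "prometheus.io/port": "9150",
}
# 10.0.0.5 is a made-up service address.
print(scrape_url("10.0.0.5", annotations))  # http://10.0.0.5:9150/metrics
```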
