London devops logging

Practical logstash - beyond the basics.
Tomas Doran (t0m) <bobtfish@bobtfish.net>

Who are you

• Sysadmin at TIM Group
• t0m on irc.freenode.net
• twitter.com/bobtfish
• github.com/bobtfish
• slideshare.com/bobtfish

Logstash
• I hope you already know what logstash is?

Logstash
• I’m going to talk about our implementation.

Logstash
• Elasticsearch
• Metrics
• Nagios
• Riemann

> 55 million messages a day

• Now ~30GB of indexed data per day
• All our applications
• All of syslog
• Used by developers and product managers
• 2 x DL360s with 8 x 600GB discs, also running the graphite install

About 4 months old

• Almost all apps onboard to various levels
• All of syslog was easy
• Still haven’t done apache logs
• Haven’t comprehensively done routers/switches
• Lots of apps still emit directly to graphite

Java

• All our apps are Java / Scala / Clojure

Java

• https://github.com/tlrx/slf4j-logback-zeromq

Java

• Own layer (x2: 1 Java, 1 Scala) for sending structured events as JSON
• Java developers hate native code

On host log collector

• Need a lightweight log shipper.
• VMs with 1GB of RAM.

• Message::Passing - perl library I wrote.
• Small, light, pluggable

• Application to logcollector is ZMQ
  • Small amount of buffering (1000 messages)
• logcollector to logstash is ZMQ
  • Large amount of buffering (disc offload, 100s of thousands of messages)

ZeroMQ has the correct semantics
• Pub/Sub sockets
• Never, ever blocking
• Lossy! (If needed)
• Buffer sizes / locations configurable
• Arbitrary message size
• IO done in a background thread (nice in interpreted languages - ruby/perl/python)
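
The "never blocking" and "lossy if needed" points can be modelled in a few lines. This is a sketch of the semantics only, not ZeroMQ itself: the class name LossyBuffer and its methods are made up for illustration, and the limit of 1000 is borrowed from the application-side buffer mentioned earlier. The sender never waits; once the buffer is full, new messages are dropped and counted.

```python
from collections import deque


class LossyBuffer:
    """Bounded, non-blocking send buffer (a model of the idea, not ZeroMQ)."""

    def __init__(self, high_water_mark):
        self.high_water_mark = high_water_mark
        self.queue = deque()
        self.dropped = 0

    def send(self, message):
        """Queue a message without blocking; drop it if the buffer is full."""
        if len(self.queue) >= self.high_water_mark:
            self.dropped += 1
            return False
        self.queue.append(message)
        return True

    def drain(self):
        """Return every queued message and leave the buffer empty."""
        messages = list(self.queue)
        self.queue.clear()
        return messages


# The consumer is "down" while 1005 events arrive at a buffer of 1000.
buf = LossyBuffer(high_water_mark=1000)
results = [buf.send(f"event {i}") for i in range(1005)]
```

The application carries on logging regardless of whether anything is listening, which is the property that matters for a log shipper.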

What, no AMQP?

• Could go logcollector => AMQP => logstash for extra durability
• ZMQ buffering ‘good enough’
• logstash uses a pure ruby AMQP decoder
• Slooooowwwwww

Reliability

• Multiple Elasticsearch servers (obvious)!
• Due to ZMQ buffering, you can:
  • restart logstash, messages just buffer on hosts whilst it’s unavailable
  • restart logcollector, messages from apps buffer (lose some syslog)

Reliability: TODO

• Elasticsearch cluster getting sick happens
• In-flight messages in logstash lost :(
• Solution - elasticsearch_river output
  • logstash => durable RabbitMQ queue
  • ES reads from queue
  • Also faster - uses bulk API

Redundancy
• Add a UUID to each message at emission point.
• Index in elasticsearch by UUID
• Emit to two backend logstash instances (TODO)
• Index everything twice! (TODO)
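
Why the UUID makes "index everything twice" safe, as a minimal sketch. The field name "uuid", the emit function and the Index class are hypothetical; the dict stands in for an index keyed by document ID. An event that arrives over two paths lands on the same key, so it is stored once.

```python
import uuid


def emit(message):
    """Attach a UUID at the emission point (field name is hypothetical)."""
    return {"uuid": str(uuid.uuid4()), "message": message}


class Index:
    """Stand-in for an index that uses the event UUID as the document ID."""

    def __init__(self):
        self.docs = {}

    def index(self, event):
        # Indexing the same ID again overwrites rather than duplicates.
        self.docs[event["uuid"]] = event


event = emit("user logged in")
store = Index()
for _ in range(2):  # the same event delivered by two backend instances
    store.index(event)
```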

Elasticsearch optimisation
• You need a template
  • compress source
  • disable _all
  • discard unwanted fields from source / indexing
  • tweak shards and replicas
• compact yesterday’s index at end of day!
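
A rough idea of what such a template might look like, built as a Python dict and serialised to JSON. Treat it as a sketch under assumptions: the option names are the ones Elasticsearch releases used around the time of this talk (later versions changed or removed several, so check your version's documentation), the "logstash-*" pattern assumes logstash's default daily index naming, and the shard and replica counts are placeholders rather than our real settings.

```python
import json

# Index template as a plain dict. Shard and replica counts are placeholders.
template = {
    "template": "logstash-*",           # apply to every daily logstash index
    "settings": {
        "index.number_of_shards": 5,    # placeholder value
        "index.number_of_replicas": 1,  # placeholder value
    },
    "mappings": {
        "_default_": {
            "_all": {"enabled": False},     # disable _all
            "_source": {"compress": True},  # compress source
        }
    },
}

# JSON body that would be sent to the index template API.
body = json.dumps(template, indent=2, sort_keys=True)
```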

Elasticsearch size
• 87 daily indexes
• 800GB of data (per instance)
• Just bumped ES heap to 22GB
  • Just writing data - 2GB
  • Query over all indexes - 17GB!
• Hang on - 800/87 does not = 33GB/day!
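
The arithmetic behind that last bullet, using only the numbers on the slide. The long-run average works out at roughly 9GB a day, well under the quoted daily figure.

```python
total_gb = 800          # data held per Elasticsearch instance
daily_indexes = 87      # one index per day
quoted_gb_per_day = 33  # the daily figure quoted on the slide

average_gb_per_day = total_gb / daily_indexes
ratio = quoted_gb_per_day / average_gb_per_day

print(f"average: {average_gb_per_day:.1f} GB/day")
print(f"quoted rate is {ratio:.1f}x the average")
```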

Rate has increased!

We may have problems fitting onto 5 x 600GB discs!

TimedWebRequest
• Most obvious example of a standard event
• App name
• Environment
• HTTP status
• Page generation time
• Request / Response size
• Can derive loads of metrics from this!
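
A sketch of how such an event could be modelled and serialised. The field names, units and the "type" key are assumptions for illustration, not the schema our Java and Scala layers actually use.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TimedWebRequest:
    """One structured event per web request (field names are hypothetical)."""
    app_name: str
    environment: str
    http_status: int
    page_generation_ms: float
    request_bytes: int
    response_bytes: int

    def to_json(self):
        """Serialise the event as a single JSON document."""
        event = asdict(self)
        event["type"] = "TimedWebRequest"
        return json.dumps(event, sort_keys=True)


event = TimedWebRequest("checkout", "production", 200, 41.5, 512, 20480)
line = event.to_json()
```

Every field is typed and named, so request rate, status rate and timing metrics can be derived without parsing free text.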

statsd
• Rolls up counters and timers into metrics
• One bucket per stat, emits values every 10 seconds
• Counters: Request rate, HTTP status rate
• Timers: Total page time, mean page time, min/max page times
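
The roll-up idea in miniature. This is not statsd's code or its wire protocol; the Rollup class and the metric names are made up, and only the 10 second flush interval comes from the slide. Counters become per-second rates, timers become total / mean / min / max, and each flush empties the buckets.

```python
from collections import defaultdict


class Rollup:
    """One bucket per stat, flushed at a fixed interval (names hypothetical)."""

    def __init__(self, flush_interval=10):
        self.flush_interval = flush_interval
        self.counters = defaultdict(int)
        self.timers = defaultdict(list)

    def count(self, name, value=1):
        self.counters[name] += value

    def timing(self, name, milliseconds):
        self.timers[name].append(milliseconds)

    def flush(self):
        """Turn the buckets into metrics, then reset them."""
        metrics = {}
        for name, total in self.counters.items():
            metrics[f"{name}.rate"] = total / self.flush_interval
        for name, samples in self.timers.items():
            metrics[f"{name}.total"] = sum(samples)
            metrics[f"{name}.mean"] = sum(samples) / len(samples)
            metrics[f"{name}.min"] = min(samples)
            metrics[f"{name}.max"] = max(samples)
        self.counters.clear()
        self.timers.clear()
        return metrics


rollup = Rollup(flush_interval=10)
for status, page_ms in [(200, 10), (200, 20), (500, 60)]:
    rollup.count("requests")
    rollup.count(f"status.{status}")
    rollup.timing("page_time", page_ms)
metrics = rollup.flush()
```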

JSON everywhere

• Legacy shell ftp mirror scripts
• gitolite hooks for deployments
• keepalived health checks

JSON everywhere
# Inner quotes are escaped so that logger receives one line of valid JSON
echo "JSON:{\"nagios_service\":\"${SERVICE}\",\
\"nagios_status\":\"${STATUS_CODE}\",\
\"message\":\"${STATUS_TEXT}\"}" |
logger -t nagios

Alerting use cases:

• Replaced nsca client with standardised log pipeline
• Developers log an event and get (one!) email warning of client side exceptions
• Passive health monitoring - ‘did we log something recently’
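
Two of these are simple enough to sketch. Both pieces are illustrative only: the function and class names are invented, timestamps are plain seconds, and how the real pipeline decides what counts as a repeat alert is not covered here.

```python
def is_silent(last_seen, now, max_age_seconds):
    """Passive check: True if nothing has been logged recently enough."""
    return (now - last_seen) > max_age_seconds


class OnceOnlyAlerter:
    """Send at most one alert per key, however often the event repeats."""

    def __init__(self):
        self.sent = set()

    def alert(self, key):
        """Return True if an alert should go out, False for a repeat."""
        if key in self.sent:
            return False
        self.sent.add(key)
        return True


# A service that last logged 400 seconds ago, with a 300 second allowance.
stale = is_silent(last_seen=1000, now=1400, max_age_seconds=300)

alerter = OnceOnlyAlerter()
decisions = [alerter.alert("client-side exception") for _ in range(3)]
```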

Riemann

• Using for some simple health checking
  • logcollector health
  • Load balancer instance health

Riemann
• Ambitious plans to do more
• Web pool health (>= n nodes)
• Replace statsd
• Transit collectd data via logstash and use it to emit to graphite
• disc usage trending / prediction

Metadata

• It’s all about the metadata
• Structured events are describable
• Common patterns to give standard metrics / alerting for free
• Dashboards!

Dashboard love/hate
• Riemann x 2
• Graphite dashboards x 2
• Nagios x 3
• CI radiator
• Information overload!

Thanks!

• Questions?

• slides with more detail about my log collector code:
  http://slideshare.net/bobtfish/
