Practical logstash - beyond the basics.
Tomas Doran (t0m) <bobtfish@bobtfish.net>
Who are you

• Sysadmin at TIM Group
• t0m on irc.freenode.net
• twitter.com/bobtfish
• github.com/bobtfish
• slideshare.com/bobtfish
Logstash
• I hope you already know what logstash is?
• I’m going to talk about our implementation.
 • Elasticsearch
 • Metrics
 • Nagios
 • Riemann
London devops logging
> 55 million messages a day

• Now ~30GB of indexed data per day
• All our applications
• All of syslog
• Used by developers and product managers
• 2 x DL360s with 8 x 600GB discs, which also host
  the graphite install
About 4 months old

• Almost all apps onboard to various levels
• All of syslog was easy
• Still haven’t done apache logs
• Haven’t comprehensively done routers/switches
• Lots of apps still emit directly to graphite
Java

• All our apps are Java / Scala / Clojure
• https://github.com/tlrx/slf4j-logback-zeromq
• Our own layer (two versions: one Java, one Scala) for
  sending structured events as JSON
• Java developers hate native code
On host log collector

• Need a lightweight log shipper.
• VMs with 1GB of RAM.

• Message::Passing - a Perl library I wrote.
• Small, light, pluggable
On host log collector
• Application to logcollector is ZMQ
 • Small amount of buffering (1000
    messages)
• logcollector to logstash is ZMQ
 • Large amount of buffering (disc offload,
    100s of thousands of messages)
ZeroMQ has the correct semantics
• Pub/Sub sockets
• Never, ever blocking
• Lossy! (If needed)
• Buffer sizes / locations configurable
• Arbitrary message size
• IO done in a background thread (nice in
  interpreted languages - ruby/perl/python)
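
The semantics listed above (the sender never blocks, and messages are dropped once a bounded buffer is full) can be illustrated without ZeroMQ itself. This is a minimal sketch in plain Python, not the ZeroMQ API; the `LossyBuffer` name is hypothetical, and the default limit of 1000 is simply borrowed from the application-side buffer size mentioned earlier.

```python
from collections import deque


class LossyBuffer:
    """Bounded, never-blocking send buffer (hypothetical illustration).

    Mimics the behaviour described on the slide: send() always returns
    immediately, and messages are dropped once the buffer is full.
    """

    def __init__(self, limit=1000):
        self.limit = limit
        self.pending = deque()
        self.dropped = 0

    def send(self, message):
        """Queue a message without blocking; return False if it was dropped."""
        if len(self.pending) >= self.limit:
            self.dropped += 1
            return False
        self.pending.append(message)
        return True

    def drain(self):
        """Hand everything queued so far to the consumer and empty the buffer."""
        messages = list(self.pending)
        self.pending.clear()
        return messages


buf = LossyBuffer(limit=3)
results = [buf.send(f"event {i}") for i in range(5)]
print(results)      # [True, True, True, False, False]
print(buf.dropped)  # 2
```

The application keeps running at full speed whether or not anything is draining the buffer; the only cost of an outage longer than the buffer allows is lost messages.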
What, no AMQP?

• Could go logcollector => AMQP =>
  logstash for extra durability
• ZMQ buffering ‘good enough’
• logstash uses a pure ruby AMQP decoder
• Slooooowwwwww
Reliability

• Multiple Elasticsearch servers (obvious)!
• Due to ZMQ buffering, you can:
 • restart logstash, messages just buffer on
    hosts whilst it’s unavailable
 • restart logcollector, messages from apps
    buffer (lose some syslog)
Reliability: TODO

• Elasticsearch cluster getting sick happens
• In-flight messages in logstash lost :(
• Solution - elasticsearch_river output
 • logstash => durable RabbitMQ queue
 • ES reads from queue
 • Also faster - uses bulk API
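
The bulk API mentioned in the last bullet takes newline-delimited JSON: an action line followed by the document itself, repeated per event, with a trailing newline. A minimal sketch of building such a body follows. The `bulk_body` helper, the index name and the sample events are hypothetical, and the exact messages the river expects on the queue may differ.

```python
import json


def bulk_body(index, events, doc_type=None):
    """Build a newline-delimited body for Elasticsearch's _bulk endpoint.

    Each event becomes two lines: an "index" action, then the document.
    """
    lines = []
    for event in events:
        action = {"_index": index}
        if doc_type is not None:
            # Older Elasticsearch releases expect a _type in the action line.
            action["_type"] = doc_type
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(event))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


events = [{"message": "first"}, {"message": "second"}]
body = bulk_body("logstash-2013.01.01", events)
print(body)
```

Sending many events in one request is where the speed-up comes from: one round trip instead of one per message.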
Redundancy
• Add a UUID to each message at emission
  point.
• Index in elasticsearch by UUID
• Emit to two backend logstash instances
  (TODO)
• Index everything twice! (TODO)
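
Why a UUID per message makes double delivery safe: if the UUID doubles as the document id, indexing the same event twice overwrites it rather than duplicating it. The sketch below uses a plain dict as a stand-in for an index; `emit`, `index_event` and the field names are hypothetical.

```python
import uuid


def emit(message):
    """Attach a UUID at the emission point (field names are illustrative)."""
    return {"uuid": str(uuid.uuid4()), "message": message}


def index_event(index, event):
    """Store the event under its UUID; a repeat delivery overwrites itself."""
    index[event["uuid"]] = event


index = {}  # stand-in for an Elasticsearch index keyed by document id
event = emit("user logged in")

# Simulate the same event arriving via two backend logstash instances.
index_event(index, event)
index_event(index, dict(event))

print(len(index))  # 1
```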
Elasticsearch optimisation
• You need a template
 • compress source
 • disable _all
 • discard unwanted fields from source /
    indexing
 • tweak shards and replicas
• compact yesterday’s index at the end of each
  day!
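
A template along these lines might look like the following, written as a Python dict and serialised to JSON. It follows the legacy template format used by Elasticsearch releases from around the time of this talk (`_source` compression in particular was dropped in later versions). The shard and replica counts and the `raw_payload` field are placeholders, not the values actually used here.

```python
import json

# Legacy-style index template (assumed values; adjust for your version).
template = {
    "template": "logstash-*",          # apply to every daily logstash index
    "settings": {
        "number_of_shards": 2,         # placeholder
        "number_of_replicas": 1,       # placeholder
    },
    "mappings": {
        "_default_": {
            "_all": {"enabled": False},      # disable the catch-all field
            "_source": {"compress": True},   # compress the stored source
            "properties": {
                # Hypothetical field kept in _source but not indexed.
                "raw_payload": {"type": "string", "index": "no"},
            },
        }
    },
}

print(json.dumps(template, indent=2))
```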
Elasticsearch size
• 87 daily indexes
• 800GB of data (per instance)
• Just bumped ES heap to 22GB
 • Just writing data - 2GB
 • Query over all indexes - 17GB!
• Hang on - 800/87 does not = 33GB/day!
Rate has increased!

We may have problems fitting onto 5 x 600GB discs!
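
The arithmetic behind these two slides, as a quick check. The 3000GB figure assumes the full raw capacity of five 600GB discs is usable and that the rate stays at the 33GB/day quoted above; both are simplifications.

```python
total_gb = 800        # data held per instance
daily_indexes = 87    # one index per day
current_rate_gb = 33  # daily rate quoted on the slide

average_rate_gb = total_gb / daily_indexes
print(f"average over {daily_indexes} days: {average_rate_gb:.1f} GB/day")

raw_capacity_gb = 5 * 600
days_to_fill = raw_capacity_gb / current_rate_gb
print(f"days of retention at {current_rate_gb} GB/day: {days_to_fill:.0f}")
```

The historical average comes out at roughly 9GB/day, well under the current 33GB/day, which is how the growth shows up. At the current rate about 91 daily indexes would fit in 3000GB.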
Standard log message
Standard event message
TimedWebRequest
• Most obvious example of a standard event
 • App name
 • Environment
 • HTTP status
 • Page generation time
 • Request / Response size
• Can derive loads of metrics from this!
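
As an illustration of what such a standard event could look like on the wire, here is a minimal sketch. The class layout, field names, units and the `event_type` tag are all assumptions; the talk only lists which pieces of information the event carries.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TimedWebRequest:
    """Hypothetical shape for the standard event; field names are assumed."""
    app_name: str
    environment: str
    http_status: int
    page_generation_ms: float
    request_bytes: int
    response_bytes: int

    def to_json(self):
        """Serialise the event, tagging it with its type for downstream routing."""
        payload = asdict(self)
        payload["event_type"] = type(self).__name__
        return json.dumps(payload, sort_keys=True)


event = TimedWebRequest("shop", "production", 200, 41.5, 512, 20480)
print(event.to_json())
```

Because every application emits the same fields, request rates, status-code rates and page timings can all be derived downstream without per-application parsing.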
statsd
• Rolls up counters and timers into metrics
• One bucket per stat, emits values every 10
  seconds
• Counters: Request rate, HTTP status rate
• Timers: Total page time, mean page time,
  min/max page times
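
The roll-up described above can be sketched in a few lines. This is an illustration of the idea only, not statsd's implementation or its metric naming; the `Bucket` class and the `.rate` / `.mean` suffixes are hypothetical.

```python
from collections import defaultdict


class Bucket:
    """One flush interval's worth of counters and timers (illustrative)."""

    def __init__(self, interval_seconds=10):
        self.interval_seconds = interval_seconds
        self.counters = defaultdict(int)
        self.timers = defaultdict(list)

    def count(self, stat, value=1):
        self.counters[stat] += value

    def timing(self, stat, milliseconds):
        self.timers[stat].append(milliseconds)

    def flush(self):
        """Return the rolled-up metrics and reset for the next interval."""
        metrics = {}
        for stat, total in self.counters.items():
            metrics[f"{stat}.rate"] = total / self.interval_seconds
        for stat, samples in self.timers.items():
            metrics[f"{stat}.total"] = sum(samples)
            metrics[f"{stat}.mean"] = sum(samples) / len(samples)
            metrics[f"{stat}.min"] = min(samples)
            metrics[f"{stat}.max"] = max(samples)
        self.counters.clear()
        self.timers.clear()
        return metrics


bucket = Bucket()
for status, millis in [(200, 30.0), (200, 50.0), (500, 100.0)]:
    bucket.count("requests")
    bucket.count(f"status.{status}")
    bucket.timing("page_time", millis)

metrics = bucket.flush()
print(metrics["requests.rate"])   # 0.3
print(metrics["page_time.mean"])  # 60.0
```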
JSON everywhere

• Legacy shell ftp mirror scripts
• gitolite hooks for deployments
• keepalived health checks
JSON everywhere
echo "JSON:{\"nagios_service\":\"${SERVICE}\",\
\"nagios_status\":\"${STATUS_CODE}\",\
\"message\":\"${STATUS_TEXT}\"}" |
 logger -t nagios
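
On the receiving side, a `JSON:` prefix like the one above gives a cheap way to tell structured messages from ordinary syslog text. The sketch below shows one way a consumer could do that; the `parse_syslog_message` helper and its fallback behaviour are assumptions, not the logstash configuration used in this setup.

```python
import json

PREFIX = "JSON:"


def parse_syslog_message(text):
    """Decode a 'JSON:'-prefixed message, else wrap the raw text."""
    if text.startswith(PREFIX):
        try:
            decoded = json.loads(text[len(PREFIX):])
            if isinstance(decoded, dict):
                return decoded
        except ValueError:
            pass  # malformed JSON: fall through and keep the raw text
    return {"message": text}


line = 'JSON:{"nagios_service":"disk","nagios_status":"2","message":"disk full"}'
print(parse_syslog_message(line)["nagios_service"])  # disk
print(parse_syslog_message("plain text"))            # {'message': 'plain text'}
```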
Alerting
Alerting use cases:

• Replaced nsca client with standardised log
  pipeline
• Developers log an event and get (one!)
  email warning of client side exceptions
• Passive health monitoring - ‘did we log
  something recently’
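
The passive check ("did we log something recently") and the one-alert-only behaviour can be sketched together. This is an illustrative stand-in, not the alerting code used here; `HeartbeatMonitor`, the 300-second threshold and the host names are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track when each source last logged (hypothetical helper)."""

    def __init__(self, max_silence_seconds):
        self.max_silence_seconds = max_silence_seconds
        self.last_seen = {}
        self.alerted = set()

    def record(self, source, now=None):
        """Note that a source has just logged something."""
        self.last_seen[source] = time.time() if now is None else now
        self.alerted.discard(source)  # it is healthy again

    def newly_silent(self, now=None):
        """Return sources that have gone quiet, each reported only once."""
        now = time.time() if now is None else now
        silent = [
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_silence_seconds
            and source not in self.alerted
        ]
        self.alerted.update(silent)
        return silent


monitor = HeartbeatMonitor(max_silence_seconds=300)
monitor.record("web01", now=0)
monitor.record("web02", now=0)
monitor.record("web02", now=400)

print(monitor.newly_silent(now=450))  # ['web01']
print(monitor.newly_silent(now=460))  # []
```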
Riemann

• Using for some simple health checking
 • logcollector health
 • Load balancer instance health
Riemann
• Ambitious plans to do more
 • Web pool health (>= n nodes)
 • Replace statsd
 • Transit collectd data via logstash and
    use to emit to graphite
  • disc usage trending / prediction
Metadata

• It’s all about the metadata
• Structured events are describable
• Common patterns to give standard
  metrics / alerting for free
• Dashboards!
Dashboard love/hate
• Riemann x 2
• Graphite dashboards x 2
• Nagios x 3
• CI radiator

• Information overload!
Thanks!

• Questions?

• Slides with more detail about my log
  collector code:
  • http://slideshare.net/bobtfish/

Editor's Notes

  • #20-#25: The last point here is most important - ZMQ networking works entirely in a background thread that Perl knows nothing about, which means that you can asynchronously ship messages with no changes to your existing codebase.