Nighthawk
Distributed caching with Redis @ Twitter
Rashmi Ramesh
@rashmi_ur
Agenda
What is Nighthawk?
How does it work?
Scaling out
High availability
Current challenges
Nighthawk - cache-as-a-service
Runs Redis at its core
> 10M QPS
Largest cluster runs ~3K Redis nodes
> 10TB of data
Who uses Nighthawk?
Some of our biggest customers:
Analytics services - Ads, Video
Ad serving
Ad Exchange
Direct Messaging
Mobile app conversion tracking
Design Goals
Scalable: scale vertically and horizontally
Elastic: add / remove instances without violating SLA
High throughput and low latencies
High availability in the event of machine failures
Topology agnostic client
Nighthawk Architecture
[Diagram: Client -> Proxy/Routing layer -> Backend 0 ... Backend N, each backend running Redis 0 ... Redis N. Topology and Cluster manager sit alongside the routing layer and the backends.]
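The request path in the architecture slide (hash the key, find its partition, look up the Redis node for that partition) can be sketched as follows. This is a minimal illustration, not Nighthawk's code: the hash function, the partition count, the host names and the `on_topology_change` hook are all assumptions made for the example.

```python
import hashlib

NUM_PARTITIONS = 100  # assumed; the scaling slides use 100 partitions as an example

# Hypothetical topology snapshot held by the proxy: partition id -> (host, port).
topology = {p: ("backend-%d" % (p % 3), 6379 + p % 2) for p in range(NUM_PARTITIONS)}


def partition_for(key: str) -> int:
    """Hash the key to a stable partition id (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def route(key: str) -> tuple:
    """Return the Redis node that should receive a request for this key."""
    return topology[partition_for(key)]


def on_topology_change(partition: int, node: tuple) -> None:
    """Apply a topology update: only the affected partition's mapping changes."""
    topology[partition] = node


print(partition_for("user:42"), route("user:42"))
```

Because the key-to-partition hash never changes, a topology update only has to rewrite one entry in the partition-to-node map, which is what lets the client stay unaware of the topology.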
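Cache backend
[Diagram: a Mesos container running Redis nodes 1, 2, 3, ... together with a topology watcher and announcer, connected to the Topology and to the Proxy/Router.]
Announced to the topology:
Redis1(dc, host, port1, capacity)
Redis2(dc, host, port2, capacity)
Redis3(dc, host, port3, capacity)
Mapping seen by the Proxy/Router:
Replica 1 -> Redis1
Replica 2 -> Redis2
Replica 3 -> Redis3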
Cluster manager
Manages topology membership and changes
- (Re)balances replicas
- Reacts to topology changes, e.g. a dead node
- Replicated cache: ensures the 2 replicas of the same partition are on separate failure domains
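One way to picture the cluster manager's placement rules (capacity-aware balancing, and replicas of a partition on separate failure domains) is a greedy assignment. The sketch below is an assumption about how such logic could look, not the actual cluster manager; `Node`, `assign` and the rack names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    rack: str       # failure domain (assumed to be a rack here)
    capacity: int   # relative weight of the node
    replicas: list = field(default_factory=list)


def assign(partitions, nodes, rf=2):
    """Place rf replicas per partition on the least-loaded nodes in distinct racks."""
    for p in partitions:
        used_racks = set()
        for _ in range(rf):
            candidates = [n for n in nodes if n.rack not in used_racks]
            if not candidates:
                raise ValueError("not enough failure domains for partition %r" % p)
            # Load is measured relative to capacity, so bigger nodes take more replicas.
            target = min(candidates, key=lambda n: (len(n.replicas) / n.capacity, n.name))
            target.replicas.append(p)
            used_racks.add(target.rack)
    return {n.name: n.replicas for n in nodes}


nodes = [Node("a", "rack1", 1), Node("b", "rack2", 1), Node("c", "rack3", 2)]
placement = assign(range(8), nodes)
print({name: len(ps) for name, ps in placement.items()})
```

The rack constraint can prevent a perfectly proportional split, which is why a real balancer would likely treat the failure-domain rule as hard and the capacity weighting as best effort.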
Redis databases for partitions
Partition -> Redis DB
Granular key remapping
Logical data isolation
Enumerating - redis db scan
Deletion - flushdb
Enables replica rehydration
[Diagram: keys K1 to K4 grouped into Partition X and Partition Y, which map to Redis DBs 1 and 2.]
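The partition-to-DB idea can be illustrated with a small in-memory model. `FakeRedisNode` and `rehydrate` are hypothetical stand-ins written for this sketch; against a real server the same steps would correspond to the Redis commands SELECT, SCAN and FLUSHDB.

```python
class FakeRedisNode:
    """In-memory stand-in for one Redis process with numbered databases."""

    def __init__(self):
        self.dbs = {}  # db index -> {key: value}

    def set(self, db, key, value):
        self.dbs.setdefault(db, {})[key] = value

    def get(self, db, key):
        return self.dbs.get(db, {}).get(key)

    def scan(self, db):
        """Enumerate one partition (analogous to SELECT db, then SCAN)."""
        return sorted(self.dbs.get(db, {}))

    def flushdb(self, db):
        """Drop one partition without touching the others (analogous to FLUSHDB)."""
        self.dbs.pop(db, None)


def rehydrate(src, src_db, dst, dst_db):
    """Copy every key of one partition into a DB on another node."""
    for key in src.scan(src_db):
        dst.set(dst_db, key, src.get(src_db, key))


a, b = FakeRedisNode(), FakeRedisNode()
a.set(1, "K1", "v1")
a.set(1, "K2", "v2")
a.set(2, "K3", "v3")
rehydrate(a, 1, b, 5)   # partition in DB 1 on node a is copied to DB 5 on node b
a.flushdb(1)            # the old copy is dropped in one step
print(a.scan(1), a.scan(2), b.scan(5))
```

Keeping each partition in its own DB means enumeration and deletion never have to filter keys, since the DB boundary already isolates the partition.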
Scaling
Scaling out with Client/Proxy managed partitioning
Key count: 1.5M keys
[Diagram: client in front of 3 cache nodes holding 500K keys each.]
Scaling out with Client/Proxy managed partitioning
Key count: 1.5M keys
Remapped keys: 600K
[Diagram: client in front of 5 cache nodes holding 300K keys each, backed by persistent storage.]
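The 600K figure follows from the slide's own numbers: each of the 3 original nodes goes from 500K to 300K keys, so 3 x 200K keys change owner at once. The short calculation below reproduces it and compares it with the size of a single partition in the cluster-manager scheme (100 partitions, from the later slides). The function name is made up for the example, and the formula assumes an even re-split in which each old node keeps as many of its own keys as possible.

```python
def remapped_keys_even_resplit(total_keys, old_nodes, new_nodes):
    """Keys that change owner when keys are re-split evenly across more nodes.

    Assumes each old node keeps as many of its own keys as it can.
    """
    old_share = total_keys // old_nodes
    new_share = total_keys // new_nodes
    return old_nodes * (old_share - new_share)


TOTAL_KEYS = 1_500_000

remapped = remapped_keys_even_resplit(TOTAL_KEYS, old_nodes=3, new_nodes=5)
remapped_fraction = remapped / TOTAL_KEYS

PARTITIONS = 100
keys_per_partition = TOTAL_KEYS // PARTITIONS
one_move_fraction = keys_per_partition / TOTAL_KEYS

print(remapped, remapped_fraction)            # keys remapped in one step
print(keys_per_partition, one_move_fraction)  # keys affected by one partition move
```

With client-managed partitioning, 40% of the keyspace misses at the same moment; moving one partition at a time exposes about 1% of the keyspace per step.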
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/Partition: 15K
[Diagram: client, proxy, topology and cluster manager, persistent storage; 3 cache nodes holding 500K keys each.]
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/Partition: 15K
[Diagram: client, proxy, topology and cluster manager, persistent storage; cache nodes holding 500K, 500K and 485K keys, plus a new node holding 15K.]
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/Partition: 15K
[Diagram: client, proxy, topology and cluster manager, persistent storage; cache nodes holding 500K, 485K and 485K keys, plus two new nodes holding 15K each.]
Scaling out with Cluster manager - Post balancing
Key count: 1.5M keys
Partition count: 100
Post balancing...
[Diagram: client, proxy, topology and cluster manager, persistent storage; cache nodes holding 250K, 250K, 250K, 250K and 500K keys.]
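The slide sequence above moves one 15K-key partition at a time until the cluster is balanced. A rough sketch of such a planner is shown below; `plan_moves` and `in_rounds` are hypothetical helpers, and the stopping rule and equal capacities are assumptions for the example rather than a description of the real cluster manager.

```python
def plan_moves(assignment, capacity):
    """Plan single-partition moves until every node holds its capacity-weighted share."""
    total = sum(len(ps) for ps in assignment.values())
    weight = sum(capacity.values())
    target = {n: total * capacity[n] / weight for n in assignment}
    moves = []
    while True:
        # Most over-target node gives a partition to the most under-target node.
        src = max(assignment, key=lambda n: len(assignment[n]) - target[n])
        dst = min(assignment, key=lambda n: len(assignment[n]) - target[n])
        gap = (len(assignment[src]) - target[src]) - (len(assignment[dst]) - target[dst])
        if gap <= 1:  # one more move would no longer reduce the imbalance
            break
        partition = assignment[src].pop()
        assignment[dst].append(partition)
        moves.append((partition, src, dst))
    return moves


def in_rounds(moves, per_round):
    """Rate limit the plan: release at most per_round partition moves at a time."""
    for start in range(0, len(moves), per_round):
        yield moves[start:start + per_round]


# 100 partitions on 3 nodes, then 2 empty nodes of equal capacity join.
assignment = {
    "n1": list(range(0, 34)),
    "n2": list(range(34, 67)),
    "n3": list(range(67, 100)),
    "n4": [],
    "n5": [],
}
capacity = {name: 1 for name in assignment}
moves = plan_moves(assignment, capacity)
rounds = list(in_rounds(moves, per_round=5))
print(len(moves), len(rounds), {n: len(ps) for n, ps in assignment.items()})
```

In this example the total data moved is the same as in the client-managed case (40 partitions of 15K keys), but it is spread over many small steps, so the persistent store only absorbs the misses of a few partitions at a time.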
Advantages over Client managed partitioning
- Thin client - simple and oblivious to topology
- Clients, proxy layer and backends scale independently
- Pluggable custom load balancing logic through cluster manager
- No cluster downtime during scaling out/up/back
High Availability
High Availability with Replication
Synchronous, best effort
RF = 2, Intra DC
Supports idempotent operations only - get, put, remove, count, scan
Copies of a partition are never on the same host or rack
Passive warming for failed/restarted replicas
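The replication behaviour listed above (writes go to both copies, reads come from a serving copy, and a restarted replica warms passively from new writes) can be modelled in a few lines. This is a simplified sketch under stated assumptions: the class names are invented, replicas are plain dictionaries, and "best effort" is interpreted as "the write succeeds if at least one SERVING replica accepted it".

```python
SERVING, WARMING, FAILED = "SERVING", "WARMING", "FAILED"


class Replica:
    def __init__(self, name, state=SERVING):
        self.name = name
        self.state = state
        self.data = {}


class ReplicatedPartition:
    """One partition with RF = 2, as seen by the proxy (hypothetical model)."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        """Write synchronously to every replica that is up; WARMING ones fill passively."""
        acked = False
        for replica in self.replicas:
            if replica.state == FAILED:
                continue
            replica.data[key] = value
            acked = acked or replica.state == SERVING
        return acked

    def get(self, key):
        """Read only from SERVING replicas; a WARMING replica may be incomplete."""
        for replica in self.replicas:
            if replica.state == SERVING and key in replica.data:
                return replica.data[key]
        return None  # cache miss


a, b = Replica("A"), Replica("B")
partition = ReplicatedPartition([a, b])
partition.put("k1", "v1")

b.state = FAILED                                # one copy dies, reads still succeed
partition.replicas[1] = Replica("B*", WARMING)  # replacement starts empty
partition.put("k2", "v2")                       # new writes warm it passively
print(partition.get("k1"), partition.replicas[1].data)
```

Because `put` simply overwrites, replaying it after a partial failure is harmless, which is the reason only idempotent operations fit this scheme.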
High Availability with Replication
[Diagram: the client issues GetKey for partition 5 through the proxy/routing layer, which consults the topology and cluster manager. Pool A: Backend 0 (partitions 2, 5, 9), SERVING. Pool B: Backend N (partitions 12, 5, 10) goes from SERVING to FAILED and is replaced by Backend N* (partitions 12, 5, 10) in WARMING, which receives SetKey for partition 5.]
Current challenges
Remember this?
The most retweeted Tweet of 2014!
Hot key symptom
Significantly high QPS to a single cache server
Hot Key Mitigation
Server side diagnostics:
Sampling a small % of requests and logging
Post processing the logs to identify high frequency keys
Client side solution:
Client side hot key detection and caching
Better to have:
Redis tracks the hot keys
Protocol support to send feedback to client if a key is hot
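The server-side diagnostic described above (sample a small percentage of requests, then post-process the log for high-frequency keys) could look roughly like this. The function names, the sample rate and the "share of sampled traffic" threshold are assumptions chosen for illustration.

```python
import random
from collections import Counter


def sample_requests(keys, sample_rate, rng):
    """Keep roughly sample_rate of the request stream, as a request log would."""
    return [key for key in keys if rng.random() < sample_rate]


def hot_keys(sampled, min_share):
    """Return keys that account for at least min_share of the sampled requests."""
    counts = Counter(sampled)
    total = sum(counts.values())
    if total == 0:
        return []
    return [key for key, count in counts.most_common() if count / total >= min_share]


# Synthetic traffic: one key receives 90% of 10,000 requests.
traffic = ["tweet:hot"] * 9000 + ["tweet:%d" % i for i in range(1000)]
rng = random.Random(7)
rng.shuffle(traffic)

sampled = sample_requests(traffic, sample_rate=0.1, rng=rng)
print(hot_keys(sampled, min_share=0.5))
```

Sampling keeps the logging overhead low, at the cost of only catching keys that are hot enough to stand out in a small sample.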
Active warming of replicas
[Diagram: client, proxy/routing layer, topology and cluster manager. Pool A: Backend A (partitions 2, 5, 9), SERVING. Pool B: Backend B* (partitions 12, 5, 10), WARMING. A bootstrapper writes to the warming backend.]
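A bootstrapper of the kind shown in the diagram could copy keys from the serving replica to the warming one in throttled batches. The sketch below models both replicas as dictionaries; `bootstrap`, its parameters and the rule of not overwriting keys already written by live traffic are assumptions for the example.

```python
import time


def bootstrap(serving, warming, keys_per_second, batch_size=100, sleep=time.sleep):
    """Copy keys from a serving replica to a warming one at a bounded rate."""
    copied = 0
    keys = list(serving)
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        for key in batch:
            # Live writes also reach the warming replica; keep them, they are newer.
            if key not in warming:
                warming[key] = serving[key]
                copied += 1
        # Pause so the copy never exceeds keys_per_second on average.
        sleep(len(batch) / keys_per_second)
    return copied


serving = {"k%d" % i: "old" for i in range(250)}
warming = {"k0": "fresh"}  # already written by production traffic
pauses = []                # record the pauses instead of really sleeping
copied = bootstrap(serving, warming, keys_per_second=1000, sleep=pauses.append)
print(copied, len(warming), pauses)
```

Throttling the copy trades a longer warm-up for predictable latency on the serving replica, which still carries all production reads while its peer warms.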
Questions?

RedisConf17- Using Redis at scale @ Twitter

Editor's Notes

  • #4: Each major service gets its own cache cluster. There are 2 modes of operation: replicated and non-replicated.
  • #5: Analytics services (Ads, Video): ad engagement analytics and video ad engagement analytics. Mobile app conversion tracking: tracks conversions such as promoted app installs, in-app purchases and signups. Ad serving: performs ad matching, scoring and serving. Ad Exchange: real-time bidding for ads. DM: direct messaging. Interaction metrics service: provides different types of engagement metrics by tweet or by user.
  • #7: The routing layer subscribes to topology changes and updates its current mapping of partitions to Redis nodes. For every request, it hashes the key to find the partition the key belongs to, looks up the Redis node that partition is mapped to, and forwards the request to that node. Each backend can run 1 or more Redis instances. Since Redis is single threaded, a backend can run more than 1 Redis to increase throughput per container and fully utilize the resources allocated to the container, such as bandwidth, CPU and RAM. The backends also have a topology component that announces the currently running Redis nodes. The cluster manager is in charge of creating partitions and managing topology. It is responsible for balancing replicas of partitions evenly across nodes, ensuring that replicas of the same partition are not down at the same time during managed data movement, and ensuring that dead nodes are removed from the topology after the partitions assigned to them have been successfully reassigned to currently available nodes. It also rate limits data movement from current nodes to newly joined nodes, so that clients don’t see a huge number of cache misses as soon as the cluster is expanded. Trade-off: an additional hop through the proxy layer, in exchange for a topology-agnostic client.
  • #8: Backends run in Mesos containers, each of which can have 1 or more Redis instances. The number of Redis nodes per container is bound by server resources, the amount of data to be stored and the data density per node. The backend announces information about its running Redis instances to the topology: DC, host, port, device type, capacity … The capacity of a node, also referred to as its weight, is how much data it can store. The backend watches and reacts to topology changes, such as a new replica assigned to a local Redis or a replica moving to a remote Redis.
  • #9: Manages all the participants in the topology and maintains the sanity of the cluster. Ensures every partition has a replica residing on an available node. Balances replicas/partitions across the nodes of the cluster. If nodes have different capacities, the number of replicas assigned to each node is proportional to its capacity.
  • #10: The unit of data movement is much smaller: moving 1/N of the keys in a Redis vs. moving one DB in a Redis. Moving a replica/partition means dropping all keys in a DB in one Redis and remapping the keys to another DB in another Redis.
  • #13: Adding new nodes right away causes Count(Keys)/Count(Nodes) keys to get remapped. Requests for those keys see cache misses and hit the persistent storage hard. If proper checks and balances exist, persistent storage will rate limit the requests, or just serve them with higher latencies and degraded throughput. In either case, clients will see errors and hit timeouts, and so suffer success rate degradation. There is no intelligent balancing if there is a higher-config Redis node, unless you have some sort of balancing logic inside the client. What an overload!
  • #14: If proxy layer is the bottleneck, you can add more proxy instances. If backends are the bottleneck, you can add more backends.
  • #16: Your persistent storage and the storage team will thank you for rate limiting how much traffic you send to it.
  • #17: State of the partitioning at the end of balancing.
  • #18: Topology schemes: you could use ZK in combination with consistent hashing, maintain a changelog to store topology, or move to a totally different method for representing and storing topology. Clients don’t need to know about it. Clients don’t have to worry about the replication factor or how replication happens. New administrative workflows can be added, such as automating rolling restarts, node maintenance and migration, with the help of the CM.
  • #19: Why use replication? Data analytics pipeline: needs to store real-time data with a relatively short lifetime (until batch jobs catch up), and computations are expensive to redo on a cache miss. User session data for the current day: the data has a lifetime of a day and is expensive to store in a persistent key-value store at the desired latency/throughput requirements. Replication serves business goals for half the cost with better latencies.
  • #20: Trade-offs: RF > 2 adds to latency and cost. Non-idempotent operations (incr/decr) are not supported.
  • #21: Show writes when both are serving.
  • #23: Hot keys: Ellen’s tweet is a classic example of how a popular key snowballs into a hot key, that is, a key that gets a disproportionately high QPS. It manifests as a very busy cache server, which slows the server down further, can saturate bandwidth if the value is large, and can result in packet drops and client-side timeouts.
  • #26: Goal: quickly re-populate a warming replica using a serving copy. Easy solution: do nothing and rely on organic population of data on writes. A better solution: read data from a serving replica and write it to the warming replica, rate limiting the copy so that it does not impact production traffic latency and throughput.