Building a Real-Time Data Pipeline:
Apache Kafka at LinkedIn
Hadoop Summit 2013
Joel Koshy
June 2013
LinkedIn Corporation ©2013 All Rights Reserved
Network update stream
We have a lot of data.
We want to leverage this data to build products.
Data pipeline
People you may know
System and application metrics/logging
How do we integrate this variety of data
and make it available to all these systems?
Point-to-point pipelines
LinkedIn’s user activity data pipeline (circa 2010)
Point-to-point pipelines
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Central data pipeline
First attempt: don’t re-invent the wheel
Second attempt: re-invent the wheel!
Use a central commit log
What is a commit log?
The log as a messaging system
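The idea of a log acting as a messaging system can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's implementation: the class names `CommitLog` and `Consumer` and the sample event names are hypothetical. It shows the two properties that matter here: records are only ever appended and are addressed by a sequential offset, and each consumer keeps its own offset, so a slow reader does not hold back a fast one.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Append-only sequence of records; each record gets a sequential offset."""
    records: list = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Append a record and return the offset it was written at."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Tracks its own position in the log, independently of other consumers."""
    log: CommitLog
    offset: int = 0

    def poll(self, max_records: int = 10) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only past what was actually read
        return batch


log = CommitLog()
for event in (b"page_view", b"search", b"profile_edit"):  # hypothetical events
    log.append(event)

fast = Consumer(log)
slow = Consumer(log)
fast_batch = fast.poll()               # reads everything currently in the log
slow_batch = slow.poll(max_records=1)  # reads one record and stays behind
```

Because reading never removes anything from the log, adding another consumer costs the producer nothing.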
Apache Kafka
Usage at LinkedIn
• 16 brokers in each cluster
• 28 billion messages/day
• Peak rates
  – Writes: 460,000 messages/second
  – Reads: 2,300,000 messages/second
• ~700 topics
• 40-50 live services consuming user-activity data
• Many ad hoc consumers
• Every production service is a producer (for metrics)
• 10k connections/colo
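A quick back-of-the-envelope check puts the usage figures in proportion. It assumes the 28 billion/day figure counts messages written, which the slide does not state, so the average is only indicative. The read-to-write ratio uses the two peak rates as given.

```python
MESSAGES_PER_DAY = 28_000_000_000  # from the slide; assumed to count writes
PEAK_WRITES_PER_SEC = 460_000      # from the slide
PEAK_READS_PER_SEC = 2_300_000     # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate implied by the daily total.
avg_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY

# How many times each written message is read at peak.
read_fanout = PEAK_READS_PER_SEC / PEAK_WRITES_PER_SEC

print(f"average: {avg_per_sec:,.0f} messages/second")
print(f"peak read fan-out: {read_fanout:.1f}x")
```

Under that assumption the average works out to roughly 324,000 messages/second, and at peak each message is read about five times, which fits a pipeline with many independent consumers per topic.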
Usage at LinkedIn
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Standardize on Avro in data pipeline
{
  "type": "record",
  "name": "URIValidationRequestEvent",
  "namespace": "com.linkedin.event.usv",
  "fields": [
    {
      "name": "header",
      "type": {
        "type": "record",
        "name": "TrackingEventHeader",
        "namespace": "com.linkedin.event",
        "fields": [
          {
            "name": "memberId",
            "type": "int",
            "doc": "The member id of the user initiating the action"
          },
          {
            "name": "timeMs",
            "type": "long",
            "doc": "The time of the event"
          },
          {
            "name": "host",
            "type": "string",
            ...
...
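To illustrate what pushing data cleanliness upstream could look like, the sketch below checks a record against a schema before it is published. It is an assumption-laden illustration, not LinkedIn's tooling: `HEADER_SCHEMA` is a reduced, hypothetical version of the header above containing only the three visible fields, `validation_errors` is a hypothetical helper, and the checks cover only the Avro primitive types `int` (32-bit signed), `long` (64-bit signed), and `string`. A real producer would normally rely on an Avro library instead.

```python
import json

# Reduced, hypothetical schema: only the three fields visible on the slide.
HEADER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "TrackingEventHeader",
  "namespace": "com.linkedin.event",
  "fields": [
    {"name": "memberId", "type": "int"},
    {"name": "timeMs", "type": "long"},
    {"name": "host", "type": "string"}
  ]
}
""")


def _is_int_in(value, bits: int) -> bool:
    """True for a non-bool integer that fits in a signed field of `bits` bits."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return -2 ** (bits - 1) <= value < 2 ** (bits - 1)


# Value checks for the Avro primitive types used in the schema above.
PRIMITIVE_CHECKS = {
    "int": lambda v: _is_int_in(v, 32),
    "long": lambda v: _is_int_in(v, 64),
    "string": lambda v: isinstance(v, str),
}


def validation_errors(schema: dict, record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for spec in schema["fields"]:
        name, avro_type = spec["name"], spec["type"]
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not PRIMITIVE_CHECKS[avro_type](record[name]):
            errors.append(f"field {name} is not a valid {avro_type}")
    return errors


# Hypothetical sample records.
good = {"memberId": 42, "timeMs": 1372360000000, "host": "app01"}
bad = {"memberId": "42", "timeMs": 1372360000000}
good_errors = validation_errors(HEADER_SCHEMA, good)
bad_errors = validation_errors(HEADER_SCHEMA, bad)
```

Rejecting the second record at the producer means no downstream consumer has to guess whether `memberId` is a number or a string.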
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Hadoop data load (Camus)
• Open sourced:
  – https://github.com/linkedin/camus
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
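One reading of "O(1) ETL" is that a single generic job handles every event type, so adding a topic needs no new loading code. The sketch below illustrates that idea only. It is not Camus's code: the function names, the topic names, and the hourly directory layout under `/data/events` are all hypothetical.

```python
from datetime import datetime, timezone


def output_path(topic: str, time_ms: int, root: str = "/data/events") -> str:
    """Map an event to an hourly directory (hypothetical layout)."""
    ts = datetime.fromtimestamp(time_ms / 1000, tz=timezone.utc)
    return f"{root}/{topic}/" + ts.strftime("%Y/%m/%d/%H")


def load_all(events_by_topic: dict) -> dict:
    """One generic pass over every topic; returns path -> list of events."""
    files = {}
    for topic, events in events_by_topic.items():
        for event in events:
            path = output_path(topic, event["timeMs"])
            files.setdefault(path, []).append(event)
    return files


# Hypothetical batch: two topics, no per-topic code anywhere above.
batch = {
    "PageViewEvent": [{"timeMs": 0}, {"timeMs": 3_600_000}],
    "SearchEvent": [{"timeMs": 0}],
}
files = load_all(batch)
```

The loader only relies on a field every event shares (here `timeMs` from the common header), which is what makes a single job for all topics feasible.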
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
Does it work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
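One way an audit trail could provide evidence that all published messages reach their consumers is to count messages per topic and time window at both ends and compare the counts. The sketch below assumes that approach. The 10-minute window, the function names, and the sample topics are hypothetical, and the slide does not describe the actual mechanism.

```python
from collections import Counter

WINDOW_MS = 10 * 60 * 1000  # hypothetical 10-minute audit window


def count_by_window(events: list) -> Counter:
    """Count events per (topic, window start), using each event's own timestamp."""
    counts = Counter()
    for topic, time_ms in events:
        window_start = time_ms - time_ms % WINDOW_MS
        counts[(topic, window_start)] += 1
    return counts


def completeness(produced: Counter, consumed: Counter) -> dict:
    """Fraction of produced messages seen downstream, per (topic, window)."""
    return {key: consumed[key] / n for key, n in produced.items()}


# Hypothetical traffic: one PageViewEvent is lost before reaching the consumer.
produced = count_by_window([
    ("PageViewEvent", 1_000), ("PageViewEvent", 2_000),
    ("PageViewEvent", 700_000), ("SearchEvent", 5_000),
])
consumed = count_by_window([
    ("PageViewEvent", 1_000), ("PageViewEvent", 700_000),
    ("SearchEvent", 5_000),
])
report = completeness(produced, consumed)
```

Bucketing by the event's own timestamp rather than arrival time lets the two ends agree on which window a message belongs to, even when delivery is delayed.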
Kafka replication (0.8)
• Intra-cluster replication feature
  – Facilitates high availability and durability
• Beta release available
  https://dist.apache.org/repos/dist/release/kafka/
• Rolled out in production at LinkedIn last week
Join us at our user-group meeting tonight @ LinkedIn!
– Thursday, June 27, 7.30pm to 9.30pm
– 2025 Stierlin Ct., Mountain View, CA
– http://www.meetup.com/http-kafka-apache-org/events/125887332/
– Presentations (replication overview and use-case studies) from:
  • RichRelevance
  • Netflix
  • Square
  • LinkedIn