Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

Jul 2, 2013

The document discusses LinkedIn's implementation of a real-time data pipeline using Apache Kafka, emphasizing the need to leverage large volumes of data for product development. Key strategies include using a central data pipeline, enforcing data cleanliness, optimizing ETL processes, and ensuring evidence-based correctness. It details Kafka's performance at LinkedIn, reporting billions of messages processed daily across numerous services.


Building a Real-Time Data Pipeline:
Apache Kafka at LinkedIn
Hadoop Summit 2013
Joel Koshy
June 2013
LinkedIn Corporation ©2013 All Rights Reserved

Network update stream

We have a lot of data.
We want to leverage this data to build products.
Data pipeline

System and application metrics/logging

How do we integrate this variety of data
and make it available to all these systems?

Point-to-point pipelines

LinkedIn’s user activity data pipeline (circa 2010)

Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Central data pipeline
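
One common way to motivate a central pipeline is a simple link count, sketched below in Python. This is an illustrative assumption rather than something stated on the slides: with point-to-point integration every source may need its own feed to every destination, while with a central pipeline each system connects once. The function names and example sizes are hypothetical.

```python
def point_to_point_links(sources: int, destinations: int) -> int:
    """Worst case: every source feeds every destination directly."""
    return sources * destinations


def central_pipeline_links(sources: int, destinations: int) -> int:
    """Each system connects once, to the central pipeline only."""
    return sources + destinations


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show how the two counts diverge.
    for n in (2, 5, 10):
        print(n, point_to_point_links(n, n), central_pipeline_links(n, n))
```

The first count grows with the product of the two sides and the second with their sum, which is why adding one more system is cheap only in the central design.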

First attempt: don’t re-invent the wheel

Second attempt: re-invent the wheel!

Use a central commit log

What is a commit log?

The log as a messaging system
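
The idea of a log used as a messaging system can be sketched with a toy, in-memory model. This is not Kafka's implementation or API; the class and method names are hypothetical and chosen only for illustration. Records are appended in order and addressed by offset, and each consumer tracks its own position in the log.

```python
class CommitLog:
    """Append-only sequence of records, each addressed by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads the log sequentially, remembering only its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    log = CommitLog()
    for event in ("a", "b", "c"):
        log.append(event)
    fast, slow = Consumer(log), Consumer(log)
    print(fast.poll())               # reads everything appended so far
    print(slow.poll(max_records=1))  # reads at its own pace
```

Because the log itself is never modified by a read, a slow or newly added consumer in this model does not affect any other consumer.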

Apache Kafka

Usage at LinkedIn
• 16 brokers in each cluster
• 28 billion messages/day
• Peak rates
  – Writes: 460,000 messages/second
  – Reads: 2,300,000 messages/second
• ~700 topics
• 40-50 live services consuming user-activity data
• Many ad hoc consumers
• Every production service is a producer (for metrics)
• 10k connections/colo
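
To put these figures in proportion, the short calculation below derives an average rate and two ratios from the numbers on the slide. It assumes that the 28 billion messages/day figure counts messages written, which the slide does not state, so the average and the peak-to-average ratio should be read as rough estimates.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 28_000_000_000   # from the slide
peak_writes_per_sec = 460_000       # from the slide
peak_reads_per_sec = 2_300_000      # from the slide

# Average rate, assuming the daily figure counts written messages.
avg_per_sec = messages_per_day / SECONDS_PER_DAY

# How far the write peak sits above that average.
peak_to_avg = peak_writes_per_sec / avg_per_sec

# Reads per write at peak: each message is read several times over.
read_fanout = peak_reads_per_sec / peak_writes_per_sec

if __name__ == "__main__":
    print(f"average: {avg_per_sec:,.0f} messages/second")
    print(f"peak writes / average: {peak_to_avg:.2f}")
    print(f"peak reads / peak writes: {read_fanout:.1f}")
```

Under that assumption the average is roughly 324,000 messages/second, and peak reads are five times peak writes, consistent with many consumers reading the same stream.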

Usage at LinkedIn

Standardize on Avro in data pipeline
{
  "type": "record",
  "name": "URIValidationRequestEvent",
  "namespace": "com.linkedin.event.usv",
  "fields": [
    {
      "name": "header",
      "type": {
        "type": "record",
        "name": "TrackingEventHeader",
        "namespace": "com.linkedin.event",
        "fields": [
          {
            "name": "memberId",
            "type": "int",
            "doc": "The member id of the user initiating the action"
          },
          {
            "name": "timeMs",
            "type": "long",
            "doc": "The time of the event"
          },
          {
            "name": "host",
            "type": "string",
            ...
...
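
As a rough illustration of pushing data cleanliness upstream, a producer could check an event header against the schema before publishing it. The sketch below is a hand-rolled check in plain Python, not the Avro library, and it covers only the three header fields visible on the slide; the rest of the schema is elided there and is not reconstructed. The function name `header_errors` is hypothetical.

```python
# Python types standing in for the Avro types of the three header fields
# visible on the slide (int, long, string).
HEADER_FIELDS = {"memberId": int, "timeMs": int, "host": str}

# Avro "int" is a 32-bit signed integer.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def header_errors(header: dict) -> list:
    """Return a list of problems found in a header; empty means it passed."""
    errors = []
    for name, expected in HEADER_FIELDS.items():
        if name not in header:
            errors.append(f"missing field: {name}")
            continue
        value = header[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {expected.__name__}")
        elif name == "memberId" and not INT32_MIN <= value <= INT32_MAX:
            errors.append(f"{name}: out of range for a 32-bit int")
    return errors


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    print(header_errors({"memberId": 42, "timeMs": 1372352400000,
                         "host": "app-host-1"}))
    print(header_errors({"memberId": "42", "timeMs": 1372352400000}))
```

Rejecting a malformed event at the producer means downstream consumers never need their own cleanup logic for it.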

Hadoop data load (Camus)
• Open sourced:
  – https://github.com/linkedin/camus
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
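
The "one job loads all events" idea can be sketched as a single generic routine that handles every topic the same way, so adding a topic needs no new pipeline code. The sketch below is not Camus code: the directory layout, the function names, and the topic names are hypothetical, and it groups events in memory instead of writing to HDFS.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(topic: str, time_ms: int) -> str:
    """Hypothetical hourly layout: /data/<topic>/<yyyy>/<mm>/<dd>/<hh>."""
    hour = datetime.fromtimestamp(time_ms / 1000, tz=timezone.utc)
    return f"/data/{topic}/{hour:%Y/%m/%d/%H}"


def load_all(events):
    """Group (topic, time_ms, payload) tuples by destination partition.

    The same code path serves every topic; nothing here is topic-specific.
    """
    partitions = defaultdict(list)
    for topic, time_ms, payload in events:
        partitions[partition_path(topic, time_ms)].append(payload)
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical topics and payloads.
    events = [
        ("PageViewEvent", 1372352400000, "pv-1"),
        ("PageViewEvent", 1372352401000, "pv-2"),
        ("SearchEvent", 1372356000000, "s-1"),
    ]
    for path, payloads in sorted(load_all(events).items()):
        print(path, payloads)
```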

Does it work?
“All published messages must be delivered to all consumers (quickly)”
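
One way to gather evidence for this requirement is to count messages per topic and time window at the producing side and again at each consuming side, then compare the counts. The sketch below illustrates that idea only; it is not LinkedIn's audit implementation, and the window size, function names, and topic name are assumptions made for the example.

```python
from collections import Counter

WINDOW_MS = 10 * 60 * 1000  # assumed 10-minute audit window


def bucket_counts(messages):
    """Count (topic, time_ms) messages per (topic, window start)."""
    counts = Counter()
    for topic, time_ms in messages:
        counts[(topic, time_ms - time_ms % WINDOW_MS)] += 1
    return counts


def completeness(produced, consumed):
    """Fraction of produced messages seen by the consumer, per bucket."""
    return {key: consumed.get(key, 0) / n for key, n in produced.items()}


if __name__ == "__main__":
    # Hypothetical topic; timestamps are milliseconds.
    sent = [("PageViewEvent", t) for t in (0, 1, 2, 3, 600_000)]
    received = [("PageViewEvent", t) for t in (0, 1, 2, 600_000)]
    report = completeness(bucket_counts(sent), bucket_counts(received))
    for bucket, ratio in sorted(report.items()):
        print(bucket, ratio)
```

A ratio below 1.0 for a closed window points to loss or delay in that window, which turns "does it work?" into a measurable question.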

Kafka replication (0.8)
• Intra-cluster replication feature
  – Facilitates high availability and durability
• Beta release available
  https://dist.apache.org/repos/dist/release/kafka/
• Rolled out in production at LinkedIn last week

Join us at our user-group meeting tonight @ LinkedIn!
– Thursday, June 27, 7.30pm to 9.30pm
– 2025 Stierlin Ct., Mountain View, CA
– http://www.meetup.com/http-kafka-apache-org/events/125887332/
– Presentations (replication overview and use-case studies) from:
  • RichRelevance
  • Netflix
  • Square
  • LinkedIn



4. People you may know


27. Audit Trail
