www.edureka.co/apache-Kafka
How Apache Kafka is transforming
Hadoop, Spark & Storm
What will you learn today?
• Million Dollar Question! Why do we need Kafka?
• What is Kafka?
• Kafka Architecture
• Kafka with Hadoop
• Kafka with Spark
• Kafka with Storm
• Companies using Kafka
• Demo on Kafka Messaging Service
Million Dollar Question!
Why do we need Kafka?
Why Kafka Cluster?
Why is Kafka preferred over more traditional brokers like JMS and AMQP?
Kafka Producer Performance with Other Systems
Kafka Consumer Performance with Other Systems
Salient Features of Kafka
Feature | Description
High Throughput | Supports millions of messages on modest hardware
Scalability | Highly scalable distributed system that can be expanded with no downtime
Replication | Messages can be replicated across the cluster, which supports multiple subscribers and rebalances consumers in case of failure
Durability | Persists messages to disk, where they can also be used for batch consumption
Stream Processing | Can be used alongside real-time streaming applications like Spark and Storm
Data Loss | With the proper configuration, Kafka can ensure zero data loss
Kafka Advantages
• Kafka can easily handle hundreds of thousands of messages per second
• The cluster can be expanded with no downtime, making Kafka highly scalable
• Messages are replicated, which provides reliability and durability
• Fault tolerant
• Scalable
What is Kafka?
• A distributed publish-subscribe messaging system
• Developed at LinkedIn
• Provides a solution for handling all activity stream data
• Fully supported on the Hadoop platform
• Partitions real-time consumption across a cluster of machines
• Provides a mechanism for parallel load into Hadoop
Apache Kafka: Overview
[Diagram components: three frontends, an external tracking proxy, two background services (producers), Kafka, two background services (consumers), Hadoop DWH]
Kafka Architecture
[Diagram components: producers (front end, services, proxies, adapters, other); a cluster of Kafka brokers with ZooKeeper; consumers (real time, NoSQL, Hadoop, warehouses)]
Kafka Core Components
The table below lists the core concepts of Kafka.
Component | Description
Topic | A category or feed to which messages are published
Producer | Publishes messages to a Kafka topic
Consumer | Subscribes to and consumes messages from a Kafka topic
Broker | Handles hundreds of megabytes of reads and writes per second
Kafka Topic
• A user-defined category to which messages are published
• For each topic, a partitioned log is maintained
• Each partition contains an ordered, immutable sequence of messages, each assigned a sequential ID number called the offset
• Writes to a partition are generally sequential, which reduces the number of hard disk seeks
• Reads from a partition can be random: a consumer can start from any offset
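The partition log described on this slide can be illustrated with a minimal in-memory sketch. This is not Kafka's implementation or API: the `PartitionLog` class and its method names are hypothetical, and a Python list stands in for the on-disk log.

```python
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of messages."""

    def __init__(self):
        self._messages = []  # the list index doubles as the message offset

    def append(self, message):
        """Append sequentially and return the offset assigned to the message."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at any offset (random read)."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
offsets = [log.append(m) for m in ["login", "click", "logout"]]
print(offsets)         # [0, 1, 2]
print(log.read(1, 2))  # ['click', 'logout']
```

Existing entries are never modified, which mirrors the "ordered, immutable sequence" on the slide, and a reader may start from any offset it likes.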
Kafka Producers
• Applications that publish messages to a topic in the Kafka cluster
• Can be of any kind, such as front-end or streaming applications
• When writing messages, a producer can also attach a key to each message; messages with the same key arrive in the same partition
• Need not wait for an acknowledgement from the Kafka cluster (depending on configuration)
• Publishes messages as fast as the brokers in the cluster can handle
[Diagram: three producers and the Kafka cluster]
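As a rough illustration of why messages with the same key land in the same partition, the sketch below hashes the key and takes the result modulo the partition count. The `choose_partition` function is hypothetical, and CRC32 is used only as a convenient stable hash; Kafka's own partitioner uses a different hash function.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "click")]
for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# Both "user-1" events end up in one partition, in publish order.
user1_partition = choose_partition(b"user-1", NUM_PARTITIONS)
print([v for k, v in partitions[user1_partition] if k == b"user-1"])
```

Because the mapping depends only on the key and the partition count, changing the number of partitions would change where a given key lands.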
Kafka Consumers
• Applications that subscribe to and consume messages from the brokers in the Kafka cluster
• Can be of any kind, such as real-time consumers or NoSQL consumers
• When consuming messages from a topic, a consumer group can be configured with multiple consumers
• Each consumer in a consumer group reads messages from a unique subset of partitions in each topic it subscribes to
• Messages with the same key arrive at the same consumer
• Supports both queuing and publish-subscribe
• Consumers have to keep track of how many messages they have consumed
[Diagram: the Kafka cluster and three consumers]
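The consumer-group behaviour above can be sketched as a partition assignment. The `assign_partitions` helper below is hypothetical and uses a simple round-robin split; it is meant only to show that, within a group, every partition goes to exactly one consumer, while separate groups each see all partitions.

```python
def assign_partitions(partitions, consumers):
    """Give each consumer in one group a disjoint subset of partitions."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


topic_partitions = [0, 1, 2, 3]
group_a = assign_partitions(topic_partitions, ["a1", "a2"])
group_b = assign_partitions(topic_partitions, ["b1"])
print(group_a)  # {'a1': [0, 2], 'a2': [1, 3]}
print(group_b)  # {'b1': [0, 1, 2, 3]}
```

Inside group A the load is split as in a queue; group B independently receives every partition, which gives publish-subscribe behaviour across groups. Since a key maps to one partition, all messages for that key reach the same consumer within a group.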
Kafka Brokers
• Each server in the cluster is called a broker
• Handles hundreds of MBs of writes from producers and reads from consumers
• Retains all published messages, whether or not they have been consumed
• Retention is configured for ‘n’ days
• Published messages are available for consumption for the configured ‘n’ days and are discarded thereafter
• Works like a queue if the consumer instances belong to the same consumer group; otherwise works like publish-subscribe
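A small sketch may help show what time-based retention means: messages are dropped by age alone, regardless of whether anyone has consumed them. The `apply_retention` function and the timestamps are illustrative assumptions; a real broker removes whole log segments rather than filtering individual messages.

```python
DAY = 24 * 60 * 60  # seconds in a day


def apply_retention(messages, now, retention_seconds):
    """Keep (timestamp, payload) pairs that are within the retention window."""
    cutoff = now - retention_seconds
    return [(ts, payload) for ts, payload in messages if ts >= cutoff]


stored = [(0 * DAY, "a"), (2 * DAY, "b"), (6 * DAY, "c")]
kept = apply_retention(stored, now=8 * DAY, retention_seconds=7 * DAY)
print(kept)  # [(172800, 'b'), (518400, 'c')]
```

Note that consumption state is not an input to the function: message "a" is discarded because it is more than seven days old, not because it was read.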
Kafka Producer-Broker-Consumer
How Kafka can be used with Hadoop
Kafka with Hadoop using Camus
• Camus is LinkedIn's Kafka -> HDFS pipeline
• It is a MapReduce job
• Distributes data loads out of Kafka
• At LinkedIn, it processes tens of billions of messages per day
• All work is done with a single Hadoop job
Courtesy: Confluent
How Kafka can be used with Spark
Kafka With Spark Streaming
• If messages are stored in ‘n’ partitions, reading them in parallel is faster
• In Kafka, messages are generally stored in multiple partitions
• Spark Streaming can perform these parallel reads effectively
• Parallel reads are achieved by integrating Spark's KafkaInputDStream with the Kafka High Level Consumer API
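The idea of reading ‘n’ partitions in parallel can be sketched without Spark at all. The example below is not the Spark Streaming or Kafka API; it uses Python's standard thread pool, a hypothetical `read_partition` function, and an in-memory dictionary standing in for a topic, purely to show one reader per partition.

```python
from concurrent.futures import ThreadPoolExecutor


def read_partition(partition_id, messages):
    """Pretend to consume one partition; return (partition_id, message count)."""
    return partition_id, len(messages)


# Hypothetical topic with three partitions held in memory.
topic = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}

# One worker per partition, so the partitions are read concurrently.
with ThreadPoolExecutor(max_workers=len(topic)) as pool:
    results = dict(pool.map(lambda item: read_partition(*item), topic.items()))

print(results)  # {0: 2, 1: 1, 2: 3}
```

With more partitions, more readers can work at the same time, which is the source of the speed-up the slide refers to.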
[Diagram labels: Apps, Events, Kafka, Streaming Engine]
How Kafka can be used with Storm
Kafka With Storm
Companies Using Kafka
Get Certified in Apache Kafka from Edureka
Edureka's Real-Time Analytics with Apache Kafka course:
• Carefully designed to provide knowledge and skills to become a successful Kafka Big Data Developer
• Helps you master the concepts of Kafka Cluster, Producers and Consumers, Kafka API, Kafka Integration with Hadoop, Storm
and Spark
• Ranges from fundamental concepts, such as the Kafka cluster and Kafka API, to advanced topics such as Kafka integration with Hadoop, Storm, Spark, Maven, etc.
• Online Live Courses: 15 hours
• Assignments: 25 hours
• Project: 20 hours
• Lifetime Access + 24 X 7 Support
Go to www.edureka.co/apache-kafka
Batch starts from 10th October (Weekend Batch)
Thank You
Questions/Queries/Feedback/Survey
Recording and presentation will be made available to you within 24 hours