11
Introduction to Apache Kafka and Confluent
... and why they matter!
Kafka Meetup - Johannesburg
Tuesday, March 20th 2018
18:00 – 20:00
SSA - Maxwell Office Park, Magwa Cres, Waterfall City, Midrand, 2090 · Midrand
https://www.meetup.com/Johannesburg-Kafka-Meetup/events/248465767/
22
How Organizations Handle Data Flows: a Giant Mess
Data
Warehouse
Hadoop
NoSQL
Oracle
SFDC
Logging
Bloomberg
…any sink/source
Web Custom Apps Microservices Monitoring Analytics
…and more
OLTP
ActiveMQ
App App
Caches
OLTP OLTP App App App
33
Apache Kafka™: A Distributed Streaming Platform
Apache Kafka
Offline Batch (+1 Hour) | Near-Real Time (>100s ms) | Real Time (0-100 ms)
Data
Warehouse
Hadoop
NoSQL
Oracle
SFDC
Twitter
Bloomberg
…any sink/source …any sink/source
…and more
Web Custom Apps Microservices Monitoring Analytics
44
More than 1 petabyte of data in Kafka
Over 1.2 trillion messages per day
Thousands of data streams
Source of all data warehouse & Hadoop data
Over 300 billion user-related events per day
55
Over 35% of Fortune 500 companies are using Apache Kafka™
6 of top 10 Travel
7 of top 10 Global banks
8 of top 10 Insurance
9 of top 10 Telecom
66
Industry Trends… and why Apache Kafka matters!
1. From ‘big data’ (batch) to ‘fast data’ (stream processing)
2. Internet of Things (IoT) and sensor data
3. Microservices and asynchronous communication (coordination messages and data streams) between loosely coupled and fine-grained services
77
Apache Kafka APIs – A UNIX Analogy
$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt
Connect APIs
Streams APIs
Producer / Consumer APIs
88
Apache Kafka API – ETL Analogy
Source -> Connect API -> Streams API -> Connect API -> Sink
Extract -> Transform -> Load
99
Apache Kafka 101
Internals and Core Concepts
1010
Apache Kafka Concepts: Persistent Log
Data Producer
0 1 2 3 4 5 6 7 8 9 10 11 12
writes
Data Consumer
(offset = 7)
Data Consumer
(offset = 11)
reads reads
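The persistent-log picture on this slide can be sketched in plain Java. This is a toy in-memory model, not Kafka's implementation: the class name `ToyLog` and its methods are hypothetical, chosen only to show that records get sequential offsets and that each consumer tracks its own position independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of an append-only log (illustrative only, not Kafka code). */
final class ToyLog {
    private final List<String> records = new ArrayList<>();
    // Each consumer keeps its own position: the offset of the next record it will read.
    private final Map<String, Integer> positions = new HashMap<>();

    /** Appends a record at the end of the log and returns the offset it was assigned. */
    int append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    /** Moves a consumer to an arbitrary offset, for example to replay older records. */
    void seek(String consumer, int offset) {
        positions.put(consumer, offset);
    }

    /** Returns the next record for this consumer, or null if it has caught up. */
    String poll(String consumer) {
        int offset = positions.getOrDefault(consumer, 0);
        if (offset >= records.size()) {
            return null;
        }
        positions.put(consumer, offset + 1);
        return records.get(offset);
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        for (int i = 0; i <= 12; i++) {
            log.append("record-" + i);
        }
        // Two consumers at different offsets, as on the slide.
        log.seek("consumerA", 7);
        log.seek("consumerB", 11);
        System.out.println(log.poll("consumerA"));
        System.out.println(log.poll("consumerB"));
    }
}
```

Reading does not remove anything from the log, which is why a slow consumer at offset 7 and a fast one at offset 11 can coexist.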
1111
Apache Kafka Concepts: Anatomy of a Topic
partition 0: 0 1 2 3 4 5 6 7 8 9 10 11 12
partition 1: 0 1 2 3 4 5 6 7
partition 2: 0 1 2 3 4 5
writes
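A rough sketch of how keyed writes land in partitions, assuming a simplified partitioner. Kafka's Java producer hashes the serialized key and takes it modulo the partition count; here `String.hashCode` stands in for the real hash, and `ToyTopic` is a hypothetical name used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy topic made of several independent append-only partitions (not Kafka code). */
final class ToyTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    ToyTopic(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Maps a key to a partition. Assumption: String.hashCode replaces the real hash. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends to the key's partition and returns {partition, offset within partition}. */
    int[] send(String key, String value) {
        int partition = partitionFor(key);
        List<String> log = partitions.get(partition);
        log.add(value);
        return new int[] {partition, log.size() - 1};
    }
}
```

Because the partition depends only on the key, all records with the same key stay in order relative to each other, while offsets in different partitions advance independently.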
1212
Apache Kafka Concepts: Log Storage
offset index
timestamp index
offsets: 0 - 10000
offset index
timestamp index
offsets: 10001 - 20000
offset index
timestamp index
offsets: 20001 - 30000
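The segment layout on this slide suggests how a broker can find the file that holds a given offset: pick the segment with the greatest base offset that is not larger than the target. The sketch below is only an illustration of that lookup using a sorted map; `SegmentLookup` and the segment labels are hypothetical, and the per-segment offset and timestamp indexes are not modelled.

```java
import java.util.Map;
import java.util.TreeMap;

/** Finds the log segment that contains an offset (illustrative sketch, not Kafka code). */
final class SegmentLookup {
    // Base offset of each segment -> a label for that segment.
    private final TreeMap<Long, String> segments = new TreeMap<>();

    void addSegment(long baseOffset, String label) {
        segments.put(baseOffset, label);
    }

    /** Returns the segment with the greatest base offset <= offset, or null if none. */
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segments.floorEntry(offset);
        return entry == null ? null : entry.getValue();
    }
}
```

With segments starting at 0, 10001 and 20001, a lookup for offset 10000 resolves to the first segment and a lookup for 10001 to the second.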
1313
Apache Kafka Concepts: Message Format
Field | Size
offset | 8 bytes
length | 4 bytes
CRC | 4 bytes
magic byte | 1 byte
attribute | 1 byte
timestamp | 8 bytes
key length | 4 bytes
key content | varies
value length | 4 bytes
value content | varies
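As a small worked example, the fixed-width fields on this slide add up to 8 + 4 + 4 + 1 + 1 + 8 + 4 + 4 = 34 bytes per message, before the key and value themselves. The sketch below just does that arithmetic; `MessageSize` is a hypothetical helper, it assumes non-null UTF-8 string keys and values, and it ignores batching and compression.

```java
import java.nio.charset.StandardCharsets;

/** Adds up the per-message field sizes shown on the slide (arithmetic sketch only). */
final class MessageSize {
    // Fixed-width fields, in bytes, as listed on the slide.
    static final int OFFSET = 8, LENGTH = 4, CRC = 4, MAGIC = 1, ATTRIBUTES = 1,
            TIMESTAMP = 8, KEY_LENGTH = 4, VALUE_LENGTH = 4;

    static final int FIXED_OVERHEAD =
            OFFSET + LENGTH + CRC + MAGIC + ATTRIBUTES + TIMESTAMP + KEY_LENGTH + VALUE_LENGTH;

    /** Total bytes for one uncompressed message with the given key and value. */
    static int sizeOf(String key, String value) {
        int keyBytes = key.getBytes(StandardCharsets.UTF_8).length;
        int valueBytes = value.getBytes(StandardCharsets.UTF_8).length;
        return FIXED_OVERHEAD + keyBytes + valueBytes;
    }
}
```

For very small payloads the fixed overhead dominates, which is one reason producers batch messages.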
1414
Apache Kafka Concepts: Producers and Consumers
Producer
Producer
Producer
Consumer
Consumer
Broker
Broker
Broker
1515
Apache Kafka Concepts: Topics and Partitions
Producer
Producer
Producer
Consumer
Consumer
Broker
Broker
Broker
T0: P0
T0: P2
T0: P1
T0: P3
T1: P0
T1: P1
1616
Apache Kafka Concepts: Fault Tolerance and Replication
Producer
Producer
Producer
Consumer
Consumer
Broker
Broker
Broker
T0: P0
T0: P0 (Replica 1)
T1: P0
T1: P0 (Replica 1)
1717
Apache Kafka Concepts: Consumer Groups
Producer
Producer
Producer
Consumer
Broker
Broker
Broker
T0: P0
T0: P2
T0: P1
T0: P3
T1: P0
T1: P1
Consumer
Consumer
Consumer
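Within a consumer group, each partition is read by exactly one consumer, so adding consumers spreads the partitions across them. The sketch below shows one simple way to do that, a round-robin split. It is an assumption-laden illustration: `GroupAssignment` is a hypothetical helper, and Kafka's built-in assignors and rebalancing protocol are more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Round-robin partition assignment within one consumer group (illustrative sketch). */
final class GroupAssignment {
    /** Gives every partition exactly one owner, cycling through the consumers in order. */
    static Map<String, List<String>> assign(List<String> consumers, List<String> partitions) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one consumer");
        }
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            String owner = consumers.get(i % consumers.size());
            assignment.get(owner).add(partitions.get(i));
        }
        return assignment;
    }
}
```

With four partitions and two consumers each consumer owns two partitions; with more consumers than partitions, the extra consumers stay idle.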
1818
The Connect API of Apache Kafka®
• Centralized management and configuration
• Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
• Supports CDC ingest of events from RDBMS
• Preserves data schema
• Fault tolerant and automatically load balanced
• Extensible API
• Single Message Transforms
• Part of Apache Kafka, included in Confluent Open Source
Reliable and scalable integration of Kafka with other systems – no coding required.
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
"table.whitelist": "sales,orders,customers"
}
https://docs.confluent.io/current/connect/
1919
Build Applications, not Clusters
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>1.0.0</version>
</dependency>
2020
Spot the Difference(s)!
2121
How do I run in production?
2222
How do I run in production?
Uncool Cool
2323
How do I run in production?
http://docs.confluent.io/current/streams/introduction.html
2424
Elastic and Scalable
http://docs.confluent.io/current/streams/developer-guide.html#elastic-scaling-of-your-application
2525
Elastic and Scalable
http://docs.confluent.io/current/streams/developer-guide.html#elastic-scaling-of-your-application
2626
Elastic and Scalable
http://docs.confluent.io/current/streams/developer-guide.html#elastic-scaling-of-your-application
2727
Typical High Level Architecture
Real-time
Data
Ingestion
2828
Typical High Level Architecture
Stream
Processing
Real-time
Data
Ingestion
2929
Typical High Level Architecture
Stream
Processing
Storage
Real-time
Data
Ingestion
3030
Typical High Level Architecture
Data Publishing /
Visualization
Stream
Processing
Storage
Real-time
Data
Ingestion
3131
How many clusters do you count?
NoSQL (Cassandra, HBase, Couchbase, MongoDB, …) or Elasticsearch, Solr, …
Storm, Flink, Spark Streaming, Ignite, Akka Streams, Apex, …
HDFS, NFS, Ceph, GlusterFS, Lustre, ...
Apache Kafka
3232
Simplicity is the Ultimate Sophistication
Node.js
Apache Kafka
Distributed Streaming Platform
Publish & Subscribe to streams of data like a messaging system
Store streams of data safely in a distributed replicated cluster
Process streams of data efficiently and in real-time
3333
Duality of Streams and Tables
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
3434
Duality of Streams and Tables
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
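The stream-table duality named on these slides can be illustrated with plain collections: replaying a changelog stream and keeping the latest value per key yields a table, and emitting one update per row turns a table back into a stream. This is a minimal sketch under that reading; `StreamTableDuality` and its method names are hypothetical and no Kafka Streams classes are involved.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Converts between a changelog stream and a table (illustrative sketch). */
final class StreamTableDuality {
    /** Stream -> table: replay the updates in order, keeping the latest value per key. */
    static Map<String, Long> toTable(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> update : changelog) {
            table.put(update.getKey(), update.getValue());
        }
        return table;
    }

    /** Table -> stream: emit one update record for each row of the table. */
    static List<Map.Entry<String, Long>> toStream(Map<String, Long> table) {
        List<Map.Entry<String, Long>> stream = new ArrayList<>();
        for (Map.Entry<String, Long> row : table.entrySet()) {
            stream.add(new SimpleEntry<>(row.getKey(), row.getValue()));
        }
        return stream;
    }
}
```

Replaying the updates (alice, 1), (bob, 1), (alice, 2) gives a table with alice at 2 and bob at 1, and converting that table to a stream and back reproduces the same table.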
3535
Interactive Queries
http://docs.confluent.io/current/streams/developer-guide.html#streams-developer-guide-interactive-queries
3636
Interactive Queries
http://docs.confluent.io/current/streams/developer-guide.html#streams-developer-guide-interactive-queries
3737
Kafka Streams DSL
http://docs.confluent.io/current/streams/developer-guide.html#kafka-streams-dsl
3838
WordCount (and Java 8+)
WordCountLambdaExample.java
final Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
...
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
final KStreamBuilder builder = new KStreamBuilder();
// Read lines of text from the input topic.
final KStream<String, String> textLines =
    builder.stream(stringSerde, stringSerde, "TextLinesTopic");
// Split each line on runs of non-word characters.
final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
final KTable<String, Long> wordCounts = textLines
    .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
    .groupBy((key, word) -> word)
    .count("Counts");
// Write the running counts to the output topic.
wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");
final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
streams.cleanUp();
streams.start();
// Close the Streams application cleanly on JVM shutdown.
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
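To see what this topology computes without a broker, the same split-and-count logic can be run on in-memory lines. This is a plain-Java sketch, not Kafka Streams code: `PlainWordCount` is a hypothetical class, it uses the pattern `\\W+` for splitting, and unlike the topology it drops empty tokens.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Counts words across lines using the same lowercase-and-split steps (sketch only). */
final class PlainWordCount {
    // Runs of non-word characters act as separators between words.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    static Map<String, Long> count(String... lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : NON_WORD.split(line.toLowerCase())) {
                // A line that starts with a separator yields a leading empty token; skip it.
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}
```

The difference from the streaming version is that this returns one final result, whereas the KTable is updated continuously as new lines arrive.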
3939
Easy to Develop with, Easy to Test
WordCountLambdaIntegrationTest.java
EmbeddedSingleNodeKafkaCluster CLUSTER = new EmbeddedSingleNodeKafkaCluster();
...
CLUSTER.createTopic(inputTopic);
...
Properties producerConfig = new Properties();
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, CLUSTER.bootstrapServers());
4040
The Streams API of Apache Kafka®
• No separate processing cluster required
• Develop on Mac, Linux, Windows
• Deploy to containers, VMs, bare metal, cloud
• Powered by Kafka: elastic, scalable, distributed, battle-tested
• Perfect for small, medium, large use cases
• Fully integrated with Kafka security
• Exactly-once processing semantics
• Part of Apache Kafka, included in Confluent Open Source
Write standard Java applications and microservices to process your data in real-time
KStream<User, PageViewEvent> pageViews = builder.stream("pageviews-topic");
KTable<Windowed<User>, Long> viewsPerUserSession = pageViews
.groupByKey()
.count(SessionWindows.with(TimeUnit.MINUTES.toMillis(5)), "session-views");
https://docs.confluent.io/current/streams/
4141
KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent
• No coding required, all you need is SQL
• No separate processing cluster required
• Powered by Kafka: elastic, scalable, distributed, battle-tested
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u
ON c.userid = u.userid
WHERE u.level = 'Platinum';
KSQL is the simplest way to process streams of data in real-time
• Perfect for streaming ETL, anomaly detection, event monitoring, and more
• Part of Confluent Open Source
https://github.com/confluentinc/ksql
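The possible_fraud query on this slide groups authorization attempts by card number into 5-second tumbling windows and keeps the groups with more than 3 attempts. The sketch below reproduces that logic over in-memory data to show what a tumbling window does. It is not KSQL or Kafka Streams code; `TumblingFraudCheck`, its parameters and the "card@windowStart" key format are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Counts attempts per card in fixed 5-second windows (illustrative sketch). */
final class TumblingFraudCheck {
    static final long WINDOW_MS = 5_000L;

    /** Start of the tumbling window containing the timestamp; windows never overlap. */
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_MS);
    }

    /** Returns "card@windowStart" keys whose attempt count exceeds the threshold. */
    static Set<String> suspicious(String[] cards, long[] timestampsMs, long threshold) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i < cards.length; i++) {
            String key = cards[i] + "@" + windowStart(timestampsMs[i]);
            counts.merge(key, 1L, Long::sum);
        }
        Set<String> flagged = new TreeSet<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }
}
```

Four attempts on the same card at 0, 1000, 2000 and 4999 ms fall into the window starting at 0 and are flagged, while a fifth attempt at 5000 ms starts a new window with a count of one.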
Do you think that’s a table you are querying?
4343
KSQL in less than 5 minutes
https://www.youtube.com/watch?v=A45uRzJiv7I
4444
Confluent Enterprise: Logical Architecture
Kafka Cluster
Mainframe
Kafka Connect Servers
Kafka Connect
RDBMS
Hadoop
Cassandra
Elasticsearch
Kafka Connect Servers
Kafka Connect
Files
Producer
Application
Consumer
Application
Zookeeper
Kafka Broker
REST Proxy Servers
REST Proxy
REST Client
Control Center Servers
Control Center
Schema Registry Servers
Schema Registry
Kafka Producer APIs Kafka Consumer APIs
Stream Processing Application 1
Stream Client
Stream Processing Application 2
Stream Client
REST Proxy Servers
REST Proxy
REST Client
4545
Confluent Enterprise: Physical Architecture
Rack 1
Kafka Broker #1
ToR Switch
ToR Switch
Schema Registry #1
Kafka Connect #1
Zookeeper #1
REST Proxy #1
Kafka Broker #4
Zookeeper #4
Rack 2
Kafka Broker #2
ToR Switch
ToR Switch
Schema Registry #2
Kafka Connect #2
Zookeeper #2
Kafka Broker #5
Zookeeper #5
Rack 3
Kafka Broker #3
ToR Switch
ToR Switch
Kafka Connect #3
Zookeeper #3
Core Switch Core Switch
REST Proxy #2
Load Balancer Load Balancer
Control Center #1 Control Center #2
4646
Confluent Completes Kafka
Columns: Feature | Benefit | Apache Kafka | Confluent Open Source | Confluent Enterprise
Apache Kafka | High throughput, low latency, high availability, secure distributed streaming platform
Kafka Connect API | Advanced API for connecting external sources/destinations into Kafka
Kafka Streams API | Simple library that enables streaming application development within the Kafka framework
Additional Clients | Supports non-Java clients; C, C++, Python, .NET and several others
REST Proxy | Provides universal access to Kafka from any network connected device via HTTP
Schema Registry | Central registry for the format of Kafka data – guarantees all data is always consumable
Pre-Built Connectors | HDFS, JDBC, Elasticsearch, Amazon S3 and other connectors fully certified and supported by Confluent
JMS Client | Support for legacy Java Message Service (JMS) applications consuming and producing directly from Kafka
Confluent Control Center | Enables easy connector management, monitoring and alerting for a Kafka cluster
Auto Data Balancer | Rebalancing data across cluster to remove bottlenecks
Replicator | Multi-datacenter replication simplifies and automates MDC Kafka clusters
Support | Enterprise class support to keep your Kafka environment running at top performance | Community | Community | 24x7x365
4747
Big Data and Fast Data Ecosystems
Synchronous Req/Response
0 – 100s ms
Near Real Time
> 100s ms
Offline Batch
> 1 hour
Apache Kafka
Stream Data Platform
Search
RDBMS
Apps Monitoring
Real-time
Analytics
NoSQL
Stream
Processing
Apache Hadoop
Data Lake
Impala
DWH
Hive
Spark Map-Reduce
Confluent HDFS Connector
(exactly once semantics)
https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
4848
Building a Microservices Ecosystem with Kafka Streams and KSQL
https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
https://github.com/confluentinc/kafka-streams-examples/tree/3.3.0-post/src/main/java/io/confluent/examples/streams/microservices
4949
Microservices: References
Blog post series:
Part 1: The Data Dichotomy: Rethinking the Way We Treat Data and Services
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
Part 2: Build Services on a Backbone of Events
https://www.confluent.io/blog/build-services-backbone-events/
Part 3: Using Apache Kafka as a Scalable, Event-Driven Backbone for Service Architectures
https://www.confluent.io/blog/apache-kafka-for-service-architectures/
Part 4: Chain Services with Exactly Once Guarantees
https://www.confluent.io/blog/chain-services-exactly-guarantees/
Part 5: Messaging as the Single Source of Truth
https://www.confluent.io/blog/messaging-single-source-truth/
Part 6: Leveraging the Power of a Database Unbundled
https://www.confluent.io/blog/leveraging-power-database-unbundled/
Part 7: Building a Microservices Ecosystem with Kafka Streams and KSQL
https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
Whitepaper:
Microservices in the Apache Kafka™ Ecosystem
https://www.confluent.io/resources/microservices-in-the-apache-kafka-ecosystem/
5050
Apache Kafka Security
Security
• Processes customer data
• Regulatory requirements
• Legal compliance
• Internal security policies
• Need is not limited to industries such as finance, healthcare, or governmental services
Helps meet security requirements by supporting:
Authentication
• Scenario example: “Only certain applications may talk to the production Kafka cluster”
• Client authentication via SASL – e.g. Kerberos, Active Directory
Authorization
• Scenario example: “Only certain applications may read data from sensitive Kafka topics”
• Restrict who can create, write to, read from topics, and more
Encryption
• Scenario example: “Data-in-transit between apps and Kafka clusters must be encrypted”
• SSL supported
• Encrypts data exchanged between Kafka brokers, and between Kafka brokers and Kafka clients/apps
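On the client side, the authentication and encryption points on this slide usually come down to a few configuration properties. The sketch below builds such a configuration for a client that authenticates with Kerberos over SASL and encrypts traffic with SSL. The property keys are standard Kafka client settings; the helper class `SecureClientConfig`, the service name "kafka" and the values passed in are assumptions for illustration, and broker-side settings and ACLs for authorization are not shown.

```java
import java.util.Properties;

/** Builds client properties for SASL authentication over SSL (illustrative sketch). */
final class SecureClientConfig {
    static Properties build(String bootstrapServers, String truststorePath,
                            String truststorePassword) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // SASL for authentication, SSL for encryption of data in transit.
        props.setProperty("security.protocol", "SASL_SSL");
        // GSSAPI is the SASL mechanism used for Kerberos.
        props.setProperty("sasl.mechanism", "GSSAPI");
        // Assumption: the brokers run under the Kerberos service name "kafka".
        props.setProperty("sasl.kerberos.service.name", "kafka");
        // Truststore holding the CA certificate used to verify the brokers.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", truststorePassword);
        return props;
    }
}
```

The resulting `Properties` object would be passed to a producer or consumer constructor alongside the usual serializer settings.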
5151
Enterprise Ready Multi-Datacenter Replication for Kafka
Data Center in USA
Kafka Cluster (USA)
Kafka Broker 1
Kafka Broker 2
Kafka Broker 3
ZooKeeper 1
ZooKeeper 2
ZooKeeper 3
Control Center
Kafka Connect
Cluster
Replicator 1
Replicator 2
Data Center in EMEA
Kafka Cluster (EU)
Kafka Broker 1
Kafka Broker 2
Kafka Broker 3
ZooKeeper 1
ZooKeeper 2
ZooKeeper 3
Control Center
Kafka Connect
Cluster
Replicator 1
Replicator 2
Available only with Confluent Enterprise
Apache Kafka and Confluent Open Source
5252
Cloud Synchronization and Migrations with Confluent Enterprise: Before
DC1
DB2
DB1
DWH
App2
App3
App4
KV2KV3
DB3
App2-v2
App5
App7
App1-v2
AWS
App8
DWH
App1
Challenges
• Each team/department must execute its own cloud migration
• May be moving the same data multiple times
• Each box represented here requires development, testing, deployment, monitoring and maintenance
KV
5353
DC1
Cloud Synchronization and Migrations with Confluent Enterprise: After
DB2
DB1
KV
DWH
App2
App4
KV2KV3
App2-v2
App5 App7
App1-v2
AWS
App8
DWH
App1
Kafka
Kafka
App3
Benefits
• Continuous low-latency synchronization
• Centralized manageability and monitoring
– Track data produced in all data centers at the event level
• Security and governance
– Track and control where data comes from and who is accessing it
• Cost Savings
– Move Data Once
DB3
5454
About Confluent and Apache Kafka™
70% of active Kafka Committers
Founded September 2014
Technology developed while at LinkedIn
Founded by the creators of Apache Kafka
5555
Apache Kafka: PMC members and committers
https://kafka.apache.org/committers
PMC
PMC PMC PMCPMC PMC PMC PMC
PMC PMC PMC
5656
Download Confluent Platform: the easiest way to get started
https://www.confluent.io/download/
5757
Books: get all three in PDF format from the Confluent website!
https://www.confluent.io/apache-kafka-stream-processing-book-bundle
5858
Discount code: kacom17
Presented by
https://kafka-summit.org/
  • 1. 11 Introduction to Apache Kafka and Confluent ... and why they matter! Kafka Meetup - Johannesburg Tuesday, March 20th 2018 18:00 – 20:00 SSA - Maxwell Office Park, Magwa Cres, Waterfall City, Midrand, 2090 · Midrand https://www.meetup.com/Johannesburg-Kafka-Meetup/events/248465767/
  • 2. 22 How Organizations Handle Data Flows: a Giant Mess Data Warehouse Hadoop NoSQL Oracle SFDC Logging Bloomberg …any sink/source Web Custom Apps Microservices Monitoring Analytics …and more OLTP ActiveMQ App App Caches OLTP OLTPAppAppApp
  • 3. 33 Apache Kafka™: A Distributed Streaming Platform Apache Kafka Offline Batch (+1 Hour)Near-Real Time (>100s ms)Real Time (0-100 ms) Data Warehouse Hadoop NoSQL Oracle SFDC Twitter Bloomberg …any sink/source …any sink/source …and more Web Custom Apps Microservices Monitoring Analytics
  • 4. 44 More than 1 petabyte of data in Kafka Over 1.2 trillion messages per day Thousands of data streams Source of all data warehouse & Hadoop data Over 300 billion user- related events per day
  • 5. 55 Over 35% of Fortune 500’s are using Apache Kafka™ 6 of top 10 Travel 7 of top 10 Global banks 8 of top 10 Insurance 9 of top 10 Telecom
  • 6. 66 Industry Trends… and why Apache Kafka matters! 1. From ‘big data’ (batch) to ‘fast data’ (stream processing) 2. Internet of Things (IoT) and sensor data 3. Microservices and asynchronous communication (coordination messages and data streams) between loosely coupled and fine- grained services
  • 7. 77 Apache Kafka APIs – A UNIX Analogy $ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt Connect APIs Streams APIs Producer / Consumer APIs
  • 8. 88 Apache Kafka API – ETL Analogy Source SinkConnectAPI ConnectAPI Streams API Extract Transform Load
  • 9. 99 Apache Kafka 101 Internals and Core Concepts
  • 10. 1010 Apache Kafka Concepts: Persistent Log Data Producer 0 1 2 3 4 5 6 7 8 9 10 11 12 writes Data Consumer (offset = 7) Data Consumer (offset = 11) reads reads
  • 11. 1111 Apache Kafka Concepts: Anatomy of a Topic 0 1 2 3 4 5 6 7 8 9 10 11 12partition 0 0 1 2 3 4 5 6 7 40 1 2 3 5 partition 1 partition 2 writes
  • 12. 1212 Apache Kafka Concepts: Log Storage offset index timestamp index offsets: 0 - 10000 offset index timestamp index offsets: 10001 - 20000 offset index timestamp index offsets: 20001 - 30000
  • 13. 1313 Apache Kafka Concepts: Message Format 8 bytes 4 bytes 4 bytes 8 bytes 4 bytes varies 4 bytes varies offset length CRC timesta mp key length value length key content value content magic byte 1 byte attribute 1 byte
  • 14. 1414 Apache Kafka Concepts: Producers and Consumers Producer Producer Producer Consumer Consumer Broker Broker Broker
  • 15. 1515 Apache Kafka Concepts: Topics and Partitions Producer Producer Producer Consumer Consumer Broker Broker Broker T0: P0 T0: P2 T0: P1 T0: P3 T1: P0 T1: P1
  • 16. 1616 Apache Kafka Concepts: Fault Tolerance and Replication Producer Producer Producer Consumer Consumer Broker Broker Broker T0: P0 T0: P0 (Replica 1) T1: P0 T1: P0 (Replica 1)
  • 17. 1717 Apache Kafka Concepts: Consumer Groups Producer Producer Producer Consumer Broker Broker Broker T0: P0 T0: P2 T0: P1 T0: P3 T1: P0 T1: P1 Consumer Consumer Consumer
  • 18. 1818 The Connect API of Apache Kafka®  Centralized management and configuration  Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3  Supports CDC ingest of events from RDBMS  Preserves data schema  Fault tolerant and automatically load balanced  Extensible API  Single Message Transforms  Part of Apache Kafka, included in Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/
  • 19. 1919 Build Applications, not Clusters <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka-streams</artifactId> <version>1.0.0</version> </dependency>
  • 21. 2121 How do I run in production?
  • 22. 2222 How do I run in production? Uncool Cool
  • 23. 2323 How do I run in production? http://docs.confluent.io/current/streams/introduction.html
  • 27. 2727 Typical High Level Architecture Real-time Data Ingestion
  • 28. 2828 Typical High Level Architecture Stream Processing Real-time Data Ingestion
  • 29. 2929 Typical High Level Architecture Stream Processing Storage Real-time Data Ingestion
  • 30. 3030 Typical High Level Architecture Data Publishing / Visualization Stream Processing Storage Real-time Data Ingestion
  • 31. 3131 How many clusters do you count? NoSQL (Cassandra, HBase, Couchbase, MongoDB, …) or Elasticsearch, Solr, … Storm, Flink, Spark Streaming, Ignite, Akka Streams, Apex, … HDFS, NFS, Ceph, GlusterFS, Lustre, ... Apache Kafka
  • 32. 3232 Simplicity is the Ultimate Sophistication Node.js Apache Kafka Distributed Streaming Platform Publish & Subscribe to streams of data like a messaging system Store streams of data safely in a distributed replicated cluster Process streams of data efficiently and in real-time
  • 33. 3333 Duality of Streams and Tables http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
  • 34. 3434 Duality of Streams and Tables http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
  • 38. 3838 WordCount (and Java 8+) WordCountLambdaExample.java final Properties streamsConfiguration = new Properties(); streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example"); streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers); ... final Serde<String> stringSerde = Serdes.String(); final Serde<Long> longSerde = Serdes.Long(); final KStreamBuilder builder = new KStreamBuilder(); final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic"); final Pattern pattern = Pattern.compile("W+", Pattern.UNICODE_CHARACTER_CLASS); final KTable<String, Long> wordCounts = textLines .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase()))) .groupBy((key, word) -> word) .count("Counts"); wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic"); final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration); streams.cleanUp(); streams.start(); Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  • 39. 3939 Easy to Develop with, Easy to Test WordCountLambdaIntegrationTest.java EmbeddedSingleNodeKafkaCluster CLUSTER = new EmbeddedSingleNodeKafkaCluster(); ... CLUSTER.createTopic(inputTopic); ... Properties producerConfig = new Properties(); producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, CLUSTER.bootstrapServers());
  • 40. 4040 The Streams API of Apache Kafka®  No separate processing cluster required  Develop on Mac, Linux, Windows  Deploy to containers, VMs, bare metal, cloud  Powered by Kafka: elastic, scalable, distributed, battle-tested  Perfect for small, medium, large use cases  Fully integrated with Kafka security  Exactly-once processing semantics  Part of Apache Kafka, included in Confluent Open Source Write standard Java applications and microservices to process your data in real-time KStream<User, PageViewEvent> pageViews = builder.stream("pageviews-topic"); KTable<Windowed<User>, Long> viewsPerUserSession = pageViews .groupByKey() .count(SessionWindows.with(TimeUnit.MINUTES.toMillis(5)), "session-views"); https://docs.confluent.io/current/streams/
  • 41. KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent
      KSQL is the simplest way to process streams of data in real-time
      • No coding required, all you need is SQL
      • No separate processing cluster required
      • Powered by Kafka: elastic, scalable, distributed, battle-tested
      • Perfect for streaming ETL, anomaly detection, event monitoring, and more
      • Part of Confluent Open Source
      CREATE TABLE possible_fraud AS
        SELECT card_number, count(*)
        FROM authorization_attempts
        WINDOW TUMBLING (SIZE 5 SECONDS)
        GROUP BY card_number
        HAVING count(*) > 3;
      CREATE STREAM vip_actions AS
        SELECT userid, page, action
        FROM clickstream c
        LEFT JOIN users u ON c.userid = u.userid
        WHERE u.level = 'Platinum';
      https://github.com/confluentinc/ksql
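To make the `possible_fraud` query more concrete, the sketch below reproduces its logic in plain Java over a finite list: assign each attempt to a fixed 5-second tumbling window, count per card and window, and keep only counts above 3. This is an illustration of the windowing arithmetic, not how KSQL executes the query. The class `TumblingWindowSketch`, the record type `Attempt`, the `card@windowStart` key format, and the assumption of non-negative millisecond timestamps are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSketch {
    /** One authorization attempt (hypothetical record shape for this sketch). */
    public static final class Attempt {
        final String cardNumber;
        final long timestampMs;

        public Attempt(String cardNumber, long timestampMs) {
            this.cardNumber = cardNumber;
            this.timestampMs = timestampMs;
        }
    }

    /**
     * Counts attempts per card and tumbling window, then keeps only the
     * entries whose count is strictly greater than the threshold
     * (the HAVING count(*) > 3 step). Keys look like "card@windowStartMs".
     */
    public static Map<String, Long> possibleFraud(List<Attempt> attempts,
                                                  long windowSizeMs,
                                                  long threshold) {
        Map<String, Long> counts = new TreeMap<>();
        for (Attempt a : attempts) {
            // Tumbling windows do not overlap: each timestamp falls in exactly one.
            long windowStart = (a.timestampMs / windowSizeMs) * windowSizeMs;
            counts.merge(a.cardNumber + "@" + windowStart, 1L, Long::sum);
        }
        counts.values().removeIf(c -> c <= threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = Arrays.asList(
                new Attempt("A", 0L), new Attempt("A", 1_000L),
                new Attempt("A", 2_000L), new Attempt("A", 3_000L),
                new Attempt("A", 5_000L), new Attempt("B", 100L));
        System.out.println(possibleFraud(attempts, 5_000L, 3L));
    }
}
```

In the sample data, card A has four attempts in the window starting at 0 ms and one in the next window, so only the first window is reported.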
  • 42. Do you think that’s a table you are querying?
  • 43. KSQL in less than 5 minutes https://www.youtube.com/watch?v=A45uRzJiv7I
  • 44. Confluent Enterprise: Logical Architecture
      [Diagram: a Kafka cluster (Kafka brokers, ZooKeeper) surrounded by Kafka Connect servers linking Mainframe, RDBMS, Hadoop, Cassandra, Elasticsearch and Files; REST Proxy servers serving REST clients; Schema Registry servers; Control Center servers; producer and consumer applications using the Kafka Producer and Consumer APIs; and stream processing applications 1 and 2, each with a Stream client.]
  • 45. Confluent Enterprise: Physical Architecture
      [Diagram: three racks, each with two ToR switches, joined by two core switches.
      Rack 1: Kafka Broker #1, Kafka Broker #4, ZooKeeper #1, ZooKeeper #4, Schema Registry #1, Kafka Connect #1, REST Proxy #1.
      Rack 2: Kafka Broker #2, Kafka Broker #5, ZooKeeper #2, ZooKeeper #5, Schema Registry #2, Kafka Connect #2.
      Rack 3: Kafka Broker #3, ZooKeeper #3, Kafka Connect #3.
      Also shown: REST Proxy #2, two load balancers, Control Center #1 and Control Center #2.]
  • 46. Confluent Completes Kafka
      Editions compared: Apache Kafka, Confluent Open Source, Confluent Enterprise
      Feature: Benefit
      • Apache Kafka: High throughput, low latency, high availability, secure distributed streaming platform
      • Kafka Connect API: Advanced API for connecting external sources/destinations into Kafka
      • Kafka Streams API: Simple library that enables streaming application development within the Kafka framework
      • Additional Clients: Supports non-Java clients; C, C++, Python, .NET and several others
      • REST Proxy: Provides universal access to Kafka from any network connected device via HTTP
      • Schema Registry: Central registry for the format of Kafka data – guarantees all data is always consumable
      • Pre-Built Connectors: HDFS, JDBC, Elasticsearch, Amazon S3 and other connectors fully certified and supported by Confluent
      • JMS Client: Support for legacy Java Message Service (JMS) applications consuming and producing directly from Kafka
      • Confluent Control Center: Enables easy connector management, monitoring and alerting for a Kafka cluster
      • Auto Data Balancer: Rebalancing data across cluster to remove bottlenecks
      • Replicator: Multi-datacenter replication simplifies and automates MDC Kafka clusters
      • Support: Enterprise class support to keep your Kafka environment running at top performance (Apache Kafka: Community; Confluent Open Source: Community; Confluent Enterprise: 24x7x365)
  • 47. Big Data and Fast Data Ecosystems
      [Diagram: Apache Kafka as the stream data platform across three latency tiers: Synchronous Req/Response (0 – 100s ms), Near Real Time (> 100s ms) and Offline Batch (> 1 hour). Connected systems: Search, RDBMS, Apps, Monitoring, Real-time Analytics, NoSQL, Stream Processing, and an Apache Hadoop data lake (Impala, DWH, Hive, Spark, Map-Reduce) via the Confluent HDFS Connector (exactly once semantics).]
      https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
  • 48. Building a Microservices Ecosystem with Kafka Streams and KSQL
      https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
      https://github.com/confluentinc/kafka-streams-examples/tree/3.3.0-post/src/main/java/io/confluent/examples/streams/microservices
  • 49. Microservices: References
      Blog posts series:
      Part 1: The Data Dichotomy: Rethinking the Way We Treat Data and Services
      https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
      Part 2: Build Services on a Backbone of Events
      https://www.confluent.io/blog/build-services-backbone-events/
      Part 3: Using Apache Kafka as a Scalable, Event-Driven Backbone for Service Architectures
      https://www.confluent.io/blog/apache-kafka-for-service-architectures/
      Part 4: Chain Services with Exactly Once Guarantees
      https://www.confluent.io/blog/chain-services-exactly-guarantees/
      Part 5: Messaging as the Single Source of Truth
      https://www.confluent.io/blog/messaging-single-source-truth/
      Part 6: Leveraging the Power of a Database Unbundled
      https://www.confluent.io/blog/leveraging-power-database-unbundled/
      Part 7: Building a Microservices Ecosystem with Kafka Streams and KSQL
      https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
      Whitepaper: Microservices in the Apache Kafka™ Ecosystem
      https://www.confluent.io/resources/microservices-in-the-apache-kafka-ecosystem/
  • 50. Apache Kafka Security
      Security
      • Processes customer data
      • Regulatory requirements
      • Legal compliance
      • Internal security policies
      • Need is not limited to industries such as finance, healthcare, or governmental services
      Helps meet security requirements by supporting:
      Authentication
      • Scenario example: “Only certain applications may talk to the production Kafka cluster”
      • Client authentication via SASL – e.g. Kerberos, Active Directory
      Authorization
      • Scenario example: “Only certain applications may read data from sensitive Kafka topics”
      • Restrict who can create, write to, read from topics, and more
      Encryption
      • Scenario example: “Data-in-transit between apps and Kafka clusters must be encrypted”
      • SSL supported
      • Encrypts data exchanged between Kafka brokers, and between Kafka brokers and Kafka clients/apps
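On the client side, the authentication and encryption options above are usually switched on through configuration properties. The sketch below builds such a property set in plain Java for a client that authenticates with Kerberos over SASL and encrypts traffic with SSL. The property keys are standard Kafka client settings; the class `SecureClientConfigSketch`, the broker address, the truststore path and the password are placeholder assumptions, and the Kerberos credential (JAAS) setup a real deployment also needs is left out.

```java
import java.util.Properties;

public class SecureClientConfigSketch {
    /** Builds client properties for SASL authentication over an SSL-encrypted connection. */
    public static Properties build() {
        Properties props = new Properties();
        // Hypothetical broker address; replace with a real listener.
        props.put("bootstrap.servers", "broker1.example.com:9093");
        // Authenticate with SASL and encrypt data in transit with SSL.
        props.put("security.protocol", "SASL_SSL");
        // GSSAPI is the SASL mechanism used for Kerberos.
        props.put("sasl.mechanism", "GSSAPI");
        props.put("sasl.kerberos.service.name", "kafka");
        // Truststore used to verify the brokers' certificates (placeholder values).
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Authorization is enforced on the broker side, so it has no counterpart in this client configuration.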
  • 51. Enterprise Ready Multi-Datacenter Replication for Kafka
      [Diagram: two data centers with the same layout.
      Data Center in USA: Kafka Cluster (USA) with Kafka Broker 1–3 and ZooKeeper 1–3, Control Center, and a Kafka Connect cluster running Replicator 1 and Replicator 2.
      Data Center in EMEA: Kafka Cluster (EU) with Kafka Broker 1–3 and ZooKeeper 1–3, Control Center, and a Kafka Connect cluster running Replicator 1 and Replicator 2.
      Legend: “Available only with Confluent Enterprise” and “Apache Kafka and Confluent Open Source”.]
  • 52. Cloud Synchronization and Migrations with Confluent Enterprise: Before
      [Diagram: DC1 and AWS, with boxes labelled DB1, DB2, DB3, KV, KV2, KV3, DWH (twice), App1, App2, App3, App4, App5, App7, App8, App1-v2 and App2-v2.]
      Challenges
      • Each team/department must execute their own cloud migration
      • May be moving the same data multiple times
      • Each box represented here requires development, testing, deployment, monitoring and maintenance
  • 53. Cloud Synchronization and Migrations with Confluent Enterprise: After
      [Diagram: DC1 and AWS, each with a Kafka box, with boxes labelled DB1, DB2, DB3, KV, KV2, KV3, DWH (twice), App1, App2, App3, App4, App5, App7, App8, App1-v2 and App2-v2.]
      Benefits
      • Continuous low-latency synchronization
      • Centralized manageability and monitoring
        – Track at event level data produced in all data centers
      • Security and governance
        – Track and control where data comes from and who is accessing it
      • Cost Savings
        – Move Data Once
  • 54. About Confluent and Apache Kafka™
      • 70% of active Kafka Committers
      • Founded September 2014
      • Technology developed while at LinkedIn
      • Founded by the creators of Apache Kafka
  • 55. Apache Kafka: PMC members and committers https://kafka.apache.org/committers
      [Image: committer photos, several labelled “PMC”.]
  • 56. Download Confluent Platform: the easiest way to get started https://www.confluent.io/download/
  • 57. Books: get all three in PDF format from the Confluent website! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  • 58. Discount code: kacom17 Presented by https://kafka-summit.org/