PostgreSQL + Kafka: The Delight of Change Data Capture, by Jeff Klukas
PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
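To make the demo concrete, here is a minimal sketch of what a downstream consumer of such a CDC pipeline could look like, using the kafka-python client. The topic name, the JSON payload shape, and the broker address are illustrative assumptions; Bottled Water's actual output is Avro-encoded, so treat this as a sketch of the idea rather than its real format.

    # Hypothetical CDC consumer: prints change events as they arrive.
    # Assumptions: one Kafka topic per table ("users"), JSON-encoded values
    # with "op" and "row" fields, and a local broker at localhost:9092.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "users",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    )

    for message in consumer:
        if message.value is None:
            # A null value (tombstone) is a common way to represent a DELETE.
            print("DELETE", message.key)
        else:
            print(message.value.get("op"), message.value.get("row"))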
Apache Kafka is a distributed publish-subscribe messaging system that can handle high volumes of data and enable messages to be passed from one endpoint to another. It uses a distributed commit log that allows messages to be persisted on disk for durability. Kafka is fast, scalable, fault-tolerant, and guarantees zero data loss. It is used by companies like LinkedIn, Twitter, and Netflix to handle high volumes of real-time data and streaming workloads.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Real time Messages at Scale with Apache Kafka and Couchbase, by Will Gardella
Kafka is a scalable, distributed publish subscribe messaging system that's used as a data transmission backbone in many data intensive digital businesses. Couchbase Server is a scalable, flexible document database that's fast, agile, and elastic. Because they both appeal to the same type of customers, Couchbase and Kafka are often used together.
This presentation from a meetup in Mountain View describes Kafka's design and why people use it, Couchbase Server and its uses, and the use cases for both together. Also covered is a description and demo of Couchbase Server writing documents to a Kafka topic and consuming messages from a Kafka topic using the Couchbase Kafka Connector.
This presentation provides an introduction to Apache Kafka and describes best practices for working with fast data streams in Kafka and MapR Streams.
The code examples used during this talk are available at github.com/iandow/design-patterns-for-fast-data.
Author:
Ian Downard
Presented at the Portland Java User Group on Tuesday, October 18 2016.
Introduction to Apache Kafka and why it matters - Madrid, by Paolo Castagna
This document provides an introduction to Apache Kafka and discusses why it is an important distributed streaming platform. It outlines how Kafka can be used to handle streaming data flows in a reliable and scalable way. It also describes the various Apache Kafka APIs including Kafka Connect, Streams API, and KSQL that allow organizations to integrate Kafka with other systems and build stream processing applications.
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform (Kafka Core + Kafka Connect + Kafka Streams) for building streaming data pipelines and streaming data applications.
This talk, which I gave at the Chicago Java Users Group (CJUG) on June 8th, 2017, focuses mainly on Kafka Streams, a lightweight open source Java library for building stream processing applications on top of Kafka, using Kafka topics as input/output.
You will learn more about the following:
1. Apache Kafka: a Streaming Data Platform
2. Overview of Kafka Streams: Before Kafka Streams? What is Kafka Streams? Why Kafka Streams? What are Kafka Streams key concepts? Kafka Streams APIs and code examples?
3. Writing, deploying and running your first Kafka Streams application
4. Code and Demo of an end-to-end Kafka-based Streaming Data Application
5. Where to go from here?
This document provides an introduction to Apache Kafka. It describes Kafka as a distributed messaging system with features like durability, scalability, publish-subscribe capabilities, and ordering. It discusses key Kafka concepts like producers, consumers, topics, partitions and brokers. It also summarizes use cases for Kafka and how to implement producers and consumers in code. Finally, it briefly outlines related tools like Kafka Connect and Kafka Streams that build upon the Kafka platform.
This document provides guidance on scaling Apache Kafka clusters and tuning performance. It discusses expanding Kafka clusters horizontally across inexpensive servers for increased throughput and CPU utilization. Key aspects that impact performance like disk layout, OS tuning, Java settings, broker and topic monitoring, client tuning, and anticipating problems are covered. Application performance can be improved through configuration of batch size, compression, and request handling, while consumer performance relies on partitioning, fetch settings, and avoiding perpetual rebalances.
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17, by Gwen (Chen) Shapira
This document discusses disaster recovery strategies for Apache Kafka clusters running across multiple data centers. It outlines several failure scenarios like an entire data center being demolished and recommends solutions like running a single Kafka cluster across multiple near-by data centers. It then describes a "stretch cluster" approach using 3 data centers with replication between them to provide high availability. The document also discusses active-active replication between two data center clusters and challenges around consumer offsets not being identical across data centers during a failover. It recommends approaches like tracking timestamps and failing over consumers based on time.
Kafka 0.8.0 Presentation to Atlanta Java User's Group, March 2013, by Christopher Curtin
Chris Curtin gave a presentation on Apache Kafka at the Atlanta Java Users Group. He discussed his background in technology and current role at Silverpop. He then provided an overview of Apache Kafka, describing its core functionality as a distributed publish-subscribe messaging system. Finally, he demonstrated how producers and consumers interact with Kafka and highlighted some use cases and performance figures from LinkedIn's deployment of Kafka.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This is the first part of the presentation.
Here is the 2nd part of this presentation:-
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
How Apache Kafka is transforming Hadoop, Spark and Storm, by Edureka!
This document provides an overview of Apache Kafka and how it is transforming Hadoop, Spark, and Storm. It begins with explaining why Kafka is needed, then defines what Kafka is and describes its architecture. Key components of Kafka like topics, producers, consumers and brokers are explained. The document also shows how Kafka can be used with Hadoop, Spark, and Storm for stream processing. It lists some companies that use Kafka and concludes by advertising an Edureka course on Apache Kafka.
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka, by Guozhang Wang
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.
Apache Kafka is a distributed publish-subscribe messaging system that was originally created by LinkedIn and contributed to the Apache Software Foundation. It is written in Scala and provides a multi-language API to publish and consume streams of records. Kafka is useful for both log aggregation and real-time messaging due to its high performance, scalability, and ability to serve as both a distributed messaging system and log storage system with a single unified architecture. To use Kafka, one runs Zookeeper for coordination, Kafka brokers to form a cluster, and then publishes and consumes messages with a producer API and consumer API.
The document discusses designing robust data architectures for decision making. It advocates for building architectures that can easily add new data sources, improve and expand analytics, standardize metadata and storage for easy data access, and discover and recover from mistakes. The key aspects discussed are using Kafka as a data bus to decouple pipelines, retaining all data for recovery and experimentation, treating the filesystem as a database by storing intermediate data, leveraging Spark and Spark Streaming for batch and stream processing, and maintaining schemas for integration and evolution of the system.
Reducing Microservice Complexity with Kafka and Reactive Streams, by jimriecken
My talk from ScalaDays 2016 in New York on May 11, 2016:
Transitioning from a monolithic application to a set of microservices can help increase performance and scalability, but it can also drastically increase complexity. Layers of inter-service network calls add latency and an increasing risk of failure where previously only local function calls existed. In this talk, I'll speak about how to tame this complexity using Apache Kafka and Reactive Streams to:
- Extract non-critical processing from the critical path of your application to reduce request latency
- Provide back-pressure to handle both slow and fast producers/consumers
- Maintain high availability, high performance, and reliable messaging
- Evolve message payloads while maintaining backwards and forwards compatibility.
Apache Kafka is a distributed streaming platform. It provides a high-throughput distributed messaging system that can handle trillions of events daily. Many large companies use Kafka for application logging, metrics collection, and powering real-time analytics. The current version is 0.8.2 and upcoming versions will include a new consumer, security features, and support for transactions.
Apache Kafka is a high-throughput distributed messaging system that can be used for building real-time data pipelines and streaming apps. It provides a publish-subscribe messaging model and is designed as a distributed commit log. Kafka allows for both push and pull models where producers push data and consumers pull data from topics which are divided into partitions to allow for parallelism.
Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream..., by Erik Onnen
The document discusses Urban Airship's use of Apache Kafka for processing continuous data streams. It describes how Urban Airship uses Kafka for analytics, operational data, and presence data. Producers write device data to Kafka topics, and consumers create indexes from the data in databases like HBase and write to operational data warehouses. The document also covers Kafka concepts, best use cases, limitations, and examples of data structures for storing device metadata in Kafka streams.
This document provides an overview of Kafka and Spark Streaming. It discusses key concepts of Kafka like brokers, topics, producers and consumers. It also explains how Spark Streaming works using micro-batches and time window APIs. Finally, it discusses reliability aspects and references for further reading.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Building Event-Driven Systems with Apache Kafka, by Brian Ritchie
Event-driven systems provide simplified integration, easy notifications, inherent scalability and improved fault tolerance. In this session we'll cover the basics of building event-driven systems and then dive into utilizing Apache Kafka for the infrastructure. Kafka is a fast, scalable, fault-tolerant publish/subscribe messaging system developed by LinkedIn. We will cover the architecture of Kafka and demonstrate code that utilizes this infrastructure including C#, Spark, ELK and more.
Sample code: https://github.com/dotnetpowered/StreamProcessingSample
Apache Kafka 0.8 basic training - Verisign, by Michael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
A la rencontre de Kafka, le log distribué, par Florian GARCIA (La Cuisine du Web)
Kafka is something of a new star on the message queue scene. Yet Kafka doesn't present itself that way: it is a distributed log!
So what is it? How does it work? And above all, how and why would I use it?
In this session, we take the beast apart and explain it all! On the agenda: concepts, use cases, streaming, and lessons learned from the field!
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We will also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, we've added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control over who can read and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Apache Kafka with Spark Streaming: Real-time Analytics Redefined, by Edureka!
This document provides an overview of Apache Kafka and how it can be used for real-time analytics with Spark Streaming. It begins with an agenda that outlines what will be covered, including what Kafka is, why it is needed, its components, how it works, examples of companies using it, and a hands-on demonstration of integrating Kafka with Spark. The document then discusses why Kafka was developed, how it works, its performance capabilities, and how it can be used with Spark Streaming for real-time analytics by ingesting data, performing analysis, and displaying or storing results.
In this Kafka Tutorial, we will discuss Kafka Architecture. In this Kafka Architecture article, we will see the APIs in Kafka. Moreover, we will learn about Kafka Broker, Kafka Consumer, Zookeeper, and Kafka Producer. Also, we will see some fundamental concepts of Kafka.
This document discusses new age distributed messaging using Apache Kafka. It begins with an introduction to Kafka concepts like topics, partitions, producers and consumers. It then explains how Kafka uses commit log architecture and an append-only log structure to provide high throughput performance. The document also covers how Zookeeper is used to coordinate Kafka brokers and keep metadata. It evaluates Kafka's performance based on LinkedIn benchmarks, finding that its lack of acknowledgements, batching and storage format allow for very fast publishing and consumption of messages. In conclusion, the document suggests Kafka could be introduced in some parts of Responsys' architecture to handle big data workloads.
Kafka can be used to build real-time streaming applications and process large amounts of data. It provides a simple publish-subscribe messaging model with streams of records. Kafka Connect allows connecting Kafka with other data systems and formats through reusable connectors. Kafka Streams provides a streaming library to allow building streaming applications and processing data in Kafka streams through operators like map, filter and windowing.
This document provides an introduction to Apache Kafka. It discusses why Kafka is needed for real-time streaming data processing and real-time analytics. It also outlines some of Kafka's key features like scalability, reliability, replication, and fault tolerance. The document summarizes common use cases for Kafka and examples of large companies that use it. Finally, it describes Kafka's core architecture including topics, partitions, producers, consumers, and how it integrates with Zookeeper.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015, by Michael Noll
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We'll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Building streaming data applications using Kafka*[Connect + Core + Streams] b..., by Data Con LA
Abstract: Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications without the need for other tools/clusters for data ingestion, storage and stream processing. In this talk you will learn more about: a quick introduction to Kafka Core, Kafka Connect and Kafka Streams through code examples, key concepts and key features; a reference architecture for building such Kafka-based streaming data applications; and a demo of an end-to-end Kafka-based streaming data application.
This document provides an overview of Apache Kafka. It discusses Kafka's key capabilities including publishing and subscribing to streams of records, storing streams of records durably, and processing streams of records as they occur. It describes Kafka's core components like producers, consumers, brokers, and clustering. It also outlines why Kafka is useful for messaging, storing data, processing streams in real-time, and its high performance capabilities like supporting multiple producers/consumers and disk-based retention.
Beyond the brokers - A tour of the Kafka ecosystem, by Damien Gasparina
Beyond the brokers - A tour of the Kafka ecosystem. Presentation given on 28/03/2019 (Lyon JUG: https://www.meetup.com/Lyon-Java-User-Group-LyonJUG/events/259569434/)
Beyond the Brokers: A Tour of the Kafka Ecosystem, by confluent
This document provides an overview of the Kafka ecosystem. It discusses how Kafka can be used to ingest, process, and analyze massive amounts of structured and unstructured data from various sources in real-time. It describes how Kafka Connect can be used to easily integrate Kafka with other data sources and sinks, how clients in various languages can communicate with Kafka, and how stream processing tools like Kafka Streams and KSQL can be used to analyze streaming data using simple SQL-like queries. It also discusses how the Kafka ecosystem addresses challenges like schema management, deployment on Kubernetes, and more.
Building Streaming Data Applications Using Apache Kafka, by Slim Baltagi
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications without the need for other tools/clusters for data ingestion, storage and stream processing.
In this talk you will learn more about:
1. A quick introduction to Kafka Core, Kafka Connect and Kafka Streams: What is and why?
2. Code and step-by-step instructions to build an end-to-end streaming data application using Apache Kafka
What is Apache Kafka and What is an Event Streaming Platform? by confluent
Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular, new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is and some of the use cases and design patterns around its use—including several examples of where it is solving real business problems. New developments in this area such as KSQL will also be discussed.
This document provides an overview of Apache Kafka, including its history, architecture, key concepts, use cases, and demonstrations. Kafka is a distributed streaming platform designed for high throughput and scalability. It can be used for messaging, logging, and stream processing. The document outlines Kafka's origins at LinkedIn, its differences from traditional messaging systems, and key terms like topics, producers, consumers, brokers, and partitions. It also demonstrates how Kafka handles leadership and replication across brokers.
Beyond the brokers - Un tour de l'écosystème Kafka, by Florent Ramiere
Apache Kafka is not just the brokers; a whole open-source ecosystem revolves around it. This talk introduces the main components, such as Kafka Streams, KSQL, Kafka Connect, REST Proxy, Schema Registry, MirrorMaker, etc.
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
This document provides an overview of Apache Kafka fundamentals. It discusses key Kafka concepts like producers, brokers, consumers, topics, partitions, and how they interact in the Kafka architecture. The session also covers how Kafka uses ZooKeeper for coordination and configuration, and how producers, consumers, and brokers work to provide a scalable streaming platform.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
Here are the slides for Greenplum Chat #8. You can view the replay here: https://www.youtube.com/watch?v=FKFiyJDgdQk
The increased frequency and sophistication of high-profile data breaches and malicious hacking is putting organizations at continued risk of data theft and significant business disruption. Complicating this scenario is the unbounded growth of Big Data and petabyte-scale data storage, new open source database and distribution schemes, and the continued adoption of cloud services by enterprises.
Pivotal Greenplum customers often look for additional encryption of data-at-rest and data-in-motion. The massively parallel processing (MPP) architecture of Pivotal Greenplum provides an architecture that is unlike traditional OLAP on RDBMS for data warehousing, and encryption capabilities must address the scale-out architecture.
The Zettaset Big Data Encryption Suite has been designed for optimal performance and scalability in distributed Big Data systems like Greenplum Database and Apache HAWQ.
Here is a replay of our recent Greenplum Chat with Zettaset:
00:59 What is Greenplum’s approach for encryption and why Zettaset?
02:17 Results of field testing Zettaset with Greenplum
03:50 Introduction to Zettaset, the security company
05:36 Overview of Zettaset and their solutions
14:51 Different layers for encrypting data at rest
16:50 Encryption key management for big data
20:51 Zettaset BD Encrypt for data at rest and data in motion
22:19 How to mitigate encryption overhead with an MPP scale-out system
24:12 How to deploy BD Encrypt
25:50 Deep dive on data at rest encryption
30:44 Deep dive on data in motion encryption
36:72 Q: How does Zettaset deal with encrypting Greenplum's multiple interfaces?
38:08 Q: Can I encrypt data for a particular column?
40:26 How Zettaset fits into a security strategy
41:21 Q: What is the performance impact on queries by encrypting the entire database?
43:28 How Zettaset helps Greenplum meet IT compliance requirements
45:12 Q: How authentication for keys is obtained
48:50 Q: How can Greenplum users try out Zettaset?
50:53 Q: What is a ‘Zettaset Security Coach’?
Here are the details of the new security framework for Apache Geode, based on Apache Shiro. Watch the video at: https://youtu.be/AhUPT3wfAMM
How to use the WAN Gateway feature of Apache Geode to implement multi-site and active-active failover, disaster recovery, and global scale applications.
#GeodeSummit: Easy Ways to Become a Contributor to Apache Geode, by PivotalOpenSourceHub
The document provides steps for becoming a contributor to the Apache Geode project, beginning with joining online conversations about the project, then test-driving it by building and running examples, and finally improving the project by reporting findings, fixing bugs, or adding new features through submitting code. The key steps are to join mailing lists or chat forums to participate in discussions, quickly get started with the project by building and testing examples in 5 minutes, and then test release candidates and report any issues found on the project's issue tracker or documentation pages. Contributions to the codebase are also welcomed by forking the GitHub repository and submitting pull requests with bug fixes or new features.
#GeodeSummit Keynote: Creating the Future of Big Data Through "The Apache Way", by PivotalOpenSourceHub
Keynote at Geode Summit 2016 by Dr. Justin Erenkrantz, Bloomberg LP: Creating the Future of Big Data Through "The Apache Way", and why this matters to the community.
#GeodeSummit: Combining Stream Processing and In-Memory Data Grids for Near-R..., by PivotalOpenSourceHub
This document discusses combining stream processing and in-memory data grids for near-real-time aggregation and notifications. It describes storing immutable event data and filtering and aggregating events in real-time based on requested perspectives. Perspectives can be requested at any time for historical or real-time event data. The solution aims to be scalable, resilient, and low latency using Apache Storm for stream processing, Apache Geode for the event log and storage, and deployment patterns to collocate them for better performance.
In this session we review the design of the newly released off-heap storage feature in Apache Geode, and discuss use cases and potential direction for additional capabilities of this feature.
This document discusses implementing a Redis adaptor using Apache Geode. It provides an overview of Redis data structures and commands, describes how Geode partitioned regions and indexes can be used to store and access Redis data, outlines advantages like scalability and high availability, and presents a roadmap for further development including supporting additional commands and performance optimization.
#GeodeSummit - Integration & Future Direction for Spring Cloud Data Flow & Geode, by PivotalOpenSourceHub
In this session we review the design of the current state of support for Apache Geode by Spring Cloud Data Flow, and explore additional use cases and future direction that Spring Cloud Data Flow and Apache Geode might evolve.
In this session we review the design of the current capabilities of the Spring Data GemFire API that supports Geode, and explore additional use cases and future direction that the Spring API and underlying Geode support might evolve.
#GeodeSummit - Modern manufacturing powered by Spring XD and Geode, by PivotalOpenSourceHub
This document summarizes a presentation about how TEKsystems Global Services helps modern manufacturing industries address challenges through big data solutions. It outlines TEKsystems' services and capabilities, as well as real-world applications for manufacturing, financial services, and life sciences. The presentation describes reference architectures and customer success stories in marine seismic data and gaming industries. It positions TEKsystems as having expertise, proven track records, and packaged offerings to provide big data solutions from pilot to production.
#GeodeSummit - Using Geode as Operational Data Services for Real Time Mobile ..., by PivotalOpenSourceHub
One of the largest retailers in North America is considering Apache Geode for their new mobile loyalty application, to support their digital transformation effort. They would use Geode to provide operational data services for their mobile cloud service. This retailer needs to replace sluggish response times with sub-second responses, which will improve conversion rates. They also want to be able to close the loop between data science findings and app experience, so that the right customer interaction is suggested when it is needed, such as when customers are looking at their mobile app while walking in the store, or by sending notifications at an individual's most likely shopping times. The final benefits of using Geode will include faster development cycles, increased customer loyalty, and higher revenue.
#GeodeSummit - Large Scale Fraud Detection using GemFire Integrated with Gree..., by PivotalOpenSourceHub
In this session we explore a case study of a large-scale government fraud detection program that prevents billions of dollars in fraudulent payments each year leveraging the beta release of the GemFire+Greenplum Connector, which is planned for release in GemFire 9. Topics will include an overview of the system architecture and a review of the new GemFire+Greenplum Connector features that simplify use cases requiring a blend of massively parallel database capabilities and accelerated in-memory data processing.
#GeodeSummit: Democratizing Fast Analytics with Ampool (Powered by Apache Geode), by PivotalOpenSourceHub
Today, if events change the decision model, we wait until the next batch model build for new insights. By extending fast “time-to-decisions” into the world of Big Data Analytics to get fast “time-to-insights”, apps will get what used to be batch insights in near real time. The technology enabling this includes smart in-memory data storage, new storage class memory, and products designed to do one or more parts of an analysis pipeline very well. In this talk we describe how Ampool is building on Apache Geode to allow Big Data analysis solutions to work together with a scalable smart storage class memory layer to allow fast and complex end-to-end pipelines to be built -- closing the loop and providing dramatically lower time to critical insights.
#GeodeSummit: Architecting Data-Driven, Smarter Cloud Native Apps with Real-T..., by PivotalOpenSourceHub
This talk introduces an open-source solution that integrates cloud native apps running on Cloud Foundry with an open-source hybrid transactions + analytics real-time solution. The architecture is based on the fastest scalable, highly available and fully consistent In-Memory Data Grid (Apache Geode / GemFire), natively integrated to the first open-source massive parallel data warehouse (Greenplum Database) in a hybrid transactional and analytical architecture that is extremely fast, horizontally scalable, highly resilient and open source. This session also features a live demo running on Cloud Foundry, showing a real case of real-time closed-loop analytics and machine learning using the featured solution.
Apache Apex and Apache Geode are two of the most promising incubating open source projects. Combined, they promise to fill gaps of existing big data analytics platforms. Apache Apex is an enterprise grade native YARN big data-in-motion platform that unifies stream and batch processing. Apex is highly scalable, performant, fault tolerant, and strong in operability. Apache Geode provides a database-like consistency model, reliable transaction processing and a shared-nothing architecture to maintain very low latency performance with high concurrency processing. We will also look at some use cases showing how these two projects can be used together to form a distributed, fault-tolerant, reliable in-memory data processing layer.
#GeodeSummit - Where Does Geode Fit in Modern System Architectures, by PivotalOpenSourceHub
The document discusses how Apache Geode fits into modern system architectures using the Command Query Responsibility Segregation (CQRS) pattern. CQRS separates reads and writes so that each can be optimized independently. Geode is well-suited as the read store in a CQRS system due to its ability to efficiently handle queries and cache data through regions. The document provides references on CQRS and related patterns to help understand how they can be applied with Geode.
How Southwest Airlines Uses Geode
Distributed systems and fast data require new software patterns and implementation skills. Learn how Southwest Airlines uses Apache Geode, organizes team responsibilities, and approaches design tradeoffs. Drawing inspiration from real whiteboard conversations, we’ll explore: common development pitfalls, environment capacity planning, streaming data patterns like consumer checkpointing, support roles, and production lessons learned.
Every day, Apache Geode improves how Southwest Airlines schedules nearly 4,000 flights and serves over 500,000 passengers. It’s an essential component of Southwest’s ability to reduce flight delays and support future growth.
#GeodeSummit - Wall St. Derivative Risk Solutions Using Geode, by PivotalOpenSourceHub
In this talk, Andre Langevin discusses how Geode forms the core of many Wall Street derivative risk solutions. By externalizing risk from trading systems, Geode-based solutions provide cross-product risk management at speeds suitable for automated hedging, while simultaneously eliminating the back office costs associated with traditional trading system based solutions.
Building Apps with Distributed In-Memory Computing Using Apache Geode, by PivotalOpenSourceHub
Slides from the Meetup on Monday March 7, 2016, just before the beginning of #GeodeSummit, where we cover an introduction to the technology and community that is Apache Geode, the in-memory data grid.
apidays New York 2025 - The Future of Small Business Lending with Open Bankin..., by apidays
The Future of Small Business Lending with Open Banking – Bridging the $750 Billion Funding Gap
Charles Groome, Vice President of Growth Strategy at Biz2Credit
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Boost API Development Velocity with Practical AI Tool..., by apidays
Boost API Development Velocity with Practical AI Tooling
Sumit Amar, VP of Engineering at WEX
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
Report based on the findings of quantitative research conducted by the research agency New Image Marketing Group, commissioned by the NGO Detector Media, and compiled by Marta Naumova, PhD in Sociology.
METHODS OF DATA COLLECTION (Research methodology), by anwesha248
The important methods of data collection, along with their advantages and disadvantages, are presented in this PowerPoint presentation.
Published by Prateeti Anwesha Bordoloi, Research Scholar
Mahapurusha Srimanta Sankaradeva Vishwavidyalaya
June 2025
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b..., by apidays
Using GraphQL SDL files as executable API Contracts
Hari Krishnan, Co-founder & CTO at Specmatic
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays Singapore 2025 - What exactly are AI Agents by Aki Ranin (Earthshots ..., by apidays
What exactly are AI Agents?
Aki Ranin, Head of AI at Earthshots Collective | Deep Tech & AI Investor | 2x Founder | Published Author
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs..., by apidays
4 identity factors you didn't know you needed to support large organizations in your SaaS
Daizen Ikehara, Principal Developer Advocate at Auth0
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr), by apidays
CIAM in the wild: What we learned while scaling from 1.5 to 3 million users
Michael Gruen, VP of Engineering at Layr
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap..., by apidays
Building Scalable AI Systems: Cloud Architecture for Performance
Sai Prasad Veluru, Software Engineer at Apple Inc
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati..., by apidays
Breaking Barriers: Lessons Learned from API Integration with Large Hotel Chains and the Role of Standardization
Constantine Nikolaou, Manager Business Solutions Architect at Booking.com
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda), by apidays
Two tales of API Change Management from my time at Google
Eric Koleda, Developer Advocate at Coda
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...apidays
Enhancing Developer Productivity with UX
Petrine Tang, UX Designer at Government Technology Agency
Faith Ang, Product Manager at Government Technology Agency
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Beyond Webhooks: The Future of Scalable API Event Del...apidays
Beyond Webhooks: The Future of Scalable API Event Delivery
Phil Leggetter, Head of Developer Experience at Hookdeck
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Fast, Repeatable, Secure: Pick 3 with FINOS CCC by Le...apidays
Fast, Repeatable, Secure: Pick 3 with FINOS CCC
Leigh Capili, Kubernetes Contributor at Control Plane
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
4. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log, turned into a stream data platform.
An Optical Illusion
5. We’ll talk about:
• Write-ahead logs
• So what is Kafka?
• Awesome use cases for Kafka
• Data streams and real-time ETL
• Where you can learn more
6. Write-Ahead Logging (WAL)
“a standard method for ensuring data integrity… changes to data files… must be written only after those changes have been logged… in the event of a crash we will be able to recover the database using the log.”
8. WAL is used to:
• Recover a consistent state of the database
• Replicate the database (Streaming Replication, Hot Standby)
If you look far enough back into the archived logs, you can reconstruct the entire database.
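This same WAL is what logical decoding exposes as a change stream, which is the mechanism CDC tools such as Bottled Water build on. Here is a minimal sketch, assuming a local database named `demo`, `wal_level = logical`, and the built-in `test_decoding` output plugin; the slot and table names are made up for illustration:

```python
# Sketch: peek at WAL changes via PostgreSQL logical decoding (requires wal_level = logical).
# The database "demo", the slot name, and the table are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=demo")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot that decodes WAL with the built-in test_decoding plugin.
cur.execute("SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding')")

# Make a change, then read it back out of the WAL as a decoded change record.
cur.execute("CREATE TABLE IF NOT EXISTS widgets (id serial PRIMARY KEY, name text)")
cur.execute("INSERT INTO widgets (name) VALUES ('example')")
cur.execute("SELECT lsn, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL)")
for lsn, data in cur.fetchall():
    print(lsn, data)  # e.g. "table public.widgets: INSERT: id[integer]:1 name[text]:'example'"

# Drop the slot so it does not retain WAL indefinitely.
cur.execute("SELECT pg_drop_replication_slot('demo_slot')")
```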
11. Kafka provides a fast, distributed, highly scalable, highly available publish-subscribe messaging system, based on the tried-and-true log structure.
In turn, this solves part of a much harder problem: communication and integration between components of large software systems.
12. The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster; nodes are called brokers
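To make the push side concrete, here is a minimal producer sketch using the third-party kafka-python client; the broker address and the topic name `clicks` are placeholders, not something from the talk:

```python
# Sketch: pushing messages to a Kafka topic with the kafka-python client.
# "localhost:9092" and the topic "clicks" are illustrative placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialize dicts to JSON bytes
)

# Each send() appends a message to one partition of the "clicks" topic.
producer.send("clicks", {"user": 42, "page": "/pricing"})
producer.flush()  # block until buffered messages are delivered
```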
18. Consumers
(Diagram: a Kafka cluster with one topic split into partitions A, B, and C, each stored as a file, read by consumers in two consumer groups, X and Y.)
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
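A matching consumer sketch, again with kafka-python and the same placeholder broker and topic; running two copies of this script with the same `group_id` would split the topic's partitions between them:

```python
# Sketch: a consumer in group "clickstream-workers" pulling from the "clicks" topic.
# Kafka tracks this group's offset per partition; ordering is only guaranteed within a partition.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="clickstream-workers",   # offsets are stored per consumer group
    auto_offset_reset="earliest",     # start from the beginning if no committed offset exists
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```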
19. Kafka “Magic” – Why is it so fast?
• 250M events per second on one node at 3 ms latency
• Scales to any number of consumers
• Stores data for a set amount of time, without tracking who read what
• Replicates, but with no need to sync to disk
• Zero-copy writes from memory/disk to the network
20. How do people use Kafka?
• As a message bus
• As a buffer for replication systems
• As a reliable feed for event processing
• As a buffer for event processing
• To decouple apps from databases
22. Raise your hand if this sounds familiar
“My next project was to get a working Hadoop setup… Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.”
-- Jay Kreps, Kafka PMC
29. This is where we are trying to get
(Diagram: Kafka decouples data pipelines. Source systems publish through producers to the Kafka brokers; Hadoop, security systems, real-time monitoring, and the data warehouse each consume from Kafka independently.)
30. Important notes:
• Producers and consumers don’t need to know about each other
• Performance issues on consumers don’t impact producers
• Consumers are protected from herds of producers
• Lots of flexibility in handling load
• Messages are available to anyone: lots of new use cases, such as monitoring, audit, and troubleshooting
http://www.slideshare.net/gwenshap/queues-pools-caches
31. My Favorite Use Cases
• Shops consume inventory updates
• Clicking around an online shop? Your clicks go to Kafka and recommendations come back.
• Flagging credit card transactions as fraudulent
• Flagging game interactions as abuse
• Least favorite: surge pricing at Uber
• Huge list of users at kafka.apache.org
33. Remember This?
(The same decoupled-pipeline diagram as slide 29: source systems produce into Kafka, and Hadoop, security systems, real-time monitoring, and the data warehouse consume from it.)
Kafka is smack in the middle of all data pipelines.
34. If data flies into Kafka in real time, why wait 24 hours before pulling it into a DWH?
36. Why does Kafka make real-time ETL better?
• It can integrate with any data source: RDBMS, NoSQL, applications, web applications, logs
• Consumers can be real-time, but they don’t have to be
• Reading from and writing to Kafka is cheap, so it is a great place to store intermediate state
• You can fix mistakes by rereading some of the data: same data, in the same order (see the sketch below)
• Adding more pipelines or aggregations has no impact on source systems = low risk
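The “reread the data” point falls out of Kafka keeping messages for a retention window regardless of who has read them: a consumer can simply rewind its offset. A minimal sketch with kafka-python, reusing the placeholder topic from earlier; `reprocess` is a hypothetical function standing in for whatever step needs to be redone:

```python
# Sketch: reprocessing by rewinding a consumer's offsets (kafka-python).
# The topic "clicks", the group name, and reprocess() are illustrative placeholders.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="reprocessing-job",
    enable_auto_commit=False,          # commit offsets ourselves once reprocessing succeeds
)

partition = TopicPartition("clicks", 0)
consumer.assign([partition])           # manual assignment, so we control the position
consumer.seek_to_beginning(partition)  # or consumer.seek(partition, some_known_offset)

for message in consumer:
    reprocess(message.value)           # hypothetical function that redoes the broken step
```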
37. It is all valuable data
(Diagram: raw data flows into clean data, which is enriched, aggregated, and filtered; the stages feed dashboards, reports, data scientists, and alerts.)
38. OK, but how does my data get into Kafka?
• Producers
• Log4J
• The REST Proxy
• Bottled Water
• Kafka Connect and its connector ecosystem (sketched below)
• Other ecosystem tools
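As one concrete path, Kafka Connect source connectors are configured by POSTing JSON to the Connect worker’s REST API. A hedged sketch, assuming a Connect worker on localhost:8083 with Confluent’s JDBC source connector installed; the connector name, connection URL, table, and topic prefix are placeholders:

```python
# Sketch: registering a JDBC source connector with the Kafka Connect REST API.
# Assumes a Connect worker at localhost:8083 and the Confluent JDBC connector on its plugin path;
# all names, URLs, and column choices below are illustrative placeholders.
import requests

connector = {
    "name": "postgres-widgets-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/demo?user=demo&password=demo",
        "table.whitelist": "widgets",
        "mode": "incrementing",              # poll for new rows by a monotonically increasing column
        "incrementing.column.name": "id",
        "topic.prefix": "pg-",               # rows land on the "pg-widgets" topic
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```

Note that this particular connector polls the table rather than tailing the WAL; log-based CDC connectors plug into the same Connect API.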
41. But wait, how do we process the data?
• However you want: you just consume data, modify it, and produce it back (see the sketch below)
• Built into Kafka: the Processor API and KStream (Kafka Streams)
• Popular choices: Storm, Spark Streaming
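The “consume, modify, produce back” pattern needs nothing beyond a plain client. A minimal sketch with kafka-python, again with placeholder topic names:

```python
# Sketch: the consume-transform-produce loop, with placeholder topics "clicks" and "clicks-enriched".
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="click-enricher",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    event["is_pricing_page"] = event.get("page") == "/pricing"  # the "modify" step
    producer.send("clicks-enriched", event)                     # produce the result back to Kafka
```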
44. Need More Kafka?
• https://kafka.apache.org/documentation.html
• My video tutorial: http://shop.oreilly.com/product/0636920038603.do
• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
• Our website: http://confluent.io
• Oracle guide to real-time ETL: http://www.oracle.com/technetwork/middleware/data-integrator/overview/best-practices-for-realtime-data-wa-132882.pdf
Editor's Notes
#14: Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID, called the offset. The offset uniquely identifies a message within a partition.
#15: Kafka retains all messages for a fixed amount of time; it does not wait for acks from consumers.
The only metadata retained per consumer is its position in the log, the offset, so adding many consumers is cheap.
On the other hand, consumers have more responsibility and are more challenging to implement correctly.
And “batching” consumers are not a problem.
#17: You choose how many replicas must ACK a message before it’s considered committed. This is the tradeoff between speed and reliability.
#18: You choose how many replicas must ACK a message before it’s considered committed. This is the tradeoff between speed and reliability.
#19: Consumers can read from one or more partition leaders. You can’t have two consumers in the same group reading the same partition.
Leaders obviously do more work, but they are balanced between nodes.
We reviewed the basic components of the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
#25: Then we end up adding clients to use that source.
#26: But as we start to deploy our applications, we realize that clients need data from a number of sources. So we add them as needed.
#28: But over time, particularly if we are segmenting services by function, we have stuff all over the place, and the dependencies are a nightmare. This makes for a fragile system.
#30: Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where it is used as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about which clients are reading it. Note that producers push and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
#34: Kafka is a pub/sub messaging system that can decouple your data pipelines. Most of you are probably familiar with its history at LinkedIn, where it is used as a high-throughput, relatively low-latency commit log. It allows sources to push data without worrying about which clients are reading it. Note that producers push and consumers pull. Kafka itself is a cluster of brokers, which handles both persisting data to disk and serving that data to consumer requests.
#44: Sorry, but “Schema on Read” is kind of B.S.
We admit that there is a schema, but we want to “ingest fast”, so we shift the burden to the readers.
But the data is written once and read many many times by many different people. They each need to figure this out on their own? This makes no sense.
Also, how are you going to validate the data without a schema?
#45: There’s no data dictionary for Kafka: https://github.com/schema-repo/schema-repo
#53: There are many options for handling excessive user requests. The only thing that is not an option: throw everything at the database and let the DB queue the excess load.