Low-latency ingestion and analytics with
Apache Kafka and Apache Apex
Thomas Weise, Architect DataTorrent, PPMC member Apache Apex
March 28th 2016
Apache Apex Features
• In-memory Stream Processing
• Scale out, Distributed, Parallel, High Throughput
• Windowing (temporal boundary)
• Reliability, Fault Tolerance
• Operability
• YARN native
• Compute Locality
• Dynamic updates
Apex Platform Overview
Apache Apex Malhar Library
Apache Kafka
“A high-throughput distributed messaging system.”
“Fast, Scalable, Durable, Distributed”
Kafka is a natural fit to deliver events
into Apex for low-latency processing.
Kafka Integration - Consumer
• Low-latency, high throughput ingest
• Scales with Kafka topics
ᵒ Auto-partitioning
ᵒ Flexible and customizable partition mapping
• Fault-tolerance (in 0.8 based on SimpleConsumer)
ᵒ Metadata monitoring/failover to new broker
ᵒ Offset checkpointing
ᵒ Idempotency
ᵒ External offset storage
• Support for multiple clusters
ᵒ Built for better resource utilization
• Bandwidth control
ᵒ Bytes per second
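The partition mapping mentioned above can be illustrated with a small plain-Java sketch. It is not the Malhar operator's code: the class `PartitionMapping` and its method names are hypothetical, and round-robin is just one possible way to map several Kafka partitions onto fewer operator instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apex or Malhar: maps Kafka partition ids
// to consumer operator instances.
class PartitionMapping {
    // One-to-many: spread Kafka partitions round-robin over a fixed number of instances.
    static List<List<Integer>> oneToMany(int kafkaPartitions, int instances) {
        if (kafkaPartitions <= 0 || instances <= 0) {
            throw new IllegalArgumentException("partition and instance counts must be positive");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < kafkaPartitions; p++) {
            assignment.get(p % instances).add(p);
        }
        return assignment;
    }

    // One-to-one: every Kafka partition gets its own operator instance.
    static List<List<Integer>> oneToOne(int kafkaPartitions) {
        return oneToMany(kafkaPartitions, kafkaPartitions);
    }
}
```

With 5 Kafka partitions and 2 instances, the first instance would read partitions 0, 2 and 4 and the second would read 1 and 3. A custom mapping would replace the round-robin rule with its own logic.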
Kafka Integration - Producer
• Output operator is a Kafka producer
• Exactly once strategy
ᵒ On failure data already sent to message queue should not be re-sent
ᵒ Sends a key along with data that is monotonically increasing
ᵒ On recovery operator asks the message queue for the last sent message
• Gets the recovery key from the message
ᵒ Ignores all replayed data with key that is less than or equal to the recovered key
ᵒ If the key is not monotonically increasing then data can be sorted on the key at the
end of the window and sent to message queue
• Implemented in operator AbstractExactlyOnceKafkaOutputOperator in the
apache/incubator-apex-malhar GitHub repository
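The exactly-once strategy above can be sketched in plain Java. This is an illustration only and does not reproduce AbstractExactlyOnceKafkaOutputOperator: the class `ExactlyOnceSketch`, the `Message` type and the in-memory list standing in for the message queue are all assumptions made for the example.

```java
import java.util.List;

// Hypothetical sketch: an append-only list stands in for the message queue.
class ExactlyOnceSketch {
    static final class Message {
        final long key;        // monotonically increasing key sent along with the data
        final String payload;

        Message(long key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    // Recovery step: ask the queue for the key of the last message it holds (-1 if empty).
    static long recoverLastKey(List<Message> queue) {
        return queue.isEmpty() ? -1L : queue.get(queue.size() - 1).key;
    }

    // Replay step: skip every message whose key is less than or equal to the
    // recovered key, send the rest. Returns the number of messages actually sent.
    static int replay(List<Message> queue, List<Message> replayed) {
        long recoveredKey = recoverLastKey(queue);
        int sent = 0;
        for (Message m : replayed) {
            if (m.key > recoveredKey) {
                queue.add(m);
                sent++;
            }
        }
        return sent;
    }
}
```

If the queue already holds keys 1 to 3 and the operator replays keys 2 to 5 after a failure, only keys 4 and 5 are sent, so the queue ends up with each key once.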
Apex Application Specification
Logical and Physical Plan
Partitioning
NxM Partitions, Unifier
[Figure: Logical DAG with operators 0, 1, 2, 3. Logical diagram and physical diagram with operator 1 with 3 partitions, merged by a unifier. Physical DAG with (1a, 1b, 1c) and (2a, 2b): no bottleneck. Physical DAG with (1a, 1b, 1c) and (2a, 2b): bottleneck on intermediate unifier.]
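The partition and unifier idea from this slide can be sketched as a word count in plain Java. Nothing here is Apex API: the class `PartitionUnifierSketch` and its methods are hypothetical, and routing by hash code is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: N partitions each count a share of the stream,
// then a unifier merges the partial results.
class PartitionUnifierSketch {
    // Route a tuple to one of n partitions by its hash code.
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    // Each partition counts only the words routed to it.
    static List<Map<String, Integer>> countInPartitions(List<String> words, int n) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partials.add(new HashMap<>());
        }
        for (String w : words) {
            partials.get(partitionFor(w, n)).merge(w, 1, Integer::sum);
        }
        return partials;
    }

    // Unifier: merge the partial counts of all partitions into one result.
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }
}
```

The unified result is the same regardless of the number of partitions. The bottleneck shown in the last diagram arises when a single unifier has to merge all upstream partitions before the stream is split again.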
Advanced Partitioning
[Figure: Physical DAG with operators 0, 1a, 1b, 2, 3, 4 and a unifier. Physical DAG with Parallel Partition: 1a, 2a, 3a and 1b, 2b, 3b between operators 0 and 4, with a unifier.]
Cascading Unifiers
[Figure: Logical Plan with upstream operator uopr and downstream operator dopr. Execution Plan, for N = 4; M = 1 (uopr1 to uopr4, unifier, dopr, containers and NICs). Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers.]
Dynamic Scaling
➢ Partitioning change while application is running
• Change number of partitions at runtime based on stats
• Determine initial number of partitions dynamically
– Kafka operators scale according to number of Kafka partitions
• Supports re-distribution of state when number of partitions change
• API for custom scaling or partitioning
[Figure: physical plans of operators 1, 2 and 3 before and after the number of partitions changes. Unifiers not shown.]
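Re-distribution of state when the number of partitions changes can be sketched as follows. This is a plain-Java illustration under assumptions, not the Apex partitioning API: the class `RepartitionSketch` is hypothetical, state is modeled as per-key counters, and keys are assigned to partitions by hash code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keyed state held by the old partitions is re-hashed
// onto a new number of partitions.
class RepartitionSketch {
    static List<Map<String, Long>> repartition(List<Map<String, Long>> oldPartitions, int newCount) {
        if (newCount <= 0) {
            throw new IllegalArgumentException("newCount must be positive");
        }
        List<Map<String, Long>> result = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            result.add(new HashMap<>());
        }
        for (Map<String, Long> old : oldPartitions) {
            for (Map.Entry<String, Long> e : old.entrySet()) {
                // The same rule that routes tuples decides where the state goes.
                int target = Math.floorMod(e.getKey().hashCode(), newCount);
                result.get(target).merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return result;
    }
}
```

Because state and tuples follow the same routing rule, each key's counter ends up in the partition that will receive that key after the change, and no state is lost or duplicated.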
Fault Tolerance
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
Streaming Windows
➢ Application window
➢ Sliding window and tumbling window
➢ Checkpoint window
➢ No artificial latency
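The difference between tumbling and sliding windows can be shown with a little index arithmetic. The sketch below assumes that an application window is made of a fixed count of consecutive streaming windows numbered from 0; that numbering and the class `WindowSketch` are assumptions for illustration, not the engine's implementation.

```java
// Hypothetical sketch of window arithmetic over streaming window ids 0, 1, 2, ...
class WindowSketch {
    // Tumbling: each streaming window belongs to exactly one application window
    // of `count` streaming windows. Returns the index of that application window.
    static long tumblingIndex(long streamingWindowId, int count) {
        return streamingWindowId / count;
    }

    // True when this streaming window is the last one of its tumbling application window.
    static boolean closesApplicationWindow(long streamingWindowId, int count) {
        return (streamingWindowId + 1) % count == 0;
    }

    // Sliding: the aggregate emitted at this streaming window covers the last
    // `count` streaming windows. Returns the first streaming window id it covers.
    static long slidingStart(long streamingWindowId, int count) {
        return Math.max(0L, streamingWindowId - count + 1);
    }
}
```

With a count of 4, streaming window 7 closes tumbling application window 1, while a sliding window ending at 7 covers streaming windows 4 to 7 and moves forward by one streaming window at a time.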
Checkpointing Operator State
• Save state of operator so that it can be recovered on failure
• Pluggable storage handler
• Default implementation
ᵒ Serialization with Kryo
ᵒ All non-transient fields serialized
ᵒ Serialized state written to HDFS
ᵒ Writes asynchronous, non-blocking
• Possible to implement custom handlers for alternative approach to
extract state or different storage backend (such as IMDG)
• Checkpointing is needed for operators that rely on previous state for computation
ᵒ Operators that do not can be marked @Stateless to skip checkpointing
• Checkpoint frequency tunable (by default 30s)
ᵒ Based on streaming windows for consistent state
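The rule that all non-transient fields are serialized can be demonstrated with a self-contained sketch. Note the substitutions: built-in Java serialization is used here in place of Kryo, and a byte array stands in for HDFS. The class `CheckpointSketch` and the operator inside it are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch: Java serialization stands in for Kryo, a byte array for HDFS.
class CheckpointSketch {
    static class CountingOperator implements Serializable {
        private static final long serialVersionUID = 1L;
        long count;                                             // checkpointed state
        transient StringBuilder scratch = new StringBuilder();  // transient: not checkpointed

        void process(String tuple) {
            count++;
            scratch.append(tuple);
        }
    }

    // Save the operator state so it can be recovered after a failure.
    static byte[] checkpoint(CountingOperator op) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(op);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restart the operator from the checkpointed state.
    static CountingOperator restore(byte[] state) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(state))) {
            return (CountingOperator) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After a restore, `count` has its checkpointed value while the transient `scratch` field is not recovered, so an operator would have to rebuild such fields itself when it is set up again.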
Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
ᵒ No messages lost
ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
ᵒ Transactions with meta information, Rewinding output, Feedback from
external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
ᵒ Useful in use cases where some data loss is acceptable and latest data is
sufficient
Exactly-once
ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to
achieve end-to-end exactly once behavior
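The combination of at-least-once replay with "transactions with meta information" can be sketched as a store that records the last committed window id alongside the data. This is a plain-Java illustration: the class `TransactionalStoreSketch` is hypothetical, and in a real store the rows and the window id would have to be committed in one transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the store keeps the id of the last committed window
// next to the data, so replayed windows can be recognized and skipped.
class TransactionalStoreSketch {
    private final List<String> rows = new ArrayList<>();
    private long committedWindowId = -1L;

    // Writes the rows of one window together with its window id.
    // Returns false when the window was already stored (a replay after recovery).
    boolean commitWindow(long windowId, List<String> windowRows) {
        if (windowId <= committedWindowId) {
            return false;
        }
        rows.addAll(windowRows);
        committedWindowId = windowId;
        return true;
    }

    List<String> rows() {
        return rows;
    }

    long committedWindowId() {
        return committedWindowId;
    }
}
```

The engine may deliver a window more than once after recovery, but the store accepts each window only once, which gives the exactly-once effect at the output.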
Idempotency with Kafka Consumer
Use Case – Ad Tech
Customer:
• Leading digital automation software company for publishers
• Helps publishers monetize their digital assets
• Enables publishers to make smarter inventory decisions and improve revenue
Features:
• Reporting of critical metrics from auctions and client logs
• Revenue, impression, and click information
• Aggregate counters and reporting on top N metrics
• Low latency querying using pub-sub model
Use Case – Ad Tech
[Figure: ad tech pipeline. Components: User Browser, AdServer, CDN (caching of logs), REST proxies, Kafka Cluster, Kafka Input (Auction logs), Kafka Input (Client logs), ETL, Filter, Dimensions Aggregator, Dimensions Store, Query, Query Result, Middleware. Labels: Auction Logs, Client logs, Kafka Messages, Decompress & Flatten, Filtered Events, Aggregates, Query from MW, Query Results.]
Use Case – Ad Tech
Use Case – Ad Tech
• 15+ billion impressions per day
• Average data inflow of 200K events/sec
• 64 Kafka Input operators reading from 6 geographically distributed DCs
• 32 instances of in-memory distributed store
• 64 aggregators
• ~150 container processes, 30+ nodes
• 1.2 TB memory footprint @ peak load
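As a plausibility check on these figures, the daily and per-second rates can be converted into each other. The sketch below is only arithmetic on the numbers from the slide; the class `AdTechNumbers` is hypothetical, 1 TB is taken as 1000 GB, and impressions and events are not the same unit, so the comparison is approximate.

```java
// Hypothetical helper for back-of-the-envelope checks on the slide's figures.
class AdTechNumbers {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    // Average rate per second for a daily total (integer division, rounds down).
    static long perSecond(long perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    // Daily total for an average rate per second.
    static long perDay(long perSecond) {
        return perSecond * SECONDS_PER_DAY;
    }

    // Average memory per container in GB, taking 1 TB as 1000 GB.
    static double gbPerContainer(double totalTb, int containers) {
        return totalTb * 1000.0 / containers;
    }
}
```

15 billion impressions per day average out to about 173,611 per second, and 200K events per second add up to 17.28 billion events per day, so the two figures are of the same order. 1.2 TB spread over roughly 150 containers comes to about 8 GB per container on average.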
Resources
• Exactly-once processing: https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
• Examples with Kafka and Files: https://github.com/tweise/apex-samples/tree/master/exactly-once
• Learn more: http://apex.incubator.apache.org/docs.html
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Apex website - http://apex.incubator.apache.org/
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - http://www.meetup.com/topics/apache-apex
Q&A

Editor's Notes

  • #3:
Partitioning & Scaling built-in
• Operators can be dynamically scaled
ᵒ Throughput, latency or any custom logic
• Streams can be split in flexible ways
ᵒ Tuple hashcode, tuple field or custom logic
• Parallel partitioning for parallel pipelines
• MxN partitioning for generic pipelines
• Unifier concept for merging results from partitions
ᵒ Helps in handling skew imbalance
Advanced Windowing support
• Application window configurable per operator
• Sliding window and tumbling window support
• Checkpoint window control for fault recovery
• Windowing does not introduce artificial latency
Stateful fault tolerance out of the box
• Operators recover automatically from a precise point before failure
ᵒ At least once
ᵒ At most once
ᵒ Exactly once at window boundaries