From Batch to Streaming ET(L) with Apache Apex
Thomas Weise
Apache Apex PMC Chair
thw@apache.org
@thweise @atrato_io
Stream Data Processing with Apache Apex
[Diagram: Data Sources (Mobile Devices, Logs, Sensor Data, Social, Databases, CDC) → Transform / Analytics (Oper1 → Oper2 → Oper3) → Data Delivery & Storage (real-time visualization, storage, etc.). API labels: SQL, Declarative API, DAG API, SAMOA, Beam, Operator Library; SAMOA and Beam appear a second time next to "(roadmap)".]
Use Case
https://www.slideshare.net/ashishtadose1/realtime-adtech-reporting-targeting-with-apache-apex
Phased Transition

Batch pipeline (> 5 hours)
Batch processing with several hours till insight:
● Available data is stale and no longer applies to the current situation
● Current data stuck in batch pipeline
● Complex batch processing orchestration with many different components
● Hours of delay translate to high cost due to the inability to make timely campaign adjustments

Batch ingest + streaming transforms (~ 20 minutes)
Processing with Apex, reuse batch ingestion:
● Existing ingestion mechanism (files in S3, shared with other pipelines)
● Migrate transform logic to Apex
● Enable reporting from application state (“Queryable State”)
● Reduced latency, valuable as an intermediate step

End-to-end stream processing (seconds)
Streaming source and processing:
● Data comes directly from Kafka clusters
● Significantly reduced latency
● Balanced load (no ingestion spikes)
● Reduced resource consumption with Apex support for multi-cluster Kafka consumers
● Reporting meets SLA requirements
Phase 1 (batch ingest)
https://www.slideshare.net/ApacheApex/real-time-insights-for-advertising-tech
Phase 2 (hybrid)
Real-time Streaming
Real-time Dashboard
https://www.slideshare.net/ashishtadose1/realtime-adtech-reporting-targeting-with-apache-apex
Pipeline Transformations
[Diagram: Kafka / Files → Decompress & Parse (x3) → Enrich & Map (x3) → Dimensional Compute (x3) → Store (x2), with Query, Results and Visualization on the serving side. Stream labels: Input Tuples, Parsed Tuples, Enriched Tuples, Partial Aggregates, Aggregate, Query, Results.]
https://www.slideshare.net/ApacheApex/actionable-insights-with-apache-apex-at-apache-big-data-2017-by-devendra-tagare
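The stages named on this slide can be sketched in plain Python (this is not the Apex API). Everything below is a hypothetical illustration: the record layout, the `ADVERTISERS` lookup table and the round-robin partitioning are assumptions, and zlib plus JSON merely stand in for whatever compression and format the real pipeline uses.

```python
import json
import zlib
from collections import Counter

# Hypothetical enrichment lookup: advertiser id -> advertiser name.
ADVERTISERS = {1: "Burger King", 2: "Subway"}


def decompress_and_parse(blob):
    """Decompress one input tuple and parse it into a dict."""
    return json.loads(zlib.decompress(blob))


def enrich_and_map(record):
    """Add the advertiser name and keep only the fields needed downstream."""
    return {
        "advertiser": ADVERTISERS[record["advertiser_id"]],
        "location": record["location"],
        "clicks": record["clicks"],
    }


def partial_aggregate(records):
    """Sum clicks per (advertiser, location) within a single partition."""
    totals = Counter()
    for r in records:
        totals[(r["advertiser"], r["location"])] += r["clicks"]
    return totals


def run_pipeline(blobs, partitions=3):
    """Run every partition through the stages, then merge the partial aggregates."""
    merged = Counter()
    for p in range(partitions):
        # Round-robin split of the input across partitions (an assumption).
        shard = blobs[p::partitions]
        enriched = [enrich_and_map(decompress_and_parse(b)) for b in shard]
        merged.update(partial_aggregate(enriched))
    return merged


raw = [
    {"advertiser_id": 1, "location": "CA", "clicks": 1},
    {"advertiser_id": 2, "location": "CA", "clicks": 2},
    {"advertiser_id": 1, "location": "WA", "clicks": 1},
    {"advertiser_id": 1, "location": "CA", "clicks": 3},
]
blobs = [zlib.compress(json.dumps(r).encode("utf-8")) for r in raw]
totals = run_pipeline(blobs)
print(dict(totals))
```

Merging the per-partition results is safe here because sums are associative, which is what lets each partition compute partial aggregates independently.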
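Dimension Computation
Entries with "=>" show the revenue aggregate before and after applying the event below.
hour  | advertiser  | location | cost | revenue | impr | clicks
10:00 |             |          | 6    | 9 => 10 | 22   | 3
10:00 | Burger King |          | 4    | 6       | 12   | 2
10:00 | Subway      |          | 2    | 3 => 4  | 10   | 2
10:00 |             | CA       | 4    | 6       | 15   | 3
10:00 |             | WA       | 2    | 3 => 4  | 7    | 1
10:00 | Burger King | CA       | 2    | 3       | 5    | 1
10:00 | Burger King | WA       | 2    | 3       | 7    | 1
10:00 | Subway      | CA       | 2    | 3       | 10   | 2
10:00 | Subway      | WA       |      | 0 => 1  |      |
Event:
Advertiser: Subway
Location: WA
Cost: 2
Revenue: 1
Impressions: 5
Clicks: 1
Time: 10:15:30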
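As a minimal sketch of the idea on this slide, the following plain Python (not the Apex dimensional aggregation operator) shows how one event contributes to every combination of its key fields within its hour bucket. The field names and the use of `None` for a rolled-up key are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

KEY_FIELDS = ("advertiser", "location")
METRICS = ("cost", "revenue", "impr", "clicks")


def dimension_keys(event):
    """Yield one aggregate key per subset of the key fields, in the event's hour bucket."""
    hour = event["time"][:2] + ":00"  # 10:15:30 -> 10:00
    for size in range(len(KEY_FIELDS) + 1):
        for kept in combinations(KEY_FIELDS, size):
            # None marks a key field that is rolled up ("all values").
            yield (hour,) + tuple(event[f] if f in kept else None for f in KEY_FIELDS)


def apply_event(aggregates, event):
    """Add the event's metrics to every aggregate row it contributes to."""
    for key in dimension_keys(event):
        row = aggregates[key]
        for metric in METRICS:
            row[metric] += event[metric]


aggregates = defaultdict(lambda: dict.fromkeys(METRICS, 0))
event = {"advertiser": "Subway", "location": "WA", "cost": 2, "revenue": 1,
         "impr": 5, "clicks": 1, "time": "10:15:30"}
apply_event(aggregates, event)

for key in sorted(aggregates, key=str):
    print(key, aggregates[key])
```

With two key fields, a single event touches four aggregate rows: the hourly total, the Subway row, the WA row, and the Subway/WA row.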
Scale
● 6 geographically distributed data centers
● 10 PB of data under management
● 50 TB/day of data generated from auction & client logs
● 40+ billion ad impressions and 350+ billion bids per day
● Average data inflow of 450K events/sec
● 64 Kafka input partitions, 32 instances of in-memory distributed store
● 1.2 TB of memory for the Apex application
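A back-of-envelope reading of these figures, as a rough sketch only. It assumes a steady average rate, an even spread of events across Kafka partitions, and decimal terabytes; the slide states none of these, and it does not say whether the 50 TB/day and the 450K events/sec describe the same stream.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_sec = 450_000      # average inflow from the slide
kafka_partitions = 64         # Kafka input partitions from the slide
bytes_per_day = 50 * 10**12   # 50 TB/day, assuming decimal terabytes

events_per_day = events_per_sec * SECONDS_PER_DAY
events_per_partition = events_per_sec / kafka_partitions
mb_per_sec = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{events_per_day:,} events/day")
print(f"{events_per_partition:,.0f} events/sec per partition")
print(f"{mb_per_sec:,.0f} MB/sec")
```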
Why Apex
● State Management & Fault tolerance
○ Exactly-once, Checkpointing and Windowing
○ Fine grained recovery, low-latency SLA support
○ Queryable state
● Processing based on event time
○ Accuracy, Repeatable/Replay
● Native Streaming
○ Low latency + high throughput, efficient resource utilization
○ Pipelined processing (data in motion)
● Scalability
○ Process more data by adding compute resources, no platform/architecture limits
○ Dynamic scaling and resource allocation, elasticity
● Library of connectors and transformations
○ Time to value
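The "Repeatable/Replay" point about event time can be illustrated with a small sketch in plain Python (not Apex code). It assumes each event carries its own timestamp in seconds. Because the bucket is derived from that timestamp rather than from arrival time, replaying the same events in a different order yields the same counts.

```python
from collections import Counter


def count_per_minute(events):
    """Count events per one-minute bucket using the timestamp carried by each event."""
    counts = Counter()
    for event_time, _payload in events:
        counts[event_time - event_time % 60] += 1
    return counts


in_order = [(5, "a"), (50, "b"), (65, "c"), (130, "d")]
# The same events, arriving in a different order (for example during a replay).
replayed = [in_order[2], in_order[0], in_order[3], in_order[1]]

print(count_per_minute(in_order) == count_per_minute(replayed))
```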
Apex Library
Stateful Transformations
• Windowing: sliding, tumbling, session
• Accumulations: sum, merge, join, sort, top n, …
• Triggering, Watermarks
• Dimensional Aggregations (with state management for historical data + query)
• Deduplication
RDBMS
• JDBC
• MySQL
• Oracle
• MemSQL
NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase, CouchDB
• Redis, MongoDB
• Geode, Kudu
Messaging
• Kafka
• JMS (ActiveMQ etc.)
• Kinesis, SQS
• Flume, NiFi
• MQTT
File Systems
• HDFS / Hive
• Local File
• S3
• FTP
Stateless Transformations
• Parsers: XML, JSON, CSV, Avro
• Filter
• Enrich
• Configurable POJO schema
• Map, FlatMap (custom Java function)
• Script (JavaScript, Jython)
Other
• Elastic Search
• Solr
• Twitter
• WebSocket / HTTP
• SMTP
How to build it
Example Application (Twitter)
● Top N hashtags
● Tweet stats time series
● Queryable state
● WebSocket Pub/Sub
● Visualization with Grafana
Source code: https://github.com/tweise/apex-samples/tree/master/twitter
Real-time Visualization
Top Hashtags
● Keyed sum accumulation (5 minute window, count trigger)
● TopN accumulation of upstream windowed counts
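The two bullets above can be sketched in plain Python as follows. This is an illustration of the technique, not the Apex windowed operator API: the class and function names are hypothetical, timestamps are assumed to be seconds, and the count trigger is simplified to fire after every `trigger_count` tuples in a window.

```python
import heapq
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60  # 5 minute tumbling window, as on the slide


class KeyedSum:
    """Counts hashtags per window and fires after every `trigger_count` tuples."""

    def __init__(self, trigger_count):
        self.trigger_count = trigger_count
        self.windows = defaultdict(Counter)  # window start -> hashtag counts
        self.pending = Counter()             # window start -> tuples since last firing

    def process(self, timestamp, hashtag):
        """Accumulate one tuple; return (window_start, counts) when the trigger fires."""
        start = timestamp - timestamp % WINDOW_SECONDS
        self.windows[start][hashtag] += 1
        self.pending[start] += 1
        if self.pending[start] < self.trigger_count:
            return None
        self.pending[start] = 0
        return start, dict(self.windows[start])


def top_n(counts, n):
    """Return the n hashtags with the highest counts, largest first."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


tweets = [(0, "#apex"), (30, "#kafka"), (60, "#apex"), (90, "#apex"),
          (120, "#beam"), (150, "#kafka"), (310, "#yarn")]

keyed_sum = KeyedSum(trigger_count=3)
latest_top = None
for ts, tag in tweets:
    fired = keyed_sum.process(ts, tag)
    if fired is not None:
        window_start, counts = fired
        latest_top = top_n(counts, 2)

print(latest_top)
```

The TopN step only sees counts that the upstream accumulation has emitted, so it reflects the window's state as of the last trigger firing rather than every individual tuple.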
Queryable State
A set of operators in the library that support real-time queries of operator state.
[Diagram labels: Twitter Feed Input Operator, Hashtag Extractor, CountByKey Window, TopN Window, Snapshot Server, Result, Query Input, Pub/Sub Broker (HTTP, WebSocket)]
● Pub/Sub server: https://github.com/atrato/pubsub-server
● Grafana data source: https://github.com/atrato/apex-grafana-datasource-server
Queryable State
● Snapshot server
○ Stateful operator that holds last received list of objects
○ Receives query and emits the list as JSON formatted query result
● Source schema configured, result fields via query
● Predefined schemas (Apex library): “Snapshot”, “Dimensional”
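A minimal sketch of the snapshot server behaviour described above, in plain Python. The class name, method names, and the JSON layout of the query and the result are assumptions for illustration; they are not the schema format used by the Apex library.

```python
import json


class SnapshotServer:
    """Holds the last received list of objects and answers queries against it."""

    def __init__(self, schema_fields):
        self.schema_fields = tuple(schema_fields)  # configured source schema
        self.snapshot = []

    def on_data(self, rows):
        """Replace the held snapshot with the newly received list."""
        self.snapshot = list(rows)

    def on_query(self, query_json):
        """Return the snapshot as a JSON result limited to the requested fields."""
        query = json.loads(query_json)
        fields = [f for f in query["fields"] if f in self.schema_fields]
        data = [{f: row[f] for f in fields} for row in self.snapshot]
        return json.dumps({"id": query["id"], "data": data})


server = SnapshotServer(["hashtag", "count"])
server.on_data([{"hashtag": "#apex", "count": 3}, {"hashtag": "#kafka", "count": 2}])
result = json.loads(server.on_query('{"id": "q1", "fields": ["hashtag"]}'))
print(result)
```

The source schema is fixed when the server is configured, while each query chooses which of those fields to return, matching the "result fields via query" bullet.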
Demo
Apex - Recent Additions & Roadmap
● Apex runner in Apache Beam
● Iterative processing
● Integrated with Apache SAMOA, opens up ML
● Integrated with Apache Calcite, enables SQL
● Scalable, incremental state management
● User defined control tuples (watermarks, batch control, …)
● Enhanced support for Batch Processing
● Support for Mesos and Kubernetes
● Encrypted Streams
● Support for Python
Resources
• http://apex.apache.org/
• Powered by Apex - http://apex.apache.org/powered-by-apex.html
• Learn more - http://apex.apache.org/docs.html
• Getting involved - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - https://www.meetup.com/topics/apache-apex/
• Examples - https://github.com/apache/apex-malhar/tree/master/examples
• Slideshare - http://www.slideshare.net/ApacheApex/presentations
From Batch to Streaming with Apache Apex Dataworks Summit 2017
