From Batch to Streaming with Apache Apex Dataworks Summit 2017

From Batch to Streaming ET(L) with
Apache Apex
Thomas Weise
Apache Apex PMC Chair
thw@apache.org
@thweise @atrato_io

Stream Data Processing with Apache Apex
2
Mobile Devices
Logs
Sensor Data
Social
Databases
CDC
Oper1 Oper2 Oper3
Real-time visualization,
storage, etc
Data Delivery & Storage Transform / Analytics
SQL
Declarative
API
DAG API
SAMOA
Beam
Operator
Library
SAMOA
Beam
(roadmap)
Data Sources

https://www.slideshare.net/ashishtadose1/realtime-adtech-reporting-targeting-with-apache-apex
Use Case
3

Batch processing with several
hours till insight:
● Available data stale, does
no longer apply to current
situation
● Current data stuck in batch
pipeline
● Complex batch processing
orchestration with many
different components
● Hours of delay translate to
high cost due to inability to
make timely campaign
adjustments.
Batch pipeline
> 5 hours
Processing with Apex, reuse
batch ingestion:
● Existing ingestion
mechanism (files in S3,
shared with other
pipelines)
● Migrate transform logic to
Apex
● Enable reporting from
application state
(“Queryable State”)
● Reduced latency, valuable
as intermediate step.
Batch ingest +
streaming transforms
~ 20 minutes
Streaming source and
processing:
● Data comes directly from
Kafka clusters
● Significantly reduced
latency
● Balanced load (no
ingestion spikes)
● Reduced resource
consumption with Apex
support for multi-cluster
Kafka consumers
● Reporting meets SLA
requirements
End-to-end stream
processing
seconds
4
Phased Transition

Phase 1 (batch ingest)
https://www.slideshare.net/ApacheApex/real-time-insights-for-advertising-tech5

Real-time Dashboard
https://www.slideshare.net/ashishtadose1/realtime-adtech-reporting-targeting-with-apache-apex8

Pipeline Transformations
9
Kafka/
Files
Decompress
& Parse
Decompress
& Parse
Decompress
& Parse
Enrich
& Map
Enrich
& Map
Enrich
& Map
Dimensional
Compute
Dimensional
Compute
Dimensional
Compute
Query
Results
Visualization
Input Tuples
Input Tuples
Input Tuples
Parsed
Tuples
Parsed
Tuples
Parsed
Tuples
Enriched
Tuples
Enriched
Tuples
Enriched
Tuples
Partial
Aggregates
Partial
Aggregates
Partial
Aggregates
Visualization
Results
Visualization
Query
Aggregate
Query
Aggregate
Results
https://www.slideshare.net/ApacheApex/actionable-insights-with-apache-apex-at-apache-big-data-2017-by-devendra-tagare
Store
Store

Dimension Computation
10
hour advertiser location cost revenue impr clicks
10:00 6 9 => 10 22 3
10:00 Burger King 4 6 12 2
10:00 Subway 2 3 => 4 10 2
10:00 CA 4 6 15 3
10:00 WA 2 3 => 4 7 1
10:00 Burger King CA 2 3 5 1
10:00 Burger King WA 2 3 7 1
10:00 Subway CA 2 3 10 2
10:00 Subway WA 0 => 1
Advertiser: Subway
Location: WA
Cost: 2
Revenue: 1
Impressions: 5
Clicks: 1
Time: 10:15:30

● 6 geographically distributed data centers
● 10 PB of data under management
● 50 TB/day of data generated from auction & client logs
● 40+ billion ad impressions and 350+ billion bids per day
● Average data inflow of 450K events/sec
● 64 Kafka Input partitions, 32 instances of in-memory distributed store
● 1.2 TB of memory for the Apex application
Scale
11

● State Management & Fault tolerance
○ Exactly-once, Checkpointing and Windowing
○ Fine grained recovery, low-latency SLA support
○ Queryable state
● Processing based on event time
○ Accuracy, Repeatable/Replay
● Native Streaming
○ Low latency + high throughput, efficient resource utilization
○ Pipelined processing (data in motion)
● Scalability
○ Process more data by adding compute resources, no platform/architecture limits
○ Dynamic scaling and resource allocation, elasticity
● Library of connectors and transformations
○ Time to value
Why Apex
12

Apex Library
13
Stateful Transformations
• Windowing: sliding, tumbling, session
• Accumulations: sum, merge, join, sort, top n, …
• Triggering, Watermarks
• Dimensional Aggregations (with state management for historical
data + query)
• Deduplication
RDBMS
• JDBC
• MySQL
• Oracle
• MemSQL
NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase, CouchDB
• Redis, MongoDB
• Geode, Kudu
Messaging
• Kafka
• JMS (ActiveMQ etc.)
• Kinesis, SQS
• Flume, NiFi
• MQTT
File Systems
• HDFS / Hive
• Local File
• S3
• FTP
Stateless Transformations
• Parsers: XML, JSON, CSV, Avro
• Filter
• Enrich
• Configurable POJO schema
• Map, FlatMap (custom Java function)
• Script (JavaScript, Jython)
Other
• Elastic Search
• Solr
• Twitter
• WebSocket / HTTP
• SMTP

Example Application (Twitter)
● Top N hashtags
● Tweet stats time series
● Queryable state
● WebSocket Pub/Sub
● Visualization with Grafana
Source code: https://github.com/tweise/apex-samples/tree/master/twitter
15

Top Hashtags
● Keyed sum accumulation (5 minute window, count trigger)
● TopN accumulation of upstream windowed counts
17

Queryable State
A set of operators in the library that support real-time queries of operator state.
18
Hashtag
Extractor
TopN
Window
Twitter Feed
Input
Operator
CountByKey
Window
Snapshot
Server Result
Pub/Sub
Broker
HTTPWebSocket
Query
Input
● Pub/Sub server: https://github.com/atrato/pubsub-server
● Grafana data source: https://github.com/atrato/apex-grafana-datasource-server

Queryable State
● Snapshot server
○ Stateful operator that holds last received list of objects
○ Receives query and emits the list as JSON formatted query result
● Source schema configured, result fields via query
● Predefined schemas (Apex library): “Snapshot”, “Dimensional”
19

● Apex runner in Apache Beam
● Iterative processing
● Integrated with Apache Samoa, opens up ML
● Integrated with Apache Calcite, enables SQL
● Scalable, incremental state management
● User defined control tuples (watermarks, batch control, …)
● Enhanced support for Batch Processing
● Support for Mesos and Kubernetes
● Encrypted Streams
● Support for Python
Apex - Recent Additions & Roadmap
21

Resources
22
• http://apex.apache.org/
• Powered by Apex - http://apex.apache.org/powered-by-apex.html
• Learn more - http://apex.apache.org/docs.html
• Getting involved - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - https://www.meetup.com/topics/apache-apex/
• Examples - https://github.com/apache/apex-malhar/tree/master/examples
• Slideshare - http://www.slideshare.net/ApacheApex/presentations

From Batch to Streaming with Apache Apex Dataworks Summit 2017

Recommended

More Related Content

What's hot (20)

Similar to From Batch to Streaming with Apache Apex Dataworks Summit 2017 (20)

More from Apache Apex (11)

Recently uploaded (20)

From Batch to Streaming with Apache Apex Dataworks Summit 2017