Apache Apex
Intro to Apex
Ingestion and Dimensions Compute for a customer use-case
Devendra Tagare
devendrat@datatorrent.com
@devtagare
9th July 2016
What is Apex
2
• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application
Applications on Apex
3
• Distributed processing
• Application logic broken into components called operators that run in a distributed fashion across your cluster
• Scalable
• Operators can be scaled up or down at runtime according to the load and SLA
• Fault tolerant
• Automatically recover from node outages without having to reprocess from beginning
• State is preserved
• Long running applications
• Operators
• Use library to build applications quickly
• Write your own in Java using the API (see the operator sketch below)
• Operational insight – DataTorrent RTS
• See how each operator is performing and even record data
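A hedged sketch (not from the deck) of what writing your own operator in Java against the Apex API looks like; the class, field and port names below are illustrative, while BaseOperator, DefaultInputPort and DefaultOutputPort are the public API classes.

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Receives String tuples, drops empty lines and emits each line's length downstream.
public class LineLengthOperator extends BaseOperator
{
  // Output port: one Integer per accepted input tuple
  public final transient DefaultOutputPort<Integer> output = new DefaultOutputPort<>();

  // Input port: process() is invoked once per tuple, on a single thread per partition
  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String line)
    {
      if (line != null && !line.isEmpty()) {
        output.emit(line.length());
      }
    }
  };
}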
Apex Stack Overview
4
Apex Operator Library - Malhar
5
Native Hadoop Integration
6
• YARN is the resource manager
• HDFS is used for storing any persistent state
Application Development Model
7
 A Stream is a sequence of data tuples
 A typical Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• Operator has many instances that run in parallel and each instance is single-threaded
 Directed Acyclic Graph (DAG) is made up of operators and streams
(Diagram: a Directed Acyclic Graph (DAG) of operators connected by streams, with tuples flowing from operator to operator toward the output stream.)
Advanced Windowing Support
8
 Application window
 Sliding window and tumbling window
 Checkpoint window
 No artificial latency
Application in Java
9
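The slide shows the application assembly as a code screenshot that did not survive extraction. Below is a minimal, hedged sketch of such an application using the standard Apex API; the operator and stream names are illustrative (LineByLineFileInputOperator and ConsoleOutputOperator come from the Malhar library, LineLengthOperator is the operator sketched earlier, and the directory path is made up), and the last line shows how the application window from the previous slide is typically configured.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.io.ConsoleOutputOperator;
import com.datatorrent.lib.io.fs.LineByLineFileInputOperator;

@ApplicationAnnotation(name = "MyFirstApplication")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Wire operators into a DAG: reader -> lengths -> console
    LineByLineFileInputOperator reader = dag.addOperator("reader", new LineByLineFileInputOperator());
    reader.setDirectory("/tmp/input"); // illustrative input directory

    LineLengthOperator lengths = dag.addOperator("lengths", new LineLengthOperator());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    dag.addStream("lines", reader.output, lengths.input);
    dag.addStream("lineLengths", lengths.output, console.input);

    // Application window of 10 streaming windows (~5 s with the default 500 ms streaming window)
    dag.setAttribute(lengths, Context.OperatorContext.APPLICATION_WINDOW_COUNT, 10);
  }
}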
Partitioning and unification
10
(Diagrams: a logical DAG 0 → 1 → 2 → 3; the physical diagram with operator 1 running as 3 partitions (1a, 1b, 1c) merged by a unifier; the physical DAG with (1a, 1b, 1c) and (2a, 2b) connected through NxM partitioned streams with unifiers, showing no bottleneck; and the physical DAG with (1a, 1b, 1c) and (2a, 2b) routed through an intermediate unifier, showing a bottleneck on that unifier.)
Advanced Partitioning
11
(Diagrams: the physical DAG for a logical DAG 0 → 1 → 2 → 3 → 4 with operator 1 split into 1a and 1b ahead of a unifier; the physical DAG with parallel partition, where the chains 1a → 2a → 3a and 1b → 2b → 3b run independently and are unified before operator 4; and cascading unifiers, showing the logical plan, the execution plan for N = 4, M = 1, and the execution plan for N = 4, M = 1, K = 2 with cascading unifiers laid out across containers and NICs.)
Dynamic Partitioning
12
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner (see the sketch below)
(Diagram, unifiers not shown: operator 2 scaling from partitions 2a, 2b to 2a through 2d and operator 3 splitting into 3a, 3b at runtime, while upstream partitions 1a, 1b keep running.)
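A hedged sketch (not from the deck) of requesting a fixed number of partitions for an operator from within populateDAG, using the library's StatelessPartitioner; the operator class (the one sketched earlier) and the partition count are illustrative, and the same PARTITIONER attribute is commonly set in the application configuration file instead of in code.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.partitioner.StatelessPartitioner;

public class PartitionedApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Any operator class is partitioned the same way; LineLengthOperator is just an example.
    LineLengthOperator lengths = dag.addOperator("lengths", new LineLengthOperator());

    // Ask the platform to run 8 parallel instances; Apex creates the partitions
    // and the unifier that merges their output.
    dag.setAttribute(lengths, Context.OperatorContext.PARTITIONER,
        new StatelessPartitioner<LineLengthOperator>(8));
  }
}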
Use Case ...
• Ingest from Kafka and S3
• Parse, Filter and Enrich
• Dimensional compute for key performance indicators
• Reporting of critical metrics around campaign monetization
• Aggregate counters & reporting on top N metrics
• Low latency querying using Kafka in a pub-sub model
Screenshots - Demo UI
Scale
• 6 geographically distributed data centers
• Combination of co-located & AWS-based DCs
• > 5 PB under data management
• 22 TB / day of data generated from auction & client logs
• Heterogeneous data log formats
• North of 15 Bn impressions / day
• Average data inflow of 200K events/s
15
Initial Requirements
• Ad server log events consumed as Avro-encoded, Snappy-compressed files from S3; new files uploaded every 10-20 minutes.
• Data may arrive in S3 out of order (by timestamp).
• Event size is about 2 KB uncompressed; only a subset of fields is retrieved for aggregation.
• Aggregates kept in memory (checkpointed) with an expiration policy; query processing runs against the in-memory data.
• Front-end integration through a Kafka-based query protocol for real-time dashboard components.
Architecture 1.0 - Batch Reads + Streaming Aggregations
17
(Diagram: real-time architecture powered by Apex. AdServers deliver auction and client logs through REST proxies to S3; inside the Apex application, partitioned S3Reader operators feed Filter Operators, filtered events flow into Dimensions Aggregators, aggregates are held in a Dimensions Store, and queries from the middleware and their results are exchanged with the store over a Kafka cluster.)
Challenges
• Unstable S3 client libraries
– Unpredictable hangs and corrupted data
– On a hang, the master kills the container and restarts reading of the file from a different container
– Corrupt files caused containers to be killed; addressed with an application-configurable retry mechanism and skipping of bad files
– Limited read throughput: 1 reader per file
• Out-of-order data
– Some timestamps in the future and the past
• Spike in load when new files are added, followed by periods of inactivity
• Memory requirement for the Store
– Cardinality estimation for incoming data
Architecture 2.0 - Batch + Streaming
19
(Diagram: real-time architecture powered by Apex, combining batch and streaming ingestion. AdServers deliver auction and client logs through REST proxies to Kafka clusters and to S3; the Apex application reads with Kafka Input operators (auction logs) and an S3Reader, ETL operators decompress and flatten the Kafka messages, Filter Operators forward filtered events to Dimensions Aggregators, and aggregates are held in a Dimensions Store/HDHT; queries from the middleware and their results are exchanged over a Kafka cluster.)
Challenges
• Complex Logical DAG
• Kafka Operator Issues
– Dynamic Partitioning
– Memory Configuration
– Offset snapshotting to ensure exactly-once semantics
• Resource Allocation
– More memory required for the Store (large number of Unifiers)
• Harder Debugging (more components)
– GBs of container logs
– Difficult to locate the sequence of failure
• More data transferred over the wire within the cluster
• Limiting the Kafka read rate
Architecture 3.0 - Streaming
21
(Diagram: real-time architecture powered by Apex, fully streaming. AdServers publish auction logs and user browsers publish client logs, via REST proxies and a CDN that caches the logs, into Kafka clusters; Kafka Input operators for auction logs and client logs feed ETL operators that decompress and flatten the messages, Filter Operators pass filtered events to Dimensions Aggregators, and aggregates are held in a Dimensions Store; queries from the middleware and their results are exchanged over a Kafka cluster.)
Operational Architecture
Application Configuration
• 64 Kafka Input operators reading from 6 geographically distributed DCs
• Under 40 seconds end-to-end latency - from ad serving to visualization
• 32 instances of the in-memory distributed store
• 64 aggregators
• 1.2 TB memory footprint @ peak load
• The in-memory store was later replaced by HDHT for fault tolerance
23
Learnings
• DAG – sizing, locality & partitioning (benchmark)
• Memory sizing for the store or other memory-heavy operators
• Cardinality estimation for incoming data is critical
• Upstream operators tend to require more memory than downstream operators for high-velocity reads
• Back pressure arises from downstream failures, skew in the velocity of events, and upstream failures; Buffer Server sizing is critical (see the sketch below)
• For end-to-end exactly-once it is necessary to understand the external systems' semantics & delivery guarantees
• Think fault tolerance & recovery before starting implementation
24
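A hedged sketch (not from the deck) of the kind of memory-related attributes the sizing points above refer to, set from populateDAG; the operator (the one sketched earlier), the port and the sizes are illustrative, and the same attributes are usually set in the application configuration file rather than in code.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class SizedApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    LineLengthOperator lengths = dag.addOperator("lengths", new LineLengthOperator());

    // Give a memory-heavy operator (e.g. a dimensions store) a larger container.
    dag.setAttribute(lengths, Context.OperatorContext.MEMORY_MB, 8192);

    // Size the buffer server memory on a high-volume output port; this is where
    // back pressure from slow or failed downstream operators is absorbed.
    dag.setOutputPortAttribute(lengths.output, Context.PortContext.BUFFER_MEMORY_MB, 1024);
  }
}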
Before And After
25
Before Scenario - No Real-time (reporting latency: 5 hours + 20 minutes)
• No real-time processing system in place
• Publishers and buyers could only rely on a batch processing system for gathering relevant data
• Outdated data, not relevant to the current time
• Current data being pushed to a waiting queue
• Cumbersome batch-processing lifecycle
• No visualization for reports
• No glimpse into everyday happenings, translating to lost decisions or untimely decision-making scenarios

After Scenario, Phase 1 - Batch + Real-time (~ 30 seconds)
• With DataTorrent RTS (built on Apache Apex), the dev team put together the first real-time analytics platform
• This enabled reporting of critical metrics around campaign monetization
• Reuse of the batch ingestion mechanism for the impression data, shared with other pipelines (S3)

After Scenario, Phase 2 - Real-time Streaming
• Reduced end-to-end latency through real-time ingestion of impression data from Kafka
• Results available much sooner to the user
• Balances load (no more batch ingestion spikes), reduces resource consumption
• Handles ever-growing traffic with more efficient resource utilization
Operators used
S3 reader (File Input Operator)
• Recursively reads the contents of an S3 bucket based on a partitioning pattern
• Inclusion & exclusion support
• Fault tolerance (replay and idempotent)
• Throughput of over 12K reads/second for event size of 1.2 KB each
Kafka Input Operator
• Ability to consume from multiple Kafka clusters
• Offset management support
• Fault tolerant reads
• Support for idempotent & exactly once semantics
• Controlled reads for managing back-pressure
POJO Enrichment Operator
• Takes a POJO as input and does a look-up in a store for a given key
• Supports caching
• Stores are pluggable
• App Builder ready
26
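A hedged sketch (not from the deck) of wiring the Kafka Input Operator described above into a DAG, assuming the malhar-kafka KafkaSinglePortInputOperator and its cluster/topic/initial-offset properties; broker addresses, topic, operator and stream names are illustrative.

import org.apache.hadoop.conf.Configuration;

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.lib.io.ConsoleOutputOperator;

public class KafkaIngestApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Consume auction-log events from a Kafka cluster (the operator can also consume from multiple clusters).
    KafkaSinglePortInputOperator kafkaIn = dag.addOperator("auctionLogs", new KafkaSinglePortInputOperator());
    kafkaIn.setClusters("broker1:9092,broker2:9092");
    kafkaIn.setTopics("auction_logs");
    // Resume from checkpointed offsets on restart; fall back to earliest on a cold start.
    kafkaIn.setInitialOffset("APPLICATION_OR_EARLIEST");

    // Downstream parse/filter/aggregate operators are elided; print raw messages for illustration.
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("messages", kafkaIn.outputPort, console.input);
  }
}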
Operators used (cont …)
Parser
• Specify JSON schema
• Emits a POJO based on the output schema
• No user code required
Dimension Store
• Distributed in-memory store
• Supports re-aggregation of events
• Partitioning of aggregates per view
• Low latency query support with a pub/sub model using Kafka
HDHT
• HDFS backed embedded key-value store
• Fault tolerant, random read & write
• Durability in case of cold restarts
27
Dimensional Model - Key Concepts
Metrics : pieces of information we want to collect statistics about.
Dimensions : variables which can impact our measures.
Combinations : sets of dimensions for which one or more metrics would be aggregated. They are sub-sets of the dimensions.
Aggregations : the aggregate functions, e.g. SUM, TOPN, standard deviation.
Example :
Dimensions - campaignId, advertiserId, time
Metrics - Cost, revenue, clicks, impressions
Aggregate functions - SUM, AM, etc.
Combinations :
1. campaignId x time - cost, revenue
2. advertiser - revenue, impressions
3. campaignId x advertiser x time - revenue, clicks, impressions
How to aggregate on the combinations?
Dimensional Model
Dimensions Schema
{"keys":[{"name":"campaignId","type":"integer"},
{"name":"adId","type":"integer"},
{"name":"creativeId","type":"integer"},
{"name":"publisherId","type":"integer"},
{"name":"adOrderId","type":"integer"}],
"timeBuckets":["1h","1d"],
"values":
[{"name":"impressions","type":"integer","aggregators":["SUM"]},
{"name":"clicks","type":"integer","aggregators":["SUM"]},
{"name":"revenue","type":"integer"}],
"dimensions":
[{"combination":["campaignId","adId"]},
{"combination":["creativeId","campaignId"]},
{"combination":["campaignId"]},
{"combination":["publisherId","adOrderId","campaignId"],"additionalValues":["revenue:SUM"]}]
}
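To make the schema concrete, here is a small worked illustration (not from the deck; all field values are invented) of how two events landing in the same 1h bucket would be aggregated for the declared combinations:

Input events:
  {"campaignId":5, "adId":7, "creativeId":11, "publisherId":2, "adOrderId":9, "impressions":1, "clicks":0, "revenue":3}
  {"campaignId":5, "adId":7, "creativeId":12, "publisherId":2, "adOrderId":9, "impressions":1, "clicks":1, "revenue":2}

Resulting 1h aggregates (one row per combination key; the same aggregation is repeated for the 1d bucket):
  ["campaignId","adId"]                    key (5,7)          -> impressions SUM = 2, clicks SUM = 1
  ["creativeId","campaignId"]              keys (11,5),(12,5) -> impressions SUM = 1 each, clicks SUM = 0 and 1 respectively
  ["campaignId"]                           key (5)            -> impressions SUM = 2, clicks SUM = 1
  ["publisherId","adOrderId","campaignId"] key (2,9,5)        -> impressions SUM = 2, clicks SUM = 1, revenue SUM = 5 (revenue only here, via additionalValues)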
More Use-cases
• Real-time Monitoring
Alerts on deal tracking & monetization
Campaign & deal health
• Real-time Learning
Using the lost bid insights for price recommendations.
• Allocation Engine
Feedback to ad serving for guaranteed delivery & line item pacing
30
Data Processing Pipeline Example
App Builder
31
Monitoring Console
Logical View
32
Monitoring Console
Physical View
33
Real-Time Dashboards
Real Time Visualization
34
Q&A
35
Resources
36
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
Editor's Notes

  • #27: Thomas – Mention these are extensions of malhar (Open Source)