Apache Apex
Intro to Apex
Ingestion and Dimensions Compute for a customer use-case
Devendra Tagare
devendrat@datatorrent.com
@devtagare
9th July 2016
What is Apex
2
• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application
Applications on Apex
3
• Distributed processing
• Application logic broken into components called operators that run in a distributed fashion across your cluster
• Scalable
• Operators can be scaled up or down at runtime according to the load and SLA
• Fault tolerant
• Automatically recover from node outages without having to reprocess from beginning
• State is preserved
• Long running applications
• Operators
• Use library to build applications quickly
• Write your own in Java using the API (see the operator sketch below)
• Operational insight – DataTorrent RTS
• See how each operator is performing and even record data
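A hedged sketch (not from the deck) of what writing your own operator in Java against the Apex API looks like; the class, field and port names below are illustrative, while BaseOperator, DefaultInputPort and DefaultOutputPort are the public API classes.

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Receives String tuples, drops empty lines and emits each line's length downstream.
public class LineLengthOperator extends BaseOperator
{
  // Output port: one Integer per accepted input tuple
  public final transient DefaultOutputPort<Integer> output = new DefaultOutputPort<>();

  // Input port: process() is invoked once per tuple, on a single thread per partition
  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String line)
    {
      if (line != null && !line.isEmpty()) {
        output.emit(line.length());
      }
    }
  };
}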
Apex Stack Overview
4
Apex Operator Library - Malhar
5
Native Hadoop Integration
6
• YARN is the resource manager
• HDFS is used for storing any persistent state
Application Development Model
7
 A Stream is a sequence of data tuples
 A typical Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• Operator has many instances that run in parallel and each instance is single-threaded
 Directed Acyclic Graph (DAG) is made up of operators and streams
(Diagram: a Directed Acyclic Graph (DAG) of operators connected by streams, with tuples flowing from operator to operator toward the output stream.)
Advanced Windowing Support
8
 Application window
 Sliding window and tumbling window
 Checkpoint window
 No artificial latency
Application in Java
9
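The slide shows the application assembly as a code screenshot that did not survive extraction. Below is a minimal, hedged sketch of such an application using the standard Apex API; the operator and stream names are illustrative (LineByLineFileInputOperator and ConsoleOutputOperator come from the Malhar library, LineLengthOperator is the operator sketched earlier, and the directory path is made up), and the last line shows how the application window from the previous slide is typically configured.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.io.ConsoleOutputOperator;
import com.datatorrent.lib.io.fs.LineByLineFileInputOperator;

@ApplicationAnnotation(name = "MyFirstApplication")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Wire operators into a DAG: reader -> lengths -> console
    LineByLineFileInputOperator reader = dag.addOperator("reader", new LineByLineFileInputOperator());
    reader.setDirectory("/tmp/input"); // illustrative input directory

    LineLengthOperator lengths = dag.addOperator("lengths", new LineLengthOperator());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    dag.addStream("lines", reader.output, lengths.input);
    dag.addStream("lineLengths", lengths.output, console.input);

    // Application window of 10 streaming windows (~5 s with the default 500 ms streaming window)
    dag.setAttribute(lengths, Context.OperatorContext.APPLICATION_WINDOW_COUNT, 10);
  }
}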
Partitioning and unification
10
(Diagrams: a logical DAG 0 → 1 → 2 → 3; the physical diagram with operator 1 running as 3 partitions (1a, 1b, 1c) merged by a unifier; the physical DAG with (1a, 1b, 1c) and (2a, 2b) connected through NxM partitioned streams with unifiers, showing no bottleneck; and the physical DAG with (1a, 1b, 1c) and (2a, 2b) routed through an intermediate unifier, showing a bottleneck on that unifier.)
Advanced Partitioning
11
(Diagrams: the physical DAG for a logical DAG 0 → 1 → 2 → 3 → 4 with operator 1 split into 1a and 1b ahead of a unifier; the physical DAG with parallel partition, where the chains 1a → 2a → 3a and 1b → 2b → 3b run independently and are unified before operator 4; and cascading unifiers, showing the logical plan, the execution plan for N = 4, M = 1, and the execution plan for N = 4, M = 1, K = 2 with cascading unifiers laid out across containers and NICs.)
Dynamic Partitioning
12
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner (see the sketch below)
(Diagram, unifiers not shown: operator 2 scaling from partitions 2a, 2b to 2a through 2d and operator 3 splitting into 3a, 3b at runtime, while upstream partitions 1a, 1b keep running.)
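A hedged sketch (not from the deck) of requesting a fixed number of partitions for an operator from within populateDAG, using the library's StatelessPartitioner; the operator class (the one sketched earlier) and the partition count are illustrative, and the same PARTITIONER attribute is commonly set in the application configuration file instead of in code.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.partitioner.StatelessPartitioner;

public class PartitionedApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Any operator class is partitioned the same way; LineLengthOperator is just an example.
    LineLengthOperator lengths = dag.addOperator("lengths", new LineLengthOperator());

    // Ask the platform to run 8 parallel instances; Apex creates the partitions
    // and the unifier that merges their output.
    dag.setAttribute(lengths, Context.OperatorContext.PARTITIONER,
        new StatelessPartitioner<LineLengthOperator>(8));
  }
}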
Use Case ...
• Ingest from Kafka and S3
• Parse, Filter and Enrich
• Dimensional compute for key performance indicators
• Reporting of critical metrics around campaign monetization
• Aggregate counters & reporting on top N metrics
• Low latency querying using Kafka in a pub-sub model
Screenshots - Demo UI
Scale
• 6 geographically distributed data centers
• Combination of co-located & AWS-based DCs
• > 5 PB under data management
• 22 TB / day of data generated from auction & client logs
• Heterogeneous data log formats
• North of 15 Bn impressions / day
• Average data inflow of 200K events/s
15
Initial Requirements
• Ad server log events consumed as Avro-encoded, Snappy-compressed files from S3; new files uploaded every 10-20 minutes.
• Data may arrive in S3 out of order (by timestamp).
• Event size is about 2 KB uncompressed; only a subset of fields is retrieved for aggregation.
• Aggregates kept in memory (checkpointed) with an expiration policy; query processing runs against the in-memory data.
• Front-end integration through a Kafka-based query protocol for real-time dashboard components.
Architecture 1.0 - Batch Reads + Streaming Aggregations
17
(Diagram: real-time architecture powered by Apex. AdServers deliver auction and client logs through REST proxies to S3; inside the Apex application, partitioned S3Reader operators feed Filter Operators, filtered events flow into Dimensions Aggregators, aggregates are held in a Dimensions Store, and queries from the middleware and their results are exchanged with the store over a Kafka cluster.)
Challenges
• Unstable S3 client libraries
– Unpredictable hangs and corrupted data
– On a hang, the master kills the container and restarts reading of the file from a different container
– Corrupt files caused containers to be killed; addressed with an application-configurable retry mechanism and skipping of bad files
– Limited read throughput: 1 reader per file
• Out-of-order data
– Some timestamps in the future and the past
• Spike in load when new files are added, followed by periods of inactivity
• Memory requirement for the Store
– Cardinality estimation for incoming data
Architecture 2.0 - Batch + Streaming
19
(Diagram: real-time architecture powered by Apex, combining batch and streaming ingestion. AdServers deliver auction and client logs through REST proxies to Kafka clusters and to S3; the Apex application reads with Kafka Input operators (auction logs) and an S3Reader, ETL operators decompress and flatten the Kafka messages, Filter Operators forward filtered events to Dimensions Aggregators, and aggregates are held in a Dimensions Store/HDHT; queries from the middleware and their results are exchanged over a Kafka cluster.)
Challenges
• Complex Logical DAG
• Kafka Operator Issues
– Dynamic Partitioning
– Memory Configuration
– Offset snapshotting to ensure exactly-once semantics
• Resource Allocation
– More memory required for the Store (large number of Unifiers)
• Harder Debugging (more components)
– GBs of container logs
– Difficult to locate the sequence of failure
• More data transferred over the wire within the cluster
• Limiting the Kafka read rate
Architecture 3.0 - Streaming
21
(Diagram: real-time architecture powered by Apex, fully streaming. AdServers publish auction logs and user browsers publish client logs, via REST proxies and a CDN that caches the logs, into Kafka clusters; Kafka Input operators for auction logs and client logs feed ETL operators that decompress and flatten the messages, Filter Operators pass filtered events to Dimensions Aggregators, and aggregates are held in a Dimensions Store; queries from the middleware and their results are exchanged over a Kafka cluster.)
Operational Architecture
Application Configuration
• 64 Kafka Input operators reading from 6 geographically distributed DCs
• Under 40 seconds end-to-end latency - from ad serving to visualization
• 32 instances of the in-memory distributed store
• 64 aggregators
• 1.2 TB memory footprint @ peak load
• The in-memory store was later replaced by HDHT for fault tolerance
23
Learnings
• DAG – sizing, locality & partitioning (benchmark)
• Memory sizing for the store or other memory-heavy operators
• Cardinality estimation for incoming data is critical
• Upstream operators tend to require more memory than downstream operators for high-velocity reads
• Back pressure arises from downstream failures, skew in the velocity of events, and upstream failures; Buffer Server sizing is critical (see the sketch below)
• For end-to-end exactly-once it is necessary to understand the external systems' semantics & delivery guarantees
• Think fault tolerance & recovery before starting implementation
24
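A hedged sketch (not from the deck) of the kind of memory-related attributes the sizing points above refer to, set from populateDAG; the operator (the one sketched earlier), the port and the sizes are illustrative, and the same attributes are usually set in the application configuration file rather than in code.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class SizedApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    LineLengthOperator lengths = dag.addOperator("lengths", new LineLengthOperator());

    // Give a memory-heavy operator (e.g. a dimensions store) a larger container.
    dag.setAttribute(lengths, Context.OperatorContext.MEMORY_MB, 8192);

    // Size the buffer server memory on a high-volume output port; this is where
    // back pressure from slow or failed downstream operators is absorbed.
    dag.setOutputPortAttribute(lengths.output, Context.PortContext.BUFFER_MEMORY_MB, 1024);
  }
}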
Before And After
25
Before Scenario - No Real-time (reporting latency: 5 hours + 20 minutes)
• No real-time processing system in place
• Publishers and buyers could only rely on a batch processing system for gathering relevant data
• Outdated data, not relevant to the current time
• Current data being pushed to a waiting queue
• Cumbersome batch-processing lifecycle
• No visualization for reports
• No glimpse into everyday happenings, translating to lost decisions or untimely decision-making scenarios

After Scenario, Phase 1 - Batch + Real-time (~ 30 seconds)
• With DataTorrent RTS (built on Apache Apex), the dev team put together the first real-time analytics platform
• This enabled reporting of critical metrics around campaign monetization
• Reuse of the batch ingestion mechanism for the impression data, shared with other pipelines (S3)

After Scenario, Phase 2 - Real-time Streaming
• Reduced end-to-end latency through real-time ingestion of impression data from Kafka
• Results available much sooner to the user
• Balances load (no more batch ingestion spikes), reduces resource consumption
• Handles ever-growing traffic with more efficient resource utilization
Operators used
S3 reader (File Input Operator)
• Recursively reads the contents of an S3 bucket based on a partitioning pattern
• Inclusion & exclusion support
• Fault tolerance (replay and idempotent)
• Throughput of over 12K reads/second for event size of 1.2 KB each
Kafka Input Operator
• Ability to consume from multiple Kafka clusters
• Offset management support
• Fault tolerant reads
• Support for idempotent & exactly once semantics
• Controlled reads for managing back-pressure
POJO Enrichment Operator
• Takes a POJO as input and does a look-up in a store for a given key
• Supports caching
• Stores are pluggable
• App Builder ready
26
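A hedged sketch (not from the deck) of wiring the Kafka Input Operator described above into a DAG, assuming the malhar-kafka KafkaSinglePortInputOperator and its cluster/topic/initial-offset properties; broker addresses, topic, operator and stream names are illustrative.

import org.apache.hadoop.conf.Configuration;

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.lib.io.ConsoleOutputOperator;

public class KafkaIngestApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Consume auction-log events from a Kafka cluster (the operator can also consume from multiple clusters).
    KafkaSinglePortInputOperator kafkaIn = dag.addOperator("auctionLogs", new KafkaSinglePortInputOperator());
    kafkaIn.setClusters("broker1:9092,broker2:9092");
    kafkaIn.setTopics("auction_logs");
    // Resume from checkpointed offsets on restart; fall back to earliest on a cold start.
    kafkaIn.setInitialOffset("APPLICATION_OR_EARLIEST");

    // Downstream parse/filter/aggregate operators are elided; print raw messages for illustration.
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("messages", kafkaIn.outputPort, console.input);
  }
}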
Operators used (cont …)
Parser
• Specify JSON schema
• Emits a POJO based on the output schema
• No user code required
Dimension Store
• Distributed in-memory store
• Supports re-aggregation of events
• Partitioning of aggregates per view
• Low latency query support with a pub/sub model using Kafka
HDHT
• HDFS backed embedded key-value store
• Fault tolerant, random read & write
• Durability in case of cold restarts
27
Dimensional Model - Key Concepts
Metrics : pieces of information we want to collect statistics about.
Dimensions : variables which can impact our measures.
Combinations : sets of dimensions for which one or more metrics would be aggregated. They are sub-sets of the dimensions.
Aggregations : the aggregate functions, e.g. SUM, TOPN, standard deviation.
Example :
Dimensions - campaignId, advertiserId, time
Metrics - Cost, revenue, clicks, impressions
Aggregate functions - SUM, AM, etc.
Combinations :
1. campaignId x time - cost, revenue
2. advertiser - revenue, impressions
3. campaignId x advertiser x time - revenue, clicks, impressions
How to aggregate on the combinations?
Dimensional Model
Dimensions Schema
{"keys":[{"name":"campaignId","type":"integer"},
{"name":"adId","type":"integer"},
{"name":"creativeId","type":"integer"},
{"name":"publisherId","type":"integer"},
{"name":"adOrderId","type":"integer"}],
"timeBuckets":["1h","1d"],
"values":
[{"name":"impressions","type":"integer","aggregators":["SUM"]},
{"name":"clicks","type":"integer","aggregators":["SUM"]},
{"name":"revenue","type":"integer"}],
"dimensions":
[{"combination":["campaignId","adId"]},
{"combination":["creativeId","campaignId"]},
{"combination":["campaignId"]},
{"combination":["publisherId","adOrderId","campaignId"],"additionalValues":["revenue:SUM"]}]
}
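To make the schema concrete, here is a small worked illustration (not from the deck; all field values are invented) of how two events landing in the same 1h bucket would be aggregated for the declared combinations:

Input events:
  {"campaignId":5, "adId":7, "creativeId":11, "publisherId":2, "adOrderId":9, "impressions":1, "clicks":0, "revenue":3}
  {"campaignId":5, "adId":7, "creativeId":12, "publisherId":2, "adOrderId":9, "impressions":1, "clicks":1, "revenue":2}

Resulting 1h aggregates (one row per combination key; the same aggregation is repeated for the 1d bucket):
  ["campaignId","adId"]                    key (5,7)          -> impressions SUM = 2, clicks SUM = 1
  ["creativeId","campaignId"]              keys (11,5),(12,5) -> impressions SUM = 1 each, clicks SUM = 0 and 1 respectively
  ["campaignId"]                           key (5)            -> impressions SUM = 2, clicks SUM = 1
  ["publisherId","adOrderId","campaignId"] key (2,9,5)        -> impressions SUM = 2, clicks SUM = 1, revenue SUM = 5 (revenue only here, via additionalValues)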
More Use-cases
• Real-time Monitoring
Alerts on deal tracking & monetization
Campaign & deal health
• Real-time Learning
Using the lost bid insights for price recommendations.
• Allocation Engine
Feedback to ad serving for guaranteed delivery & line item pacing
30
Data Processing Pipeline Example
App Builder
31
Monitoring Console
Logical View
32
Monitoring Console
Physical View
33
Real-Time Dashboards
Real Time Visualization
34
Q&A
35
Resources
36
• http://apex.apache.org/
• Learn more: http://apex.apache.org/docs.html
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups – http://www.meetup.com/pro/apacheapex/
• More examples: https://github.com/DataTorrent/examples
• Slideshare: http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
Editor's Notes

  • #27: Thomas – Mention these are extensions of malhar (Open Source)