SlideShare a Scribd company logo
Building Streaming Applications with
Apache Apex
Thomas Weise <thw@apache.org> @thweise
PMC Chair Apache Apex, Architect DataTorrent
Big Data Spain, Madrid, Nov 18th 2016
Agenda
2
• Application Development Model
• Creating Apex Application - Project Structure
• Application Developer API
• Configuration Example
• Operator Developer API
• Overview of Operator Library
• Frequently used Connectors
• Stateful Transformation & Windowing
• Scalability – Partitioning
• End-to-end Exactly Once
Application Development Model
3
▪Stream is a sequence of data tuples
▪Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in java, or built-in operator from our open source library
• Operator has many instances that run in parallel and each instance is single-threaded
▪Directed Acyclic Graph (DAG) is made up of operators and streams
Directed Acyclic Graph (DAG)
Output
Stream
Tupl
e
Tupl
e
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
Creating Apex Application Project
4
chinmay@chinmay-VirtualBox:~/src$ mvn archetype:generate -DarchetypeGroupId=org.apache.apex -
DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=LATEST -DgroupId=com.example -
Dpackage=com.example.myapexapp -DartifactId=myapexapp -Dversion=1.0-SNAPSHOT
…
…
...
Confirm properties configuration:
groupId: com.example
artifactId: myapexapp
version: 1.0-SNAPSHOT
package: com.example.myapexapp
archetypeVersion: LATEST
Y: : Y
…
…
...
[INFO] project created from Archetype in dir: /media/sf_workspace/src/myapexapp
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.141 s
[INFO] Finished at: 2016-11-15T14:06:56+05:30
[INFO] Final Memory: 18M/216M
[INFO] ------------------------------------------------------------------------
chinmay@chinmay-VirtualBox:~/src$
https://www.youtube.com/watch?v=z-eeh-tjQrc
Apex Application Project Structure
5
• pom.xml
•Defines project structure and
dependencies
•Application.java
•Defines the DAG
•RandomNumberGenerator.java
•Sample Operator
•properties.xml
•Contains operator and application
properties and attributes
•ApplicationTest.java
•Sample test to test application in local
mode
API: Compositional (low level)
6
Input Parser Counter Output
CountsWordsLines
Kafka Database
Filter
Filtered
API: Declarative (high level)
7
File
Input
Parser
Word
Counter
Console
Output
CountsWordsLines
Folder StdOut
StreamFactory.fromFolder("/tmp")
.flatMap(input -> Arrays.asList(input.split(" ")), name("Words"))
.window(new WindowOption.GlobalWindow(),
new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1))
.countByKey(input -> new Tuple.PlainTuple<>(new KeyValPair<>(input, 1L)), name("countByKey"))
.map(input -> input.getValue(), name("Counts"))
.print(name("Console"))
.populateDag(dag);
API: SQL
8
Kafka
Input
CSV
Parser
Filter CSV
Formattter
FilteredWordsLines
Kafka File
Project
Projected Line
Writer
Formatted
SQLExecEnvironment.getEnvironment()
.registerTable("ORDERS",
new KafkaEndpoint(conf.get("broker"), conf.get("topic"),
new CSVMessageFormat(conf.get("schemaInDef"))))
.registerTable("SALES",
new FileEndpoint(conf.get("destFolder"), conf.get("destFileName"),
new CSVMessageFormat(conf.get("schemaOutDef"))))
.registerFunction("APEXCONCAT", this.getClass(), "apex_concat_str")
.executeSQL(dag,
"INSERT INTO SALES " +
"SELECT STREAM ROWTIME, FLOOR(ROWTIME TO DAY), APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7) " +
"FROM ORDERS WHERE ID > 3 AND PRODUCT LIKE 'paint%'");
API: Beam
9
• Apex Runner for Apache Beam is now available!!
•Build once run-anywhere model
•Beam Streaming applications can be run on apex runner:
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
// Run with Apex runner
options.setRunner(ApexRunner.class);
Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.Read.from(options.getInput()))
.apply(new CountWords())
.apply(MapElements.via(new FormatAsTextFn()))
.apply("WriteCounts", TextIO.Write.to(options.getOutput()));
.run().waitUntilFinish();
}
API: SAMOA
10
•Build once run-anywhere model for online machine learning algorithms
•Any machine learning algorithm present in SAMOA can be run directly on Apex.
•Uses Apex Iteration Support
•Following example does classification of input data from HDFS using VHT algorithm on
Apex:
ᵒ bin/samoa apex ../SAMOA-Apex-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation
-d /tmp/dump.csv
-l (classifiers.trees.VerticalHoeffdingTree -p 1)
-s (org.apache.samoa.streams.ArffFileStream
-s HDFSFileStreamSource
-f /tmp/user/input/covtypeNorm.arff)"
Configuration (properties.xml)
11
Input Parser Counter Output
CountsWordsLines
Kafka Database
Filter
Filtered
Streaming Window
Processing Time
12
•Finite time sliced windows based on processing (event arrival) time
•Used for bookkeeping of streaming application
•Derived Windows are: Checkpoint Windows, Committed Windows
Operator API
13
Next
streaming
window
Next
streaming
window
Input Adapters - Starting of the pipeline. Interacts with external system to generate stream
Generic Operators - Processing part of pipeline
Output Adapters - Last operator in pipeline. Interacts with external system to finalize the processed stream
OutputPort::emit()
Operator Library
14
RDBMS
• JDBC
• MySQL
• Oracle
• MemSQL
NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase/ CouchDB
• Redis, MongoDB
• Geode
Messaging
• Kafka
• JMS (ActiveMQ etc.)
• Kinesis, SQS
• Flume, NiFi
File Systems
• HDFS/ Hive
• Local File
• S3
Parsers
• XML
• JSON
• CSV
• Avro
• Parquet
Transformations
• Filters, Expression, Enrich
• Windowing, Aggregation
• Join
• Dedup
Analytics
• Dimensional Aggregations
(with state management for
historical data + query)
Protocols
• HTTP
• FTP
• WebSocket
• MQTT
• SMTP
Other
• Elastic Search
• Script (JavaScript, Python, R)
• Solr
• Twitter
Frequently used Connectors
Kafka Input
15
KafkaSinglePortInputOperator KafkaSinglePortByteArrayInputOperator
Library malhar-contrib malhar-kafka
Kafka Consumer 0.8 0.9
Emit Type byte[] byte[]
Fault-Tolerance At Least Once, Exactly Once At Least Once, Exactly Once
Scalability Static and Dynamic (with Kafka
metadata)
Static and Dynamic (with Kafka metadata)
Multi-Cluster/Topic Yes Yes
Idempotent Yes Yes
Partition Strategy 1:1, 1:M 1:1, 1:M
Frequently used Connectors
Kafka Output
16
KafkaSinglePortOutputOperator KafkaSinglePortExactlyOnceOutputOperator
Library malhar-contrib malhar-kafka
Kafka Producer 0.8 0.9
Fault-Tolerance At Least Once At Least Once, Exactly Once
Scalability Static and Dynamic (with Kafka
metadata)
Static and Dynamic, Automatic Partitioning
based on Kafka metadata
Multi-Cluster/Topic Yes Yes
Idempotent Yes Yes
Partition Strategy 1:1, 1:M 1:1, 1:M
Frequently used Connectors
File Input
17
•AbstractFileInputOperator
•Used to read a file from source and
emit the content of the file to
downstream operator
•Operator is idempotent
•Supports Partitioning
•Concrete classes:
•FileLineInputOperator
•AvroFileInputOperator
•ParquetFilePOJOReader
•https://www.datatorrent.com/blog/f
ault-tolerant-file-processing/
Frequently used Connectors
File Output
18
•AbstractFileOutputOperator
•Writes data to a file
•Supports Partitions
•Exactly-once results
•Upstream operators should be
idempotent
•Concrete classes:
•StringFileOutputOperator
•https://www.datatorrent.com/blog/f
ault-tolerant-file-processing/
Windowing Support
19
• Event-time Windows
• Computation based on event-time present in the tuple
• Types of windows:
ᵒ Global : Single event-time window throughout the lifecycle of application
ᵒ Timed : Tuple is assigned to single, non-overlapping, fixed width windows
immediately followed by next window
ᵒ Sliding Time : Tuple is can be assigned to multiple, overlapping fixed width windows.
ᵒ Session : Tuple is assigned to single, variable width windows with predefined min gap
Stateful Windowed Processing
20
• WindowedOperator (org.apache.apex:malhar-library)
• Used to process data based on Event time as contrary to ingression time
• Supports windowing semantics of Apache Beam model
• Features:
ᵒ Watermarks
ᵒ Allowed Lateness
ᵒ Accumulation
ᵒ Accumulation Modes: Accumulating, Discarding, Accumulating & Retracting
ᵒ Triggers
• Storage
ᵒ In memory (checkpointed)
ᵒ Managed State
Stateful Windowed Processing
Compositional API
21
@Override
public void populateDAG(DAG dag, Configuration configuration)
{
WordGenerator inputOperator = new WordGenerator();
KeyedWindowedOperatorImpl windowedOperator = new KeyedWindowedOperatorImpl();
Accumulation<Long, MutableLong, Long> sum = new SumAccumulation();
windowedOperator.setAccumulation(sum);
windowedOperator.setDataStorage(new InMemoryWindowedKeyedStorage<String, MutableLong>());
windowedOperator.setRetractionStorage(new InMemoryWindowedKeyedStorage<String, Long>());
windowedOperator.setWindowStateStorage(new InMemoryWindowedStorage<WindowState>());
windowedOperator.setWindowOption(new WindowOption.TimeWindows(Duration.standardMinutes(1)));
windowedOperator.setTriggerOption(TriggerOption.AtWatermark()
.withEarlyFiringsAtEvery(Duration.millis(1000))
.accumulatingAndRetractingFiredPanes());
windowedOperator.setAllowedLateness(Duration.millis(14000));
ConsoleOutputOperator outputOperator = new ConsoleOutputOperator();
dag.addOperator("inputOperator", inputOperator);
dag.addOperator("windowedOperator", windowedOperator);
dag.addOperator("outputOperator", outputOperator);
dag.addStream("input_windowed", inputOperator.output, windowedOperator.input);
dag.addStream("windowed_output", windowedOperator.output, outputOperator.input);
}
Stateful Windowed Processing
Declarative API
22
StreamFactory.fromFolder("/tmp")
.flatMap(input -> Arrays.asList(input.split(" ")), name("ExtractWords"))
.map(input -> new TimestampedTuple<>(System.currentTimeMillis(), input), name("AddTimestampFn"))
.window(new TimeWindows(Duration.standardMinutes(WINDOW_SIZE)),
new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1))
.countByKey(input -> new TimestampedTuple<>(input.getTimestamp(), new KeyValPair<>(input.getValue(),
1L))), name("countWords"))
.map(new FormatAsTableRowFn(), name("FormatAsTableRowFn"))
.print(name("console"))
.populateDag(dag);
Scalability - Partitioning
23
0 1 2 3
Logical DAG
0 1 2 U
Physical DAG
1
1 2
2
3
Parallel
Partitions
M x N
Partitions
OR
Shuffle
<configuration>
<property>
<name>dt.operator.1.attr.PARTITIONER</name>
<value>com.datatorrent.common.partitioner.StatelessPartitioner:3</value>
</property>
<property>
<name>dt.operator.2.port.inputPortName.attr.PARTITION_PARALLEL</name>
<value>true</value>
</property>
</configuration>
How tuples are split between partitions
24
• Tuple hashcode and mask used to determine destination partition
ᵒ Mask picks the last n bits of the hashcode of the tuple
ᵒ hashcode method can be overridden
• StreamCodec can be used to specify custom hashcode for tuples
ᵒ Can also be used for specifying custom serialization
tuple: {
Name,
24204842, San
Jose
}
Hashcode:
0010101000101
01
Mask (0x11) Partition
00 1
01 2
10 3
11 4
End-to-End Exactly-Once
25
Input Counter Store
Aggregate
CountsWords
Kafka Database
● Input
○ Uses com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator
○ Emits words as a stream
○ Operator is idempotent
● Counter
○ com.datatorrent.lib.algo.UniqueCounter
● Store
○ Uses CountStoreOperator
○ Inserts into JDBC
○ Exactly-once results (End-To-End Exactly-once = At-least-once + Idempotency + Consistent State)
https://github.com/DataTorrent/examples/blob/master/tutorials/exactly-once
https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
End-to-End Exactly-Once (contd.)
26
Input Counter Store
Aggregate
CountsWords
Kafka Database
public static class CountStoreOperator extends AbstractJdbcTransactionableOutputOperator<KeyValPair<String, Integer>>
{
public static final String SQL =
"MERGE INTO words USING (VALUES ?, ?) I (word, wcount)"
+ " ON (words.word=I.word)"
+ " WHEN MATCHED THEN UPDATE SET words.wcount = words.wcount + I.wcount"
+ " WHEN NOT MATCHED THEN INSERT (word, wcount) VALUES (I.word, I.wcount)";
@Override
protected String getUpdateCommand()
{
return SQL;
}
@Override
protected void setStatementParameters(PreparedStatement statement, KeyValPair<String, Integer> tuple) throws SQLException
{
statement.setString(1, tuple.getKey());
statement.setInt(2, tuple.getValue());
}
}
End-to-End Exactly-Once (contd.)
27
https://www.datatorrent.com/blog/fault-tolerant-file-processing/
Who is using Apex?
28
• Powered by Apex
ᵒ http://apex.apache.org/powered-by-apex.html
ᵒ Also using Apex? Let us know to be added: users@apex.apache.org or @ApacheApex
• Pubmatic
ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0
ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-
using-apache-apex-hadoop
• SilverSpring Networks
ᵒ https://www.youtube.com/watch?v=8VORISKeSjI
ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-
silver-spring-networks
Resources
29
• http://apex.apache.org/
• Learn more - http://apex.apache.org/docs.html
• Getting involved - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Follow @ApacheApex - https://twitter.com/apacheapex
• Meetups - https://www.meetup.com/topics/apache-apex/
• Examples - https://github.com/DataTorrent/examples
• Slideshare - http://www.slideshare.net/ApacheApex/presentations
• https://www.youtube.com/results?search_query=apache+apex
• Free Enterprise License for Startups -
https://www.datatorrent.com/product/startup-accelerator/
Q&A
30

More Related Content

What's hot (20)

PDF
The good, the bad, and the ugly of Java API design
Miro Cupak
 
PPTX
What’s expected in Spring 5
Gal Marder
 
PDF
The good, the bad, and the ugly of Java API design
Miro Cupak
 
PDF
Micro-Benchmarking Considered Harmful
Thomas Wuerthinger
 
PDF
Production Readiness Testing At Salesforce Using Spark MLlib
Spark Summit
 
PDF
SSR: Structured Streaming for R and Machine Learning
felixcss
 
PPTX
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Formulatedby
 
PDF
Scalable Data Science in Python and R on Apache Spark
felixcss
 
PPTX
Evolving Streaming Applications
DataWorks Summit
 
PDF
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward
 
PPTX
Chicago Flink Meetup: Flink's streaming architecture
Robert Metzger
 
PPTX
ORC improvement in Apache Spark 2.3
DataWorks Summit
 
PDF
Fast and Reliable Apache Spark SQL Engine
Databricks
 
PDF
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
Databricks
 
PDF
Getting started with Apollo Client and GraphQL
Morgan Dedmon
 
PDF
Graal Tutorial at CGO 2015 by Christian Wimmer
Thomas Wuerthinger
 
PDF
Dependency Injection in Apache Spark Applications
Databricks
 
PDF
Scaling Machine Learning To Billions Of Parameters
Jen Aman
 
PPTX
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
 
PDF
Introduction to MLflow
Databricks
 
The good, the bad, and the ugly of Java API design
Miro Cupak
 
What’s expected in Spring 5
Gal Marder
 
The good, the bad, and the ugly of Java API design
Miro Cupak
 
Micro-Benchmarking Considered Harmful
Thomas Wuerthinger
 
Production Readiness Testing At Salesforce Using Spark MLlib
Spark Summit
 
SSR: Structured Streaming for R and Machine Learning
felixcss
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Formulatedby
 
Scalable Data Science in Python and R on Apache Spark
felixcss
 
Evolving Streaming Applications
DataWorks Summit
 
Flink Forward SF 2017: Dean Wampler - Streaming Deep Learning Scenarios with...
Flink Forward
 
Chicago Flink Meetup: Flink's streaming architecture
Robert Metzger
 
ORC improvement in Apache Spark 2.3
DataWorks Summit
 
Fast and Reliable Apache Spark SQL Engine
Databricks
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
Databricks
 
Getting started with Apollo Client and GraphQL
Morgan Dedmon
 
Graal Tutorial at CGO 2015 by Christian Wimmer
Thomas Wuerthinger
 
Dependency Injection in Apache Spark Applications
Databricks
 
Scaling Machine Learning To Billions Of Parameters
Jen Aman
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
 
Introduction to MLflow
Databricks
 

Similar to BigDataSpain 2016: Stream Processing Applications with Apache Apex (20)

PDF
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Big Data Spain
 
PDF
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Apex
 
PPTX
Intro to Apache Apex @ Women in Big Data
Apache Apex
 
PDF
Developing streaming applications with apache apex (strata + hadoop world)
Apache Apex
 
PDF
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
PPTX
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Dataconomy Media
 
PPTX
Big Data Berlin v8.0 Stream Processing with Apache Apex
Apache Apex
 
PPTX
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
PPTX
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Apache Apex
 
PPTX
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Apache Apex
 
PDF
Introduction to Apache Apex by Thomas Weise
Big Data Spain
 
PPTX
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
PPTX
Introduction to Apache Apex
Apache Apex
 
PPTX
Architectual Comparison of Apache Apex and Spark Streaming
Apache Apex
 
PPTX
Deep Dive into Apache Apex App Development
Apache Apex
 
PPTX
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
PPTX
Apache Apex: Stream Processing Architecture and Applications
Thomas Weise
 
PPTX
Apache Apex: Stream Processing Architecture and Applications
Comsysto Reply GmbH
 
PPTX
Next Gen Big Data Analytics with Apache Apex
DataWorks Summit/Hadoop Summit
 
PDF
Streaming architecture patterns
hadooparchbook
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Big Data Spain
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Apex
 
Intro to Apache Apex @ Women in Big Data
Apache Apex
 
Developing streaming applications with apache apex (strata + hadoop world)
Apache Apex
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Dataconomy Media
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Apache Apex
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Apache Apex
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Apache Apex
 
Introduction to Apache Apex by Thomas Weise
Big Data Spain
 
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
Introduction to Apache Apex
Apache Apex
 
Architectual Comparison of Apache Apex and Spark Streaming
Apache Apex
 
Deep Dive into Apache Apex App Development
Apache Apex
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Apache Apex: Stream Processing Architecture and Applications
Thomas Weise
 
Apache Apex: Stream Processing Architecture and Applications
Comsysto Reply GmbH
 
Next Gen Big Data Analytics with Apache Apex
DataWorks Summit/Hadoop Summit
 
Streaming architecture patterns
hadooparchbook
 
Ad

Recently uploaded (20)

PDF
Why aren't you using FME Flow's CPU Time?
Safe Software
 
PPSX
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
PPTX
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Pitch ...
Michele Kryston
 
DOCX
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
PDF
LLM Search Readiness Audit - Dentsu x SEO Square - June 2025.pdf
Nick Samuel
 
PDF
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
PDF
Automating the Geo-Referencing of Historic Aerial Photography in Flanders
Safe Software
 
PDF
Open Source Milvus Vector Database v 2.6
Zilliz
 
PDF
Plugging AI into everything: Model Context Protocol Simplified.pdf
Abati Adewale
 
PPTX
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
PPTX
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Poster...
Michele Kryston
 
PDF
Enhancing Environmental Monitoring with Real-Time Data Integration: Leveragin...
Safe Software
 
PDF
Unlocking FME Flow’s Potential: Architecture Design for Modern Enterprises
Safe Software
 
PDF
The Future of Product Management in AI ERA.pdf
Alyona Owens
 
PDF
From Chatbot to Destroyer of Endpoints - Can ChatGPT Automate EDR Bypasses (1...
Priyanka Aash
 
PDF
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
PDF
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
PDF
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
ScyllaDB
 
PPTX
Enabling the Digital Artisan – keynote at ICOCI 2025
Alan Dix
 
PDF
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
ScyllaDB
 
Why aren't you using FME Flow's CPU Time?
Safe Software
 
Usergroup - OutSystems Architecture.ppsx
Kurt Vandevelde
 
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Pitch ...
Michele Kryston
 
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
LLM Search Readiness Audit - Dentsu x SEO Square - June 2025.pdf
Nick Samuel
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Automating the Geo-Referencing of Historic Aerial Photography in Flanders
Safe Software
 
Open Source Milvus Vector Database v 2.6
Zilliz
 
Plugging AI into everything: Model Context Protocol Simplified.pdf
Abati Adewale
 
Smarter Governance with AI: What Every Board Needs to Know
OnBoard
 
MARTSIA: A Tool for Confidential Data Exchange via Public Blockchain - Poster...
Michele Kryston
 
Enhancing Environmental Monitoring with Real-Time Data Integration: Leveragin...
Safe Software
 
Unlocking FME Flow’s Potential: Architecture Design for Modern Enterprises
Safe Software
 
The Future of Product Management in AI ERA.pdf
Alyona Owens
 
From Chatbot to Destroyer of Endpoints - Can ChatGPT Automate EDR Bypasses (1...
Priyanka Aash
 
Salesforce Summer '25 Release Frenchgathering.pptx.pdf
yosra Saidani
 
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
ScyllaDB
 
Enabling the Digital Artisan – keynote at ICOCI 2025
Alan Dix
 
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
ScyllaDB
 
Ad

BigDataSpain 2016: Stream Processing Applications with Apache Apex

  • 1. Building Streaming Applications with Apache Apex Thomas Weise <[email protected]> @thweise PMC Chair Apache Apex, Architect DataTorrent Big Data Spain, Madrid, Nov 18th 2016
  • 2. Agenda 2 • Application Development Model • Creating Apex Application - Project Structure • Application Developer API • Configuration Example • Operator Developer API • Overview of Operator Library • Frequently used Connectors • Stateful Transformation & Windowing • Scalability – Partitioning • End-to-end Exactly Once
  • 3. Application Development Model 3 ▪Stream is a sequence of data tuples ▪Operator takes one or more input streams, performs computations & emits one or more output streams • Each Operator is YOUR custom business logic in java, or built-in operator from our open source library • Operator has many instances that run in parallel and each instance is single-threaded ▪Directed Acyclic Graph (DAG) is made up of operators and streams Directed Acyclic Graph (DAG) Output Stream Tupl e Tupl e er Operator er Operator er Operator er Operator er Operator er Operator
  • 4. Creating Apex Application Project 4 chinmay@chinmay-VirtualBox:~/src$ mvn archetype:generate -DarchetypeGroupId=org.apache.apex - DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=LATEST -DgroupId=com.example - Dpackage=com.example.myapexapp -DartifactId=myapexapp -Dversion=1.0-SNAPSHOT … … ... Confirm properties configuration: groupId: com.example artifactId: myapexapp version: 1.0-SNAPSHOT package: com.example.myapexapp archetypeVersion: LATEST Y: : Y … … ... [INFO] project created from Archetype in dir: /media/sf_workspace/src/myapexapp [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 13.141 s [INFO] Finished at: 2016-11-15T14:06:56+05:30 [INFO] Final Memory: 18M/216M [INFO] ------------------------------------------------------------------------ chinmay@chinmay-VirtualBox:~/src$ https://www.youtube.com/watch?v=z-eeh-tjQrc
  • 5. Apex Application Project Structure 5 • pom.xml •Defines project structure and dependencies •Application.java •Defines the DAG •RandomNumberGenerator.java •Sample Operator •properties.xml •Contains operator and application properties and attributes •ApplicationTest.java •Sample test to test application in local mode
  • 6. API: Compositional (low level) 6 Input Parser Counter Output CountsWordsLines Kafka Database Filter Filtered
  • 7. API: Declarative (high level) 7 File Input Parser Word Counter Console Output CountsWordsLines Folder StdOut StreamFactory.fromFolder("/tmp") .flatMap(input -> Arrays.asList(input.split(" ")), name("Words")) .window(new WindowOption.GlobalWindow(), new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1)) .countByKey(input -> new Tuple.PlainTuple<>(new KeyValPair<>(input, 1L)), name("countByKey")) .map(input -> input.getValue(), name("Counts")) .print(name("Console")) .populateDag(dag);
  • 8. API: SQL 8 Kafka Input CSV Parser Filter CSV Formattter FilteredWordsLines Kafka File Project Projected Line Writer Formatted SQLExecEnvironment.getEnvironment() .registerTable("ORDERS", new KafkaEndpoint(conf.get("broker"), conf.get("topic"), new CSVMessageFormat(conf.get("schemaInDef")))) .registerTable("SALES", new FileEndpoint(conf.get("destFolder"), conf.get("destFileName"), new CSVMessageFormat(conf.get("schemaOutDef")))) .registerFunction("APEXCONCAT", this.getClass(), "apex_concat_str") .executeSQL(dag, "INSERT INTO SALES " + "SELECT STREAM ROWTIME, FLOOR(ROWTIME TO DAY), APEXCONCAT('OILPAINT', SUBSTRING(PRODUCT, 6, 7) " + "FROM ORDERS WHERE ID > 3 AND PRODUCT LIKE 'paint%'");
  • 9. API: Beam 9 • Apex Runner for Apache Beam is now available!! •Build once run-anywhere model •Beam Streaming applications can be run on apex runner: public static void main(String[] args) { Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class); // Run with Apex runner options.setRunner(ApexRunner.class); Pipeline p = Pipeline.create(options); p.apply("ReadLines", TextIO.Read.from(options.getInput())) .apply(new CountWords()) .apply(MapElements.via(new FormatAsTextFn())) .apply("WriteCounts", TextIO.Write.to(options.getOutput())); .run().waitUntilFinish(); }
  • 10. API: SAMOA 10 •Build once run-anywhere model for online machine learning algorithms •Any machine learning algorithm present in SAMOA can be run directly on Apex. •Uses Apex Iteration Support •Following example does classification of input data from HDFS using VHT algorithm on Apex: ᵒ bin/samoa apex ../SAMOA-Apex-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -l (classifiers.trees.VerticalHoeffdingTree -p 1) -s (org.apache.samoa.streams.ArffFileStream -s HDFSFileStreamSource -f /tmp/user/input/covtypeNorm.arff)"
  • 11. Configuration (properties.xml) 11 Input Parser Counter Output CountsWordsLines Kafka Database Filter Filtered
  • 12. Streaming Window Processing Time 12 •Finite time sliced windows based on processing (event arrival) time •Used for bookkeeping of streaming application •Derived Windows are: Checkpoint Windows, Committed Windows
  • 13. Operator API 13 Next streaming window Next streaming window Input Adapters - Starting of the pipeline. Interacts with external system to generate stream Generic Operators - Processing part of pipeline Output Adapters - Last operator in pipeline. Interacts with external system to finalize the processed stream OutputPort::emit()
  • 14. Operator Library 14 RDBMS • JDBC • MySQL • Oracle • MemSQL NoSQL • Cassandra, HBase • Aerospike, Accumulo • Couchbase/ CouchDB • Redis, MongoDB • Geode Messaging • Kafka • JMS (ActiveMQ etc.) • Kinesis, SQS • Flume, NiFi File Systems • HDFS/ Hive • Local File • S3 Parsers • XML • JSON • CSV • Avro • Parquet Transformations • Filters, Expression, Enrich • Windowing, Aggregation • Join • Dedup Analytics • Dimensional Aggregations (with state management for historical data + query) Protocols • HTTP • FTP • WebSocket • MQTT • SMTP Other • Elastic Search • Script (JavaScript, Python, R) • Solr • Twitter
  • 15. Frequently used Connectors Kafka Input 15 KafkaSinglePortInputOperator KafkaSinglePortByteArrayInputOperator Library malhar-contrib malhar-kafka Kafka Consumer 0.8 0.9 Emit Type byte[] byte[] Fault-Tolerance At Least Once, Exactly Once At Least Once, Exactly Once Scalability Static and Dynamic (with Kafka metadata) Static and Dynamic (with Kafka metadata) Multi-Cluster/Topic Yes Yes Idempotent Yes Yes Partition Strategy 1:1, 1:M 1:1, 1:M
  • 16. Frequently used Connectors Kafka Output 16 KafkaSinglePortOutputOperator KafkaSinglePortExactlyOnceOutputOperator Library malhar-contrib malhar-kafka Kafka Producer 0.8 0.9 Fault-Tolerance At Least Once At Least Once, Exactly Once Scalability Static and Dynamic (with Kafka metadata) Static and Dynamic, Automatic Partitioning based on Kafka metadata Multi-Cluster/Topic Yes Yes Idempotent Yes Yes Partition Strategy 1:1, 1:M 1:1, 1:M
  • 17. Frequently used Connectors File Input 17 •AbstractFileInputOperator •Used to read a file from source and emit the content of the file to downstream operator •Operator is idempotent •Supports Partitioning •Concrete classes: •FileLineInputOperator •AvroFileInputOperator •ParquetFilePOJOReader •https://www.datatorrent.com/blog/f ault-tolerant-file-processing/
  • 18. Frequently used Connectors File Output 18 •AbstractFileOutputOperator •Writes data to a file •Supports Partitions •Exactly-once results •Upstream operators should be idempotent •Concrete classes: •StringFileOutputOperator •https://www.datatorrent.com/blog/f ault-tolerant-file-processing/
  • 19. Windowing Support 19 • Event-time Windows • Computation based on event-time present in the tuple • Types of windows: ᵒ Global : Single event-time window throughout the lifecycle of application ᵒ Timed : Tuple is assigned to single, non-overlapping, fixed width windows immediately followed by next window ᵒ Sliding Time : Tuple is can be assigned to multiple, overlapping fixed width windows. ᵒ Session : Tuple is assigned to single, variable width windows with predefined min gap
  • 20. Stateful Windowed Processing 20 • WindowedOperator (org.apache.apex:malhar-library) • Used to process data based on Event time as contrary to ingression time • Supports windowing semantics of Apache Beam model • Features: ᵒ Watermarks ᵒ Allowed Lateness ᵒ Accumulation ᵒ Accumulation Modes: Accumulating, Discarding, Accumulating & Retracting ᵒ Triggers • Storage ᵒ In memory (checkpointed) ᵒ Managed State
  • 21. Stateful Windowed Processing Compositional API 21 @Override public void populateDAG(DAG dag, Configuration configuration) { WordGenerator inputOperator = new WordGenerator(); KeyedWindowedOperatorImpl windowedOperator = new KeyedWindowedOperatorImpl(); Accumulation<Long, MutableLong, Long> sum = new SumAccumulation(); windowedOperator.setAccumulation(sum); windowedOperator.setDataStorage(new InMemoryWindowedKeyedStorage<String, MutableLong>()); windowedOperator.setRetractionStorage(new InMemoryWindowedKeyedStorage<String, Long>()); windowedOperator.setWindowStateStorage(new InMemoryWindowedStorage<WindowState>()); windowedOperator.setWindowOption(new WindowOption.TimeWindows(Duration.standardMinutes(1))); windowedOperator.setTriggerOption(TriggerOption.AtWatermark() .withEarlyFiringsAtEvery(Duration.millis(1000)) .accumulatingAndRetractingFiredPanes()); windowedOperator.setAllowedLateness(Duration.millis(14000)); ConsoleOutputOperator outputOperator = new ConsoleOutputOperator(); dag.addOperator("inputOperator", inputOperator); dag.addOperator("windowedOperator", windowedOperator); dag.addOperator("outputOperator", outputOperator); dag.addStream("input_windowed", inputOperator.output, windowedOperator.input); dag.addStream("windowed_output", windowedOperator.output, outputOperator.input); }
  • 22. Stateful Windowed Processing Declarative API 22 StreamFactory.fromFolder("/tmp") .flatMap(input -> Arrays.asList(input.split(" ")), name("ExtractWords")) .map(input -> new TimestampedTuple<>(System.currentTimeMillis(), input), name("AddTimestampFn")) .window(new TimeWindows(Duration.standardMinutes(WINDOW_SIZE)), new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1)) .countByKey(input -> new TimestampedTuple<>(input.getTimestamp(), new KeyValPair<>(input.getValue(), 1L))), name("countWords")) .map(new FormatAsTableRowFn(), name("FormatAsTableRowFn")) .print(name("console")) .populateDag(dag);
  • 23. Scalability - Partitioning 23 0 1 2 3 Logical DAG 0 1 2 U Physical DAG 1 1 2 2 3 Parallel Partitions M x N Partitions OR Shuffle <configuration> <property> <name>dt.operator.1.attr.PARTITIONER</name> <value>com.datatorrent.common.partitioner.StatelessPartitioner:3</value> </property> <property> <name>dt.operator.2.port.inputPortName.attr.PARTITION_PARALLEL</name> <value>true</value> </property> </configuration>
  • 24. How tuples are split between partitions 24 • Tuple hashcode and mask used to determine destination partition ᵒ Mask picks the last n bits of the hashcode of the tuple ᵒ hashcode method can be overridden • StreamCodec can be used to specify custom hashcode for tuples ᵒ Can also be used for specifying custom serialization tuple: { Name, 24204842, San Jose } Hashcode: 0010101000101 01 Mask (0x11) Partition 00 1 01 2 10 3 11 4
  • 25. End-to-End Exactly-Once 25 Input Counter Store Aggregate CountsWords Kafka Database ● Input ○ Uses com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator ○ Emits words as a stream ○ Operator is idempotent ● Counter ○ com.datatorrent.lib.algo.UniqueCounter ● Store ○ Uses CountStoreOperator ○ Inserts into JDBC ○ Exactly-once results (End-To-End Exactly-once = At-least-once + Idempotency + Consistent State) https://github.com/DataTorrent/examples/blob/master/tutorials/exactly-once https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
  • 26. End-to-End Exactly-Once (contd.) 26 Input Counter Store Aggregate CountsWords Kafka Database public static class CountStoreOperator extends AbstractJdbcTransactionableOutputOperator<KeyValPair<String, Integer>> { public static final String SQL = "MERGE INTO words USING (VALUES ?, ?) I (word, wcount)" + " ON (words.word=I.word)" + " WHEN MATCHED THEN UPDATE SET words.wcount = words.wcount + I.wcount" + " WHEN NOT MATCHED THEN INSERT (word, wcount) VALUES (I.word, I.wcount)"; @Override protected String getUpdateCommand() { return SQL; } @Override protected void setStatementParameters(PreparedStatement statement, KeyValPair<String, Integer> tuple) throws SQLException { statement.setString(1, tuple.getKey()); statement.setInt(2, tuple.getValue()); } }
  • 28. Who is using Apex? 28 • Powered by Apex ᵒ http://apex.apache.org/powered-by-apex.html ᵒ Also using Apex? Let us know to be added: [email protected] or @ApacheApex • Pubmatic ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8 • GE ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0 ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service- using-apache-apex-hadoop • SilverSpring Networks ᵒ https://www.youtube.com/watch?v=8VORISKeSjI ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by- silver-spring-networks
  • 29. Resources 29 • http://apex.apache.org/ • Learn more - http://apex.apache.org/docs.html • Getting involved - http://apex.apache.org/community.html • Download - http://apex.apache.org/downloads.html • Follow @ApacheApex - https://twitter.com/apacheapex • Meetups - https://www.meetup.com/topics/apache-apex/ • Examples - https://github.com/DataTorrent/examples • Slideshare - http://www.slideshare.net/ApacheApex/presentations • https://www.youtube.com/results?search_query=apache+apex • Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/