DEBS Grand Challenge: Continuous Analytics on Geospatial Data Streams with WSO2 Complex Event Processor
Sachini Jayasekara, Srinath Perera, Miyuru Dayarathna, Sriskandarajah Suhothayan
WSO2 Inc.
Problem
o Dataset: taxi rides collected in New York in 2013 [1]
o Each line has timestamps, start and end locations, fare details, etc.
o 13K cars, 173 million events
o 2 queries
o Queries are based on 0.5 km and 0.25 km cells over New York.
[1] Chris Whong (http://chriswhong.com/open-data/foil_nyc_taxi/)
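The mapping from a coordinate to a grid cell can be sketched as follows. This is a minimal illustration, not the authors' code: the origin and the degrees-per-500 m constants are assumptions that do not appear on the slide, and `cell_id` is a hypothetical helper name.

```python
import math

# Assumed grid parameters (not given on the slide): the origin is the centre
# of cell 1.1, and the degree offsets approximate 500 m near New York.
ORIGIN_LAT = 41.474937
ORIGIN_LON = -74.913585
DEG_LAT_PER_500M = 0.004491556
DEG_LON_PER_500M = 0.005986


def cell_id(lat, lon, cell_km=0.5):
    """Return the 1-based (column, row) of the grid cell containing a point."""
    scale = cell_km / 0.5
    lat_step = DEG_LAT_PER_500M * scale
    lon_step = DEG_LON_PER_500M * scale
    # The origin is a cell centre, so shift by half a 500 m cell to reach the
    # north-west corner of the grid before dividing.
    top = ORIGIN_LAT + DEG_LAT_PER_500M / 2
    left = ORIGIN_LON - DEG_LON_PER_500M / 2
    col = math.floor((lon - left) / lon_step) + 1
    row = math.floor((top - lat) / lat_step) + 1
    return col, row
```

Passing `cell_km=0.25` reuses the same corner with a finer step, which is one plausible way to derive the 0.25 km grid from the 0.5 km one.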
CEP Operators
1. Filters or transformations (process a single event)
   from Ball[v>10]
   select .. insert into ..
2. Windows + aggregation (track a window of events: time, length)
   from Ball#window.time(30s) select avg(v) ..
3. Joins (join two event streams into one)
   from Ball#window.time(30s) as b join Players as p on p.v < b.v
4. Patterns (state machine implementation)
   from Ball[v>10], Ball[v<10]*, Ball[v>10] select ..
5. Event tables (map a database as an event stream)
   define table HitV (v double) using .. db info ..
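To make the window operator concrete, here is a rough Python equivalent of what `from Ball#window.time(30s) select avg(v)` computes. It is a sketch of the idea only and says nothing about how WSO2 CEP implements windows; `TimeWindowAvg` is a hypothetical name.

```python
from collections import deque


class TimeWindowAvg:
    """Average of `v` over events whose timestamp lies within `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = deque()  # (timestamp, v), oldest first
        self.total = 0.0

    def add(self, ts, v):
        """Add an event and return the average over the current window."""
        self.events.append((ts, v))
        self.total += v
        # Expire events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window_s:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)
```

Keeping a running total means each event costs amortised constant time, instead of re-summing the whole window.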
Complex Event Processing
See http://goo.gl/BaPFYA for more info.
Query 1: Frequent Routes
o Output the 10 most frequent routes in the last 30 minutes
o Emit an update whenever the result changes (the current time is derived from the event's timestamp attribute)
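A naive version of Query 1 can be sketched as below. It recounts the top routes on every event, so it does not reproduce the FrequentK optimization the deck mentions later; `FrequentRoutes` and its field names are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque

WINDOW_S = 30 * 60  # 30 minutes, from the slide


class FrequentRoutes:
    def __init__(self, k=10):
        self.k = k
        self.window = deque()  # (dropoff_ts, route), oldest first
        self.counts = Counter()
        self.last_top = []

    def on_trip(self, dropoff_ts, start_cell, end_cell):
        """Process one trip; return the new top-k if it changed, else None."""
        route = (start_cell, end_cell)
        self.window.append((dropoff_ts, route))
        self.counts[route] += 1
        # Event time drives expiry: drop trips that left the window.
        while self.window[0][0] <= dropoff_ts - WINDOW_S:
            _, old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        top = [r for r, _ in self.counts.most_common(self.k)]
        if top != self.last_top:
            self.last_top = top
            return top
        return None
```

Returning `None` when the ranking is unchanged mirrors the requirement to output only on change.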
Query 2: Profitable Areas
o Find the cells that are currently most profitable for taxi drivers.
o Profitability of a cell = median (fare + tip) over the last 15 minutes, divided by the number of taxis that dropped off in the cell within the last 30 minutes and have not started a new trip since.
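The formula itself is small enough to state in code. This sketch takes the two window contents as already-computed inputs; returning `None` for an empty cell is an assumption, since the slide does not say how that case is handled.

```python
from statistics import median


def profitability(fares_and_tips, empty_taxis):
    """Median (fare + tip) of a cell divided by its number of empty taxis."""
    if not fares_and_tips or empty_taxis == 0:
        # Assumption: profitability is undefined without trips or empty taxis.
        return None
    return median(fare + tip for fare, tip in fares_and_tips) / empty_taxis
```

For example, trips of (10, 2), (20, 4) and (6, 0) give a median of 12, so with 3 empty taxis the cell scores 4.0.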
Optimizations
o WSO2 CEP
   o Object pooling
   o Only keep required attributes (e.g., in window)
o Algorithmic
   o String lookup
   o Reusing windows
   o Avoid join
   o FrequentK
   o Counting pattern
   o Median (bucket)
o Fully use the computer
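The slide only names "Median (Bucket)" without detail. One possible reading, shown here purely as an assumption, is a counting-bucket median: amounts are stored as integer bucket counts (for example cents), so adding and removing a value from the window is constant time and the median is found by scanning the counts. `BucketMedian` is a hypothetical name.

```python
class BucketMedian:
    """Median over small non-negative integers using counting buckets."""

    def __init__(self, max_value):
        self.buckets = [0] * (max_value + 1)
        self.n = 0

    def add(self, value):
        self.buckets[value] += 1
        self.n += 1

    def remove(self, value):
        self.buckets[value] -= 1
        self.n -= 1

    def _kth(self, k):
        """Return the k-th smallest value (0-based) by scanning bucket counts."""
        seen = 0
        for value, count in enumerate(self.buckets):
            seen += count
            if seen > k:
                return value
        raise ValueError("not enough values")

    def median(self):
        if self.n == 0:
            return None
        lo = self._kth((self.n - 1) // 2)
        hi = self._kth(self.n // 2)
        return (lo + hi) / 2
```

The trade-off is memory proportional to the value range rather than to the window size, which suits bounded quantities such as fares.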
Avoid Joins
o Q2 computes the median and the taxi count in parallel
o But joining the two results is expensive because of ordering
o Instead: calculate the median, enrich the event with the result, use the enriched event to calculate the empty-taxi count, then divide the median by the empty-taxi count without a join.
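The enrichment chain described above might look like the following. The stage functions and event fields are hypothetical, and window expiry is left out to keep the sketch short; the point is only that each stage attaches its result to the event, so the last stage has both operands without a join.

```python
from statistics import median


def median_stage(event, fares_by_cell):
    """Stage 1: update the cell's fares and attach the current median."""
    fares = fares_by_cell.setdefault(event["cell"], [])
    fares.append(event["fare"] + event["tip"])
    return {**event, "median": median(fares)}


def empty_taxi_stage(event, empty_by_cell):
    """Stage 2: update the cell's empty-taxi set and attach its size."""
    taxis = empty_by_cell.setdefault(event["cell"], set())
    taxis.add(event["taxi"])
    return {**event, "empty": len(taxis)}


def profit_stage(event):
    """Stage 3: both operands travel on the same event, so no join is needed."""
    return event["median"] / event["empty"]
```

Because the stages run in sequence on one event, ordering is preserved for free, which is exactly the cost the join had to pay for.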
Taxi Counting Pattern Optimizations
o The query creates a state machine to track each taxi's state and updates the counts accordingly
o This is slow with a CEP pattern because it searches all states to check for expiration
o Fixed by keeping states sorted by start time (2X improvement)
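The benefit of keeping states ordered by start time is that expiry only needs to inspect the oldest entries. A minimal sketch of that idea, assuming events arrive in timestamp order and using hypothetical names, is:

```python
from collections import OrderedDict

EXPIRY_S = 30 * 60  # empty-taxi states expire after 30 minutes


class EmptyTaxiTracker:
    def __init__(self):
        # taxi -> (dropoff_ts, cell); insertion order equals drop-off order
        self.empty = OrderedDict()
        self.count_by_cell = {}

    def _remove(self, taxi):
        _, cell = self.empty.pop(taxi)
        self.count_by_cell[cell] -= 1

    def on_pickup(self, taxi):
        """A new trip started, so the taxi is no longer empty."""
        if taxi in self.empty:
            self._remove(taxi)

    def on_dropoff(self, ts, taxi, cell):
        # Expire from the oldest end only and stop at the first live state.
        while self.empty:
            oldest, (start, _) = next(iter(self.empty.items()))
            if start > ts - EXPIRY_S:
                break
            self._remove(oldest)
        if taxi in self.empty:
            self._remove(taxi)
        self.empty[taxi] = (ts, cell)
        self.count_by_cell[cell] = self.count_by_cell.get(cell, 0) + 1
```

Each event now touches only the states that actually expire, rather than scanning every open state.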
Fully use the Computer
o So far, we have removed unnecessary operations
o Now we have to use all 4 cores of the VM
o How?
   o Data partitioning
   o Pipeline
   o Pipeline with a single buffer
Data Partition : Issues
o Need to reorder results and send timing updates
o But the savings from partitioning are small (e.g., frequentK is O(log n), and execution within a partition takes O(log(n/p)))
o All savings are lost when reordering
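A short worked example shows why the saving is small. The value of n below is illustrative only and is not taken from the slides; p = 4 matches the 4 cores mentioned earlier.

```latex
\[
\log_2\!\left(\frac{n}{p}\right) = \log_2 n - \log_2 p
\]
\[
p = 4,\; n = 2^{20}: \qquad \log_2 n = 20, \qquad \log_2\!\left(\frac{n}{p}\right) = 18
\]
```

Partitioning removes only \(\log_2 p = 2\) steps per operation regardless of n, about 10% in this example, which is easily outweighed by the cost of reordering the partitioned output.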
Execution Pipeline
o Break the different stages into a pipeline
o Now we can use 6 threads (threads 1 and 6 do I/O, so this is OK)
o 125K events/sec now, but 50 ms latency
o The bottleneck is moving events between queues
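A queue-based pipeline of this kind can be sketched in a few lines. This is a generic illustration with hypothetical names, not the WSO2 implementation; it uses one thread per stage and one queue between each pair of stages, which is where the hand-off cost mentioned above comes from.

```python
import queue
import threading

STOP = object()  # sentinel that shuts the pipeline down


def stage(fn, inbox, outbox):
    """Run `fn` on every event from `inbox` and pass the result to `outbox`."""
    while True:
        event = inbox.get()
        if event is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(event))


def run_pipeline(events, fns):
    """Push events through one thread per stage, connected by queues."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for event in events:
        queues[0].put(event)
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)
```

Every event is enqueued and dequeued once per stage, so the per-event overhead grows with the number of stages.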
Circular Buffer based Pipeline
o One circular buffer with sequence barriers, using the LMAX Disruptor
o Avoids the cost of moving events, reduces GC, and works well with the cache
o 2X more throughput and 0 ms latency
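The sequence-barrier idea can be illustrated without the Disruptor library itself. The sketch below is a single-threaded simulation with hypothetical names and does not use or mirror the LMAX Disruptor API: events stay in one ring, each stage updates slots in place, and a stage may only advance as far as the stage before it.

```python
class RingPipeline:
    """One shared ring; each stage may only advance up to the stage before it."""

    def __init__(self, size, stages):
        self.size = size
        self.slots = [None] * size
        self.stages = stages              # at least one stage is assumed
        self.published = 0                # next sequence the producer writes
        self.cursors = [0] * len(stages)  # next sequence each stage processes

    def publish(self, event):
        # The producer is gated by the last stage, so it never overwrites a
        # slot that has not been fully processed.
        if self.published - self.cursors[-1] >= self.size:
            return False
        self.slots[self.published % self.size] = event
        self.published += 1
        return True

    def step(self, i):
        """Let stage i process every slot its barrier allows, in place."""
        barrier = self.published if i == 0 else self.cursors[i - 1]
        while self.cursors[i] < barrier:
            slot = self.cursors[i] % self.size
            self.slots[slot] = self.stages[i](self.slots[slot])
            self.cursors[i] += 1
```

Since events never move between queues, the only shared state is the set of sequence counters, which is what removes the hand-off cost and most of the allocation.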
Results
o Pretty good on real HW (8 cores) and AWS (4 cores), but not as good on VirtualBox (4 cores)
o Can run with a 512 MB heap with only a 10% slowdown
Results: Speedup vs. Concurrency
o Compared against the single-node version
o Real HW scaled well, AWS scaled less, and the VM's scale-up was very small
Results: Latency vs. Throughput
o Each point is (environment, thread count, buffer size)
Conclusion
o All changes except the final circular buffer are in WSO2 CEP 4.0 (released 2015 Q3)
o WSO2 CEP is free and available under the Apache open source licence
o Fast and flexible, and already used in many critical use cases.
