engineering.deltax.com
Building a Real-time Stream Processing Pipeline
Akshay Surve, CTO DeltaX
akshay@deltax.com / @ak47surve
Hashtag: #awsblr #meetup
About Me
● 12 years
○ Shipping Ideas, Making Mistakes, GTD
○ Marathons / Hackathons / *-athon :)
● Co-founded DeltaX in 2013
○ Ad-tech / Product Startup
○ 300+ advertisers across India, APAC and US.
Agenda
● Use-case
● Processing Models
● Old Batch Processing Architecture
○ Challenges
● Goals
● Moving Blocks for a Stream Processing Model
○ Kinesis Data Firehose
○ Amazon Elasticsearch
○ Amazon Athena
● Review New Stream Processing Architecture
Use-case
● Ad Tracking & Ad Serving
● Cloud Architecture
Use-case
- Ad Tracking & Ad-serving
[Figures not preserved. Callouts: Advertiser, Event, Timestamp]
Use-case
- Cloud Architecture
Processing Models
● Batch Processing
● Stream Processing
Processing Models
● Batch Processing
Input > Batch Job(s) > Output
Processing Models
● Stream Processing
Queue > Stream Processor > Output
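The queue > stream processor > output flow can be sketched in a few lines. This is a minimal in-process illustration, not the pipeline from the talk: the event fields and the `process` step are hypothetical.

```python
import queue


def process(event):
    """Hypothetical per-event step: tag each event with a derived field."""
    return {**event, "is_click": event["type"] == "click"}


def run_stream(source, sink):
    """Consume events one at a time as they arrive, until a None sentinel."""
    while True:
        event = source.get()
        if event is None:  # sentinel: the producer is done
            break
        sink.append(process(event))


events = queue.Queue()
for e in ({"type": "impression"}, {"type": "click"}, None):
    events.put(e)

output = []
run_stream(events, output)
```

Each event is handled as soon as it is read from the queue, rather than after a whole file has been collected.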
Processing Models
● Batch vs Stream
Batch          Stream
High Latency   Low Latency
Static Files   Event Streams
Snapshot       Continuous Window
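The snapshot vs continuous window contrast can be illustrated with a small sketch. The events and the 60-second tumbling window are assumptions chosen for the example, not values from the talk.

```python
from collections import Counter

# Hypothetical events: (epoch_seconds, event_type), already in time order.
EVENTS = [(0, "click"), (20, "impression"), (65, "click"),
          (70, "click"), (130, "impression")]


def batch_count(events):
    """Batch: wait for the full input (a snapshot), then count once."""
    return Counter(kind for _, kind in events)


def windowed_counts(events, window_seconds=60):
    """Stream: emit one count per tumbling window as events arrive."""
    current_window, counts = None, Counter()
    for ts, kind in events:
        window = ts // window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete, so its result can be emitted.
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[kind] += 1
    if current_window is not None:
        yield current_window * window_seconds, dict(counts)


snapshot = batch_count(EVENTS)
windows = list(windowed_counts(EVENTS))
```

The batch result is only available once all input is in, while the windowed version produces a result for each window as soon as the next window starts.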
Batch Processing
Batch Processing (Close-up)
Batch Processing (Challenges)
● Modelled around batch processing and not stream processing
● Ingesting JSON files in bulk isn’t natural for SQL: the JSON has to be parsed into SQL tables
● Varied levels of aggregation - campaign, ad, device, geo + unique metrics
● Future roadmap items (a user-ID cookie pool across advertisers, exchange-based cookie matching, etc.) become challenges in themselves
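To see why varied aggregation levels combined with unique metrics are awkward, consider this small sketch. The events and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical tracking events; field names are illustrative only.
EVENTS = [
    {"campaign": "c1", "ad": "a1", "device": "mobile", "user": "u1"},
    {"campaign": "c1", "ad": "a1", "device": "desktop", "user": "u1"},
    {"campaign": "c1", "ad": "a2", "device": "mobile", "user": "u2"},
    {"campaign": "c2", "ad": "a3", "device": "mobile", "user": "u1"},
]


def rollup(events, dimensions):
    """Count events and unique users for each combination of `dimensions`."""
    totals = defaultdict(lambda: {"events": 0, "users": set()})
    for e in events:
        key = tuple(e[d] for d in dimensions)
        totals[key]["events"] += 1
        totals[key]["users"].add(e["user"])
    return {k: {"events": v["events"], "unique_users": len(v["users"])}
            for k, v in totals.items()}


by_campaign = rollup(EVENTS, ["campaign"])
by_campaign_device = rollup(EVENTS, ["campaign", "device"])
```

Event counts add up across levels, but unique counts do not: campaign c1 has 2 unique users overall, while its per-device unique counts (2 on mobile, 1 on desktop) sum to 3. Each aggregation level therefore needs its own pass over the raw events.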
Goals
● Stream processing as a paradigm suits our use case the best
● An easy-to-maintain or managed service in the cloud would be ideal
● Developer friendliness and peace of mind were of utmost importance
● Being able to ingest streaming data and query summaries was important
● Good to have: a way to run a batch processing framework for machine learning, data crunching, and analysis
Moving Blocks
● Amazon Athena
● Amazon Elasticsearch
● Kinesis Data Firehose
Amazon Athena
Amazon Athena
● Persistent Store
● DDL
● Query
Amazon Athena
● Persistent Store (AWS S3)
○ Text files, e.g., CSV, raw logs
○ Apache Web Logs, TSV files
○ JSON (simple, nested)
○ Compressed files
○ Columnar formats such as Apache Parquet & Apache ORC
Amazon Athena
● Persistent Store (AWS S3)
○ JSON events
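Athena's JSON SerDes expect each record on its own line, so events are typically stored in S3 as newline-delimited JSON. A minimal sketch follows; the field names mirror the Advertiser / Event / Timestamp callouts from the use-case slides, and the values are made up.

```python
import json


def to_json_lines(events):
    """Serialize events as newline-delimited JSON (one object per line)."""
    return "".join(json.dumps(e, separators=(",", ":")) + "\n" for e in events)


# Hypothetical events; the real event schema is not shown in the transcript.
events = [
    {"advertiser_id": 42, "event": "click", "timestamp": "2018-06-01T10:00:00Z"},
    {"advertiser_id": 42, "event": "impression", "timestamp": "2018-06-01T10:00:05Z"},
]
payload = to_json_lines(events)
```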
Amazon Athena
● DDL (Apache Hive)
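The DDL shown on the slide is not preserved. As a rough idea of what a Hive-style table definition over JSON files in S3 looks like, here is a sketch that builds one as a string. The table name, columns, and bucket are hypothetical; `org.openx.data.jsonserde.JsonSerDe` is one of the JSON SerDes Athena supports.

```python
def create_table_ddl(table, columns, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement for JSON data on S3."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl(
    "tracking_events",  # hypothetical table name
    [("advertiser_id", "int"), ("event", "string"), ("timestamp", "string")],
    "s3://example-bucket/events/",  # hypothetical bucket
)
```

The table is external: the statement only registers a schema over files that already sit in S3, and no data is moved or loaded.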
Amazon Athena
● Query Engine (Presto query engine)
○ In Memory
○ ANSI SQL Compliant
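The queries on the slides are not preserved. A sketch of submitting a SQL summary query through the Athena API might look like the following; the table, database, and output location are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def run_query(athena, sql, database, output_location):
    """Submit `sql` to Athena and return the query execution id."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical summary query over a hypothetical `tracking_events` table.
SQL = """
SELECT advertiser_id, event, COUNT(*) AS events
FROM tracking_events
GROUP BY advertiser_id, event
ORDER BY events DESC
"""

# With AWS credentials configured, `athena` would be boto3.client("athena").
```

Athena runs queries asynchronously, so the returned id is what a caller would poll to fetch results from the output location.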
Amazon Athena (Advantages)
● Serverless
● No spin-up time
● Query data directly from S3
● ANSI SQL
● Queries run fast
Amazon Elasticsearch
Amazon Elasticsearch
● ELK Stack (Searching, Log monitoring)
● Seamless Ingestion (Document-based model)
● Real-time queries (even during ingestion; 30s refresh window; immutability)
● Meant for search; also efficient for time-series data (we will discuss why)
Amazon Elasticsearch
- Document that gets ingested
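The document on the slide is not preserved. As an illustration only, here is a hypothetical event document and a search body that summarizes such documents per campaign. It assumes `campaign_id` and `user_id` are mapped as keyword fields; the `cardinality` aggregation returns an approximate distinct count.

```python
import json

# Hypothetical event document; the actual fields from the slide are unknown.
document = {"advertiser_id": 42, "campaign_id": "c1", "event": "click",
            "user_id": "u1", "timestamp": "2018-06-01T10:00:00Z"}


def unique_users_per_campaign(advertiser_id):
    """Search body: bucket by campaign, with an approximate unique-user count."""
    return {
        "size": 0,  # return aggregations only, no hits
        "query": {"term": {"advertiser_id": advertiser_id}},
        "aggs": {
            "by_campaign": {
                "terms": {"field": "campaign_id"},
                "aggs": {"unique_users": {"cardinality": {"field": "user_id"}}},
            }
        },
    }


body = json.dumps(unique_users_per_campaign(42))
```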
Elasticsearch (Internals)
● Elasticsearch Index
○ Inverted Index
○ Doc Values
Elasticsearch (Internals)
Deeper into an Elasticsearch Index
Elasticsearch (Internals)
● Deeper into an Elasticsearch Index - Inverted Index
○ The quick brown fox jumped over the lazy dog
○ Quick brown foxes leap over lazy dogs in summer
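The two sentences above can be turned into a toy inverted index. This is a simplified sketch: it only lowercases and splits on word characters, whereas a real analyzer may also stem terms and drop stop words.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(DOCS)
```

Looking up a term is now a single dictionary access: `index["quick"]` lists both documents, while `fox` and `foxes` remain separate terms because this sketch does no stemming.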
Elasticsearch (Internals)
Deeper into an Elasticsearch Index - Doc Values
● Stored in a column-oriented fashion, which is far more efficient for sorting and aggregations
● Filesystem optimized
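The column-oriented idea behind doc values can be sketched as a pivot from documents to per-field arrays. This is a conceptual illustration with hypothetical documents, not Elasticsearch's on-disk format.

```python
# Hypothetical documents, keyed by doc id.
DOCS = {
    0: {"campaign": "c1", "cost": 1.5},
    1: {"campaign": "c2", "cost": 0.5},
    2: {"campaign": "c1", "cost": 2.0},
}


def to_columns(docs):
    """Pivot row-oriented documents into one array per field, indexed by doc id."""
    fields = sorted({f for d in docs.values() for f in d})
    return {f: [docs[i].get(f) for i in sorted(docs)] for f in fields}


columns = to_columns(DOCS)

# Aggregating or sorting on one field now reads a single contiguous column.
total_cost = sum(columns["cost"])
docs_by_cost = sorted(range(len(columns["cost"])), key=columns["cost"].__getitem__)
```

The inverted index answers "which documents contain this term"; the column layout answers "what is this field's value for each document", which is the access pattern sorting and aggregations need.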
Amazon Elasticsearch (Advantages)
● Integration with AWS ecosystem
● Cluster Management (scale out/up)
● Monitoring & Alerts
● Snapshot Recovery / Backup to S3
● Elasticsearch Upgrades (could be made smoother)
Kinesis Data Firehose
Kinesis
Kinesis Data Firehose
Kinesis Data Firehose
● Streaming Data Processing
● Multiple destinations - S3, Redshift, ES
● Intermediate Record transformations (using AWS Lambda) before delivery to
the destination
○ Ip2location
○ Enrich flow
○ Ua-parser
● Combine with Kinesis Analytics
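A Firehose record transformation is a Lambda function that receives a batch of base64-encoded records and returns each one with a `recordId`, a `result`, and re-encoded `data`. The sketch below follows that contract; the `enrich` step is a hypothetical stand-in for enrichments such as the ones listed above, not the code used in the talk.

```python
import base64
import json


def enrich(event):
    """Hypothetical enrichment: derive a coarse device type from the user agent."""
    ua = event.get("user_agent", "")
    event["device"] = "mobile" if "Mobile" in ua else "desktop"
    return event


def handler(event, context):
    """Firehose transformation: decode, enrich, and re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            data = json.dumps(enrich(payload)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Hand unparseable records back unchanged, marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why failed records are returned rather than dropped.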
Kinesis Data Firehose (source)
Kinesis Data Firehose (transformation)
Kinesis Data Firehose (destination)
Kinesis Data Firehose (ES config options)
Kinesis Data Firehose (ES destination)
Node.js (tracker) >
Kinesis Data Firehose (Advantages)
● Cloud Offering
Source: https://blog.ippon.tech/spark-storm-sxd-comparison/
Kinesis Data Firehose (Advantages)
● Pluggability
Source: https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-analyzing-streaming-data-in-realtime-with-amazon-kinesis-analytics-bdm304
Kinesis Data Firehose (Architecture)
Architecture (Old vs New)
Stats
● Data: ~12 GB / day (peaks of 32 GB/day)
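For a sense of scale, the daily volumes can be converted to an average ingest rate. This is back-of-the-envelope arithmetic assuming decimal units and traffic spread evenly over 24 hours, which real ad traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def avg_mb_per_second(gb_per_day):
    """Average rate in MB/s for a daily volume, using 1 GB = 1000 MB."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


typical = avg_mb_per_second(12)
peak_day = avg_mb_per_second(32)
```

That works out to roughly 0.14 MB/s on a typical day and 0.37 MB/s on a peak day, averaged over the day.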
“The cloud is not a silver bullet”
silver bullet ~ noun
‘a simple and seemingly magical solution to a complicated problem’
Twitter - @ak47surve #awsblr #meetup
Email - akshay@deltax.com
Blog - engineering.deltax.com