Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
joey@xpertsea.com
Case Study: Stream Processing on AWS using Kappa Architecture
● 4.3B people depend on fish for key protein
● 50% of all fish protein comes from farming
● 2x more fish than any other animal protein
Feed Conversion and Water Footprint
[Chart, four bars per series]
Water for 1 lb of meat (gallons): 1857, 756, 469, 3.8
Feed for 1 lb of meat (lbs): 6.8, 2.9, 1.7, 1.1
-50%: lost to disease and poor management for a typical shrimp farmer
Aquaculture technology gap today
● Manual sampling
● Visual inspection
● Non-digital records
Aqua Farming Analytics Are a Tool for Change
The XpertSea Platform
● IoT Devices: collect data
● Data: ingest, store, process, serve data
● SaaS: consume data
● Machine Learning / AI: annotate data, train models
Aquaculture Data
Animals
Water quality
Genetics
Ocean
Production
Transactions
Location
Equipment
Feed
Weather
Diseases
Which challenges are we facing?
Extracting value out of that data
● Highly distributed
● Need for some near real-time metrics
● Large scale aggregations (region and industry wide)
● Unreliable networks
● And many more!
The CAP theorem
Consistency Availability Partition tolerance
Pick 2?
What is a data system?
Working around the CAP Theorem
● Simple equation: Query = Function(All data)
● A data system answers questions about a dataset
● Lots of complexity caused by the mutability of data
● You obviously cannot process all your data from scratch for every query
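To make "Query = Function(All data)" concrete, here is a minimal sketch in Python. It assumes an append-only log of immutable events; the `Event` fields (`pond`, `shrimp_count`) and the sample data are hypothetical and only illustrate the idea, they are not XpertSea's schema.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Event:
    """One immutable fact. Field names are hypothetical, for illustration only."""
    id: str
    timestamp: int
    pond: str
    shrimp_count: int


def query(function: Callable[[Iterable[Event]], T], all_data: Iterable[Event]) -> T:
    """Query = Function(All data): a query is a pure function of the full log."""
    return function(all_data)


def latest_count_per_pond(events: Iterable[Event]) -> dict[str, int]:
    """Example query: keep the most recent count reported for each pond."""
    latest: dict[str, Event] = {}
    for event in events:
        current = latest.get(event.pond)
        if current is None or event.timestamp >= current.timestamp:
            latest[event.pond] = event
    return {pond: event.shrimp_count for pond, event in latest.items()}


# The log is append-only: a correction is a new event, never an update in place.
LOG = (
    Event("e1", 100, "pond-a", 1200),
    Event("e2", 110, "pond-b", 900),
    Event("e3", 120, "pond-a", 1150),
)

if __name__ == "__main__":
    print(query(latest_count_per_pond, LOG))
```

Recomputing from the whole log on every query is exactly what the last bullet says you cannot afford at scale, which is why the architectures that follow cache precomputed results instead.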
Lambda Architecture
Upsides & Downsides
Strengths
● Immutability of the data lake
● More traceability
● Ensures you can make your system evolve quickly
● Designed to scale
Weaknesses
● Complexity of maintaining two layers
● It doesn’t really beat CAP, it just reduces its complexity
Kappa Architecture
XpertSea’s Deepwater platform
The AWS Services we use
● S3
● API Gateway
● SQS/SNS
● AWS Lambda
● ECS
● DynamoDB
● RDS
● CloudFormation
● CloudWatch
● IAM
● CloudFront
● Route 53
● Cognito (soon)
● SES (soon)
The core system
Data Ingestion
API Gateway → AWS Lambda → Data lake (S3) and Metadata Store (DynamoDB)
The core system
Data lake
A typical entry:
{
    "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "parent_id": "",
    "timestamp": 1517801441,
    "key": "event/0000d9f6-045d-438b-8058-4ee6447ba0fa/payload.json",
    "schema": "WorkflowV1.json",
    "type": "Workflow",
    "data": {
        "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
        "timestamp": 1517799978,
        "created_at": 1517801441,
        "created_by": "xyz"
        // ....
    }
}
● Stored in JSON for simplicity
● Metadata is copied to DynamoDB
● JSON Schema to handle validation
● Remember, all data is immutable!
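The ingestion step and the entry format above can be sketched roughly as follows. This is an illustration under assumptions, not the actual Deepwater code: `InMemoryLake` and the plain dict stand in for the S3 bucket and the DynamoDB table, and `REQUIRED_FIELDS` is a hypothetical stand-in for a real JSON Schema check against `WorkflowV1.json`.

```python
import json
import time

# Hypothetical stand-in for a JSON Schema such as WorkflowV1.json.
REQUIRED_FIELDS = ("id", "timestamp")


class InMemoryLake:
    """Stand-in for the S3 data lake. Entries are immutable, so no overwrites."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        if key in self.objects:
            raise ValueError(f"refusing to overwrite immutable entry {key}")
        self.objects[key] = body


def validate(data):
    """Reject payloads missing required fields (a real system would use JSON Schema)."""
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"payload is missing required fields: {missing}")


def ingest(data, lake, metadata_store, schema="WorkflowV1.json",
           event_type="Workflow", now=None):
    """Wrap a validated payload in an entry, store it, and copy its metadata."""
    validate(data)
    event_id = data["id"]
    entry = {
        "id": event_id,
        "parent_id": "",
        "timestamp": int(now if now is not None else time.time()),
        "key": f"event/{event_id}/payload.json",
        "schema": schema,
        "type": event_type,
        "data": data,
    }
    # Write the full entry to the lake first; this raises if the key exists.
    lake.put(entry["key"], json.dumps(entry))
    # Copy everything except the payload to the metadata store.
    metadata_store[event_id] = {k: v for k, v in entry.items() if k != "data"}
    return entry


if __name__ == "__main__":
    lake, metadata = InMemoryLake(), {}
    payload = {"id": "0000d9f6-045d-438b-8058-4ee6447ba0fa", "timestamp": 1517799978}
    print(ingest(payload, lake, metadata, now=1517801441)["key"])
```

In a Lambda function the two stand-ins would presumably be replaced by boto3 calls such as `put_object` on S3 and `put_item` on a DynamoDB table; the refusal to overwrite is what keeps the lake immutable.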
The core system
Data Processing
● Generally a preprocessor to read from the stream and compute the latest data
● A number of processors to perform the hard work
● Generally a producer to cache the results somewhere
● Chain as many as you need!
A typical message:
{
    "schema": "WorkflowV1.json",
    "data": {
        "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
        "timestamp": 1517799978,
        "created_at": 1517801441.1827073,
        "created_by": "portal",
        "computed_1": 10,
        "computed_2": 142,
        // ....
    },
    "event_ids": [
        "0000d9f6-045d-438b-8058-4ee6447ba0fa",
        "id1",
        "id2"
    ]
}
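The preprocessor, processor, and producer chain described above could look something like the sketch below. It is only an illustration: the merge rule in `preprocess` (later events override earlier fields) and the meaning of `computed_1` are assumptions, and the plain dict stands in for whatever cache the real producer writes to.

```python
from functools import reduce


def preprocess(events):
    """Read events from the stream and fold them into the latest data.

    Assumption: events are merged in timestamp order, later fields winning.
    """
    ordered = sorted(events, key=lambda event: event["timestamp"])
    data = {}
    for event in ordered:
        data.update(event["data"])
    return {
        "schema": ordered[-1]["schema"],
        "data": data,
        "event_ids": [event["id"] for event in ordered],
    }


def add_computed_1(message):
    """Example processor: hypothetical metric, the number of contributing events."""
    data = dict(message["data"], computed_1=len(message["event_ids"]))
    return {**message, "data": data}  # return a new message, never mutate


def make_producer(cache):
    """Build the final stage, which caches the result keyed by the data id."""
    def produce(message):
        cache[message["data"]["id"]] = message
        return message
    return produce


def chain(*stages):
    """Compose stages left to right: chain as many as you need."""
    return lambda message: reduce(lambda acc, stage: stage(acc), stages, message)


if __name__ == "__main__":
    cache = {}
    events = [
        {"id": "a", "timestamp": 1, "schema": "WorkflowV1.json",
         "data": {"id": "w1", "created_by": "xyz"}},
        {"id": "b", "timestamp": 2, "schema": "WorkflowV1.json",
         "data": {"id": "w1", "created_by": "portal"}},
    ]
    pipeline = chain(add_computed_1, make_producer(cache))
    print(pipeline(preprocess(events)))
```

Because each stage takes a message and returns a new one, a stage can run as a Lambda function or as a container worker without changing its logic.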
The core system
Data Processing (Serverless)
[Diagram: several AWS Lambda functions, with results stored in DynamoDB and RDS]
The core system
Data Processing (Containers)
[Diagram: Amazon SQS queues feeding ECS + Docker workers, with results stored in DynamoDB and RDS]
Of course we can mix and match!
[Diagram: AWS Lambda functions and an Amazon SQS queue feeding ECS + Docker, with results stored in DynamoDB and RDS]
The core system
Data Processing
We use a publisher/subscriber model to notify dependencies (higher-level queries or aggregations) that:
● A new result is now ready to be used
● An old value was recomputed due to additional data
A notification can trigger an immediate recalculation, or be queued to be processed as part of a batch.
SNS is the obvious choice for this!
[Diagram: AWS Lambda → Amazon SNS → AWS Lambda]
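The notification pattern above can be sketched with an in-memory stand-in for SNS. The `Notifier` class, the topic name, and the message fields are all hypothetical; the point is only to show one subscriber recalculating immediately while another queues the same notification for a later batch.

```python
from collections import defaultdict, deque


class Notifier:
    """In-memory stand-in for an SNS topic registry (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver the message to every subscriber; return how many received it."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(message)
        return len(handlers)


recalculated = []       # results refreshed right away
batch_queue = deque()   # notifications waiting for the next batch run


def recalculate_now(message):
    """Dependency that recomputes as soon as it is notified."""
    recalculated.append(message["result_id"])


def queue_for_batch(message):
    """Dependency that defers the work to a batch."""
    batch_queue.append(message)


notifier = Notifier()
notifier.subscribe("results", recalculate_now)
notifier.subscribe("results", queue_for_batch)

if __name__ == "__main__":
    # Two kinds of notification: a new result, or an old value recomputed.
    notifier.publish("results", {"result_id": "w1", "reason": "new"})
    notifier.publish("results", {"result_id": "w1", "reason": "recomputed"})
    print(recalculated, len(batch_queue))
```

With SNS itself, `publish` would presumably become a call to the topic from the producing Lambda function, and each handler would be a subscribed Lambda function or an SQS queue.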
The core system
Serving layer (API)
[Diagram: Route 53, API Gateway, AWS Lambda, DynamoDB, RDS, S3 bucket]
The core system
Serving layer (Web App)
[Diagram: Route 53, CloudFront, S3 bucket, Polymer]
How to scale?
Scaling concerns
● Multi-region to reduce latency
● New type of data? New pipeline
● Most of the system is serverless, or at least managed
● The serving data layer might need to move to DynamoDB in the future
● For now it stays in a relational DB to ease integration with our machine learning training framework
Some tools we use and love
Polymer
Python 3 with Troposphere
Tools we choose not to use
● Amazon Kinesis and Kinesis Firehose (pricing)
● AWS IoT (useful for a large number of simple devices)
● CloudWatch for log processing (we like ELK stacks better)
● Cassandra/Hadoop (too complex for now)
Resources and references
On the CAP Theorem: https://codahale.com/you-cant-sacrifice-partition-tolerance/
On Lambda Architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
On Kappa Architecture: https://www.oreilly.com/ideas/questioning-the-lambda-architecture
Q & A
Thank you!
