Case Study: Stream Processing on AWS using Kappa Architecture

Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
joey@xpertsea.com

4.3B people depend on fish for key protein
50% of all fish protein comes from farming
2x more fish than any other animal protein

Feed Conversion and Water Footprint
[Chart: water for 1 lb of meat (gallons) and feed for 1 lb of meat (lbs). Data labels in source order: 1857, 6.8, 756, 2.9, 469, 1.7, 3.8, 1.1. Category labels were not preserved.]

-50%: lost to disease and poor management for a typical shrimp farmer

Aquaculture technology gap today
● Manual sampling
● Visual inspection
● Non-digital records

Aqua Farming Analytics
Are a Tool for Change

The XpertSea Platform
● IoT Devices: collect data
● Data: ingest, store, process, serve data
● SaaS: consume data
● Machine Learning / AI: annotate data, train models

Aquaculture Data
Animals
Water quality
Genetics
Ocean
Production
Transactions
Location
Equipment
Feed
Weather
Diseases

Which challenges are we facing?
Extracting value out of that data
● Highly distributed
● Need for some near real-time metrics
● Large scale aggregations (region and industry wide)
● Unreliable networks
● And many more!

The CAP theorem
Consistency Availability Partition tolerance
Pick 2?

What is a data system?
Working around the CAP Theorem
● Simple equation, Query = Function(All data)
● A data system answers questions about a dataset
● Lots of complexity caused by the mutability of data
● You obviously cannot process all your data from scratch for every query
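
A minimal sketch of the "Query = Function(All data)" idea, in Python. The event shape, the `pond_id` and `weight_g` fields, and both the function and class names are hypothetical, chosen only for illustration. The function recomputes the answer from the full immutable log on every call; the class keeps a running view that folds each event in once, which is one way to avoid reprocessing everything for each query.

```python
from collections import defaultdict

# Hypothetical immutable event log: events are appended once and never edited.
EVENTS = [
    {"pond_id": "A", "weight_g": 10.0},
    {"pond_id": "B", "weight_g": 12.0},
    {"pond_id": "A", "weight_g": 14.0},
]


def average_weight_from_scratch(events, pond_id):
    """Query = Function(All data): scan every event on every call."""
    weights = [e["weight_g"] for e in events if e["pond_id"] == pond_id]
    return sum(weights) / len(weights) if weights else None


class AverageWeightView:
    """Precomputed view: fold each event in once, then answer from the totals."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def apply(self, event):
        self._sum[event["pond_id"]] += event["weight_g"]
        self._count[event["pond_id"]] += 1

    def query(self, pond_id):
        count = self._count.get(pond_id, 0)
        return self._sum[pond_id] / count if count else None


view = AverageWeightView()
for event in EVENTS:
    view.apply(event)

# Both approaches agree; only the cost per query differs.
assert view.query("A") == average_weight_from_scratch(EVENTS, "A")
```

Because the log is never mutated, the view can always be thrown away and rebuilt by replaying the events in order.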

Upsides & Downsides
Strengths
● Immutability of the data lake
● More traceability
● Ensures your system can evolve quickly
● Designed to scale
Weaknesses
● Complexity of maintaining two layers
● It doesn’t really beat CAP, it just reduces its complexity

XpertSea’s
Deepwater platform

The AWS Services we use
● S3
● API Gateway
● SQS/SNS
● AWS Lambda
● ECS
● DynamoDB
● RDS
● CloudFormation
● CloudWatch
● IAM
● CloudFront
● Route 53
● Cognito (soon)
● SES (soon)

The core system
Data Ingestion
[Diagram: API Gateway, AWS Lambda, Data lake (S3), Metadata Store (DynamoDB)]

The core system
Data lake
A typical entry
{
  "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
  "parent_id": "",
  "timestamp": 1517801441,
  "key": "event/0000d9f6-045d-438b-8058-4ee6447ba0fa/payload.json",
  "schema": "WorkflowV1.json",
  "type": "Workflow",
  "data": {
    "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "timestamp": 1517799978,
    "created_at": 1517801441,
    "created_by": "xyz"
    // ....
  }
}
● Stored in JSON for simplicity
● Metadata is copied to DynamoDB
● JSON Schema to handle validation
● Remember, all data is immutable!
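
A rough sketch of how an entry like the one above could be built and stored, under stated assumptions. Plain Python dicts stand in for the S3 bucket and the DynamoDB table, `build_entry` and `ingest` are hypothetical names, and treating "metadata" as every envelope field except `data` is a guess, not something the slides confirm. JSON Schema validation is left out to keep the example dependency-free.

```python
import json
import time
import uuid


def build_entry(event_type, schema, data, parent_id=""):
    """Wrap a payload in an envelope shaped like the typical entry."""
    event_id = str(uuid.uuid4())
    return {
        "id": event_id,
        "parent_id": parent_id,
        "timestamp": int(time.time()),
        "key": f"event/{event_id}/payload.json",
        "schema": schema,
        "type": event_type,
        "data": data,
    }


def ingest(entry, data_lake, metadata_store):
    """Write the entry once; refuse to overwrite, since all data is immutable."""
    if entry["key"] in data_lake:
        raise ValueError(f"entry already exists: {entry['key']}")
    data_lake[entry["key"]] = json.dumps(entry)
    # Assumption: metadata means every envelope field except the payload.
    metadata_store[entry["id"]] = {k: v for k, v in entry.items() if k != "data"}
    return entry["id"]


# Plain dicts stand in for the S3 bucket and the DynamoDB table.
data_lake, metadata_store = {}, {}
entry = build_entry("Workflow", "WorkflowV1.json", {"created_by": "xyz"})
entry_id = ingest(entry, data_lake, metadata_store)
```

A correction to existing data would be ingested as a new entry, possibly pointing at the original through `parent_id`, rather than by rewriting the stored object.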

The core system
Data Processing
● Generally a preprocessor to read from the stream and compute the latest data;
● A number of processors to perform the hard work;
● Generally a producer to cache the results somewhere;
● Chain as many as you need!
A typical message
{
  "schema": "WorkflowV1.json",
  "data": {
    "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "timestamp": 1517799978,
    "created_at": 1517801441.1827073,
    "created_by": "portal",
    "computed_1": 10,
    "computed_2": 142,
    // ....
  },
  "event_ids": [
    "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "id1",
    "id2"
  ]
}
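
The preprocessor, processor, and producer stages described above can be sketched as plain Python functions. This is an illustration only: the merge rule (later events override earlier fields), the two processors, and the cache key are all assumptions, and a dict stands in for whatever store holds the cached results.

```python
def preprocess(events):
    """Merge events (oldest first) into the latest state, tracking their ids."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    data = {}
    for event in ordered:
        data.update(event["data"])  # later events override earlier fields
    return {
        "schema": ordered[-1]["schema"],
        "data": data,
        "event_ids": [e["id"] for e in ordered],
    }


def add_computed_1(message):
    """Hypothetical processor: count the fields present in the latest data."""
    data = {**message["data"], "computed_1": len(message["data"])}
    return {**message, "data": data}


def add_computed_2(message):
    """Hypothetical processor: derive a value from an earlier computed one."""
    data = {**message["data"], "computed_2": message["data"]["computed_1"] * 2}
    return {**message, "data": data}


def produce(message, cache):
    """Producer: cache the result under the id of the oldest contributing event."""
    cache[message["event_ids"][0]] = message
    return message


def run_pipeline(events, processors, cache):
    """Chain the preprocessor, any number of processors, and the producer."""
    message = preprocess(events)
    for processor in processors:
        message = processor(message)
    return produce(message, cache)


events = [
    {"id": "id1", "timestamp": 2, "schema": "WorkflowV1.json",
     "data": {"created_by": "portal"}},
    {"id": "id0", "timestamp": 1, "schema": "WorkflowV1.json",
     "data": {"created_by": "xyz", "note": "a"}},
]
cache = {}
result = run_pipeline(events, [add_computed_1, add_computed_2], cache)
```

Each processor returns a new message instead of mutating its input, so replaying the same immutable events through the same chain reproduces the same cached result.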

The core system
Data Processing (Serverless)
[Diagram: AWS Lambda functions (five shown), DynamoDB, RDS]

The core system
Data Processing (Containers)
[Diagram: Amazon SQS, ECS + Docker, Amazon SQS, ECS + Docker, DynamoDB, RDS]

Of course we can mix and match!
[Diagram: AWS Lambda (two shown), DynamoDB, RDS, Amazon SQS, ECS + Docker]

The core system
Data Processing
We use a publisher/subscriber model to notify dependencies (higher-level queries or aggregations) that:
● A new result is now ready to be used
● An old value was recomputed due to additional data
This can trigger immediate recalculation, or be queued to be processed as part of a batch.
SNS is the obvious choice for this!
[Diagram: AWS Lambda, Amazon SNS, AWS Lambda]
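
The notification flow above can be illustrated with an in-memory stand-in for the topic. Nothing here calls AWS: the `Topic` class, the notification fields (`kind`, `result_id`), and the rule for which notifications are handled immediately versus queued are all hypothetical, chosen only to show the publisher/subscriber shape.

```python
class Topic:
    """In-memory stand-in for a notification topic (SNS in the talk)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, notification):
        # Fan out: every subscriber sees every notification.
        for handler in self._subscribers:
            handler(notification)


recomputed = []   # result ids recalculated immediately
batch_queue = []  # notifications deferred to a later batch run


def recalculate_now(notification):
    """Subscriber that reacts immediately, but only to new results."""
    if notification["kind"] == "new_result":
        recomputed.append(notification["result_id"])


def queue_for_batch(notification):
    """Subscriber that defers recomputed values to a batch."""
    if notification["kind"] == "recomputed":
        batch_queue.append(notification)


topic = Topic()
topic.subscribe(recalculate_now)
topic.subscribe(queue_for_batch)

topic.publish({"kind": "new_result", "result_id": "r1"})
topic.publish({"kind": "recomputed", "result_id": "r0"})
```

The publisher does not need to know which dependencies exist; adding a new higher-level aggregation only means adding another subscriber.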

The core system
Serving layer (API)
[Diagram: API Gateway, AWS Lambda, DynamoDB, RDS, Route 53, S3 bucket]

The core system
Serving layer (Web App)
[Diagram: S3 bucket, Route 53, CloudFront, Polymer]

Scaling concerns
How to scale?
● Multi-region to reduce latency
● New type of data? New pipeline
● Most of the system is serverless, or at least managed
● The serving data layer might need to move to DynamoDB in the future
● Keeping it in a relational DB for now to facilitate integration with our machine learning training framework

Some tools we use and love
Polymer
Python 3 with Troposphere

Tools we choose not to use
● Amazon Kinesis and Kinesis Firehose (pricing)
● AWS IoT (useful for a large number of simple devices)
● CloudWatch for log processing (we like ELK stacks better)
● Cassandra/Hadoop (too complex for now)

Resources and references
On the CAP Theorem: https://codahale.com/you-cant-sacrifice-partition-tolerance/
On Lambda Architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
On Kappa Architecture: https://www.oreilly.com/ideas/questioning-the-lambda-architecture
