Building a real-time data processing pipeline using Apache Kafka, Kafka Connect, Elasticsearch and Kibana

Paul Brebner
Instaclustr—Technology Evangelist
Instaclustr Sponsored Booth Presentation
30 September ApacheCon 2020
©Instaclustr Pty Limited, 2020

Blogs (54): www.instaclustr.com/paul-brebner/
Who Am I? What do I do?
1 year ago (ApacheCon Europe 2019)—Look it’s a light!

Who Is Instaclustr?

Instaclustr Managed Platform
A complete ecosystem to support mission-critical applications.

Open Source Big Data Technologies on the Instaclustr Managed Platform on multiple cloud providers

This talk…
Focuses on three recent additions to our managed platform:
• Kafka Connect
• Elasticsearch
• Kibana

Overview
• Technology Overview
• What’s the Story?
• Data sources
• Provisioning clusters
• Configuring Kafka source and sink connectors
• Elasticsearch mappings
• Kibana Visualizations
• Elasticsearch Ingest Pipeline
• Kibana Maps
• Handling failure

In general integration can be—complicated…

Or: easy, with Kafka Connect
What?
● Distributed solution to integrate Kafka with other heterogeneous data sources/stores.
● Connectors (source or sink) handle specifics of particular integrations
o Source -> Kafka
o Kafka -> Sink
Why?
● Zero-code integration
● High availability
● Elastic scaling independent of Kafka
[Diagram: Sources (Syslog and many more) -> Source Connectors -> Kafka Cluster -> Sink Connectors -> Sinks (and many more); the connectors run in a Kafka Connect cluster]
Managed Elasticsearch + Kibana
Elasticsearch: scalable search of indexed documents (Documents, Indices)
Kibana: visualization
Open Distro for Elasticsearch: 100% Apache 2.0 licensed

What’s The Story?
Kafka Summit: CDC built a Kafka COVID-19 pipeline in < 30 days.
Instaclustr consultants built an integration demo using public climate change data via REST connectors running on Docker.
Idea: Use streaming REST public data sources AND deploy on the Instaclustr managed platform.
Look for public streaming REST APIs with an easy-to-use JSON data format, complete data, an interesting domain, not political or apocalyptic…
Impossible?

USA Tidal Data
National Oceanic and Atmospheric Administration: https://oceanservice.noaa.gov/
Success! Tides follow Lunar Day

Bonus, NOAA tidal map: https://tidesandcurrents.noaa.gov/map/
What’s here?

API description: https://api.tidesandcurrents.noaa.gov/api/prod/

REST Example
Specify station ID, data type and datum (I used water level, mean sea level), latest data point, JSON
Call
https://api.tidesandcurrents.noaa.gov/api/prod/datagetter?date=latest&station=8724580&product=water_level&datum=msl&units=metric&time_zone=gmt&application=instaclustr&format=json
Returns
{"metadata": {
  "id":"8724580",
  "name":"Key West",
  "lat":"24.5508",
  "lon":"-81.8081"},
 "data":[{
  "t":"2020-09-24 04:18",
  "v":"0.597",
  "s":"0.005", "f":"1,0,0,0", "q":"p"}]}

REST call
JSON result
Let’s start the pipeline using this
REST API for data sources…
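
As a rough illustration of the call and result above, the sketch below rebuilds the query string from its parameters and pulls the station name, timestamp and water level out of the sample response. The helper names build_url and latest_reading are made up for this example, and nothing here contacts the NOAA service; it only works on the sample payload shown on the slide.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"


def build_url(station: str) -> str:
    """Build the 'latest water level' query from the slide for one station ID."""
    params = {
        "date": "latest",
        "station": station,
        "product": "water_level",
        "datum": "msl",
        "units": "metric",
        "time_zone": "gmt",
        "application": "instaclustr",
        "format": "json",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def latest_reading(payload: str) -> dict:
    """Extract station name, timestamp and level (as a float) from a response."""
    doc = json.loads(payload)
    point = doc["data"][0]
    return {
        "station": doc["metadata"]["name"],
        "time": point["t"],
        "level_m": float(point["v"]),  # the API returns numbers as strings
    }


# Sample response copied from the slide (Key West station).
SAMPLE = """{"metadata": {"id":"8724580","name":"Key West",
"lat":"24.5508","lon":"-81.8081"},
"data":[{"t":"2020-09-24 04:18","v":"0.597","s":"0.005","f":"1,0,0,0","q":"p"}]}"""

print(build_url("8724580"))
print(latest_reading(SAMPLE))
```

Note that every value in the response, including the water level, arrives as a JSON string, which matters later when the data is indexed.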

What Else Do We Need?
The Instaclustr Console
Provision Kafka and Kafka Connect clusters

Select cloud provider, region, instance size and number, security, etc.

Tell the Kafka Connect cluster which Kafka cluster to use, then provision
Your IP

Now we have Kafka and Kafka Connect clusters

Next, find a REST connector, deploy it to an S3 bucket, tell the Connect cluster which bucket, then configure the connector and run it
[Diagram: REST call (JSON result) -> REST source connector -> Tides Topic (automatically created)]
BYO connectors instructions: https://www.instaclustr.com/support/documentation/kafka-connect/accessing-and-using-kafka-connect/updating-custom-connectors/

curl https://connectorClusterIP:8083/connectors -k -u name:password -X POST -H 'Content-Type: application/json' -d '
{
  "name": "source_rest_tide_1",
  "config": {
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "connector.class": "com.tm.kafka.connect.rest.RestSourceConnector",
    "tasks.max": "1",
    "rest.source.poll.interval.ms": "600000",
    "rest.source.method": "GET",
    "rest.source.url": "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter?date=latest&station=8454000&product=water_level&datum=msl&units=metric&time_zone=gmt&application=instaclustr&format=json",
    "rest.source.headers": "Content-Type:application/json,Accept:application/json",
    "rest.source.topic.selector": "com.tm.kafka.connect.rest.selector.SimpleTopicSelector",
    "rest.source.destination.topics": "tides-topic"
  }
}'
REST source connector configuration including connector name, class, URL, topic

Each connector polls every 10 minutes and writes the result to the Kafka topic. I picked 5 sensors to use, so 5 connector instances are running.
Now we have tidal data coming into the tides topic, what next?
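
Since each station needs its own connector instance, the request bodies can be generated rather than hand-edited. The sketch below is one possible way to do that; it reuses the configuration keys from the curl example and varies only the connector name and the station ID. The function name source_connector and the name numbering are assumptions, and only the two station IDs that appear on these slides are listed (the talk used five).

```python
import json

# Same query as the curl example, with the station ID left as a placeholder.
NOAA_URL = (
    "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"
    "?date=latest&station={station}&product=water_level&datum=msl"
    "&units=metric&time_zone=gmt&application=instaclustr&format=json"
)


def source_connector(index: int, station: str) -> dict:
    """Return the JSON body to POST to /connectors for one station."""
    return {
        "name": f"source_rest_tide_{index}",
        "config": {
            "key.converter": "org.apache.kafka.connect.storage.StringConverter",
            "value.converter": "org.apache.kafka.connect.storage.StringConverter",
            "connector.class": "com.tm.kafka.connect.rest.RestSourceConnector",
            "tasks.max": "1",
            "rest.source.poll.interval.ms": "600000",  # 10 minutes
            "rest.source.method": "GET",
            "rest.source.url": NOAA_URL.format(station=station),
            "rest.source.headers": "Content-Type:application/json,Accept:application/json",
            "rest.source.topic.selector": "com.tm.kafka.connect.rest.selector.SimpleTopicSelector",
            "rest.source.destination.topics": "tides-topic",
        },
    }


# Only the station IDs shown on the slides; extend the list for more sensors.
STATIONS = ["8454000", "8724580"]
connectors = [source_connector(i, s) for i, s in enumerate(STATIONS, start=1)]

for body in connectors:
    print(json.dumps(body, indent=2))
```

Each printed body could then be sent with the same curl command as above.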
[Diagram: REST call (JSON result) -> REST source connector -> Tides Topic]
{"metadata": {
  "id":"8724580",
  "name":"Key West",
  "lat":"24.5508",
  "lon":"-81.8081"},
 "data":[{
  "t":"2020-09-24 04:18",
  "v":"0.597"}]}

Next - Provision Elasticsearch+Kibana clusters

And configure the included Elasticsearch sink connector
to send data to Elasticsearch
[Diagram: REST call (JSON result) -> REST source connector -> Tides Topic -> Elastic sink connector -> Tides Index]

curl https://connectorClusterIP:8083/connectors -k -u name:password -X POST -H 'Content-Type: application/json' -d '
{
  "name" : "elastic-sink-tides",
  "config" : {
    "connector.class" : "com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkConnector",
    "tasks.max" : 3,
    "topics" : "tides-topic",
    "connect.elastic.hosts" : "ip",
    "connect.elastic.port" : 9201,
    "connect.elastic.kcql" : "INSERT INTO tides-index SELECT * FROM tides-topic",
    "connect.elastic.use.http.username" : "elasticName",
    "connect.elastic.use.http.password" : "elasticPassword"
  }
}'
Configure sink connector name, class, index and topic.
The index is created with default mappings if it doesn’t already exist.
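
The sink configuration names the topic twice: once in "topics" and once in the KCQL statement. A quick consistency check before posting the configuration can catch a mismatch between the two. The sketch below assumes that "topics" is meant to include the topic the KCQL reads from, and its regular expression only covers the simple INSERT INTO ... SELECT ... FROM ... form used here, not the full KCQL grammar. Both helper names are hypothetical.

```python
import re

# Matches only the simple form used on the slide, not the full KCQL grammar.
KCQL_PATTERN = re.compile(
    r"INSERT\s+INTO\s+(\S+)\s+SELECT\s+.+?\s+FROM\s+(\S+)", re.IGNORECASE
)


def kcql_endpoints(kcql: str) -> tuple:
    """Return (target index, source topic) from a simple KCQL statement."""
    match = KCQL_PATTERN.search(kcql)
    if match is None:
        raise ValueError(f"unrecognised KCQL: {kcql!r}")
    return match.group(1), match.group(2)


def topic_is_subscribed(config: dict) -> bool:
    """True if the topic read by the KCQL is also listed under 'topics'."""
    _, topic = kcql_endpoints(config["connect.elastic.kcql"])
    subscribed = [t.strip() for t in config["topics"].split(",")]
    return topic in subscribed


config = {
    "topics": "tides-topic",
    "connect.elastic.kcql": "INSERT INTO tides-index SELECT * FROM tides-topic",
}
print(kcql_endpoints(config["connect.elastic.kcql"]))
print(topic_is_subscribed(config))
```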

Great! It’s All Working!? Sort Of!
Tide data is arriving in the Tides Index!
But with the default index mappings, everything is a String.
To graph the readings as a time series by station name, we need a custom mapping.
{"metadata": {
  "id":"String",
  "name":"String",
  "lat":"String",
  "lon":"String"},
 "data":[{
  "t":"String",
  "v":"String"}]}

curl -u elasticName:elasticPassword "elasticURL:9201/tides-index" -X PUT -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "data" : {
        "properties" : {
          "t" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm" },
          "v" : { "type" : "double" },
          "f" : { "type" : "text" },
          "q" : { "type" : "text" },
          "s" : { "type" : "text" }
        }
      },
      "metadata" : {
        "properties" : {
          "id" : { "type" : "text" },
          "lat" : { "type" : "text" },
          "lon" : { "type" : "text" },
          "name" : { "type" : "keyword" }
        }
      }
    }
  }
}'
Custom mapping: “t” is a date, “v” is a double, and “name” is a keyword.
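
To make the intent of the custom mapping concrete, the sketch below applies the same three interpretations to the sample document in plain Python: "t" parsed with the pattern yyyy-MM-dd HH:mm, "v" read as a floating-point number, and "name" kept as an exact, unanalysed value. It only illustrates the types; it is not how Elasticsearch itself parses documents, and typed_fields is a hypothetical helper.

```python
from datetime import datetime

# strptime equivalent of the mapping's "yyyy-MM-dd HH:mm" date format.
DATE_FORMAT = "%Y-%m-%d %H:%M"


def typed_fields(doc: dict) -> list:
    """Return (name, timestamp, level) tuples with the types the mapping implies."""
    name = doc["metadata"]["name"]  # keyword: kept as one exact value
    rows = []
    for point in doc["data"]:
        timestamp = datetime.strptime(point["t"], DATE_FORMAT)  # date
        level = float(point["v"])  # double
        rows.append((name, timestamp, level))
    return rows


doc = {
    "metadata": {"id": "8724580", "name": "Key West",
                 "lat": "24.5508", "lon": "-81.8081"},
    "data": [{"t": "2020-09-24 04:18", "v": "0.597"}],
}
print(typed_fields(doc))
```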

Reindexing!
• Every time you change an Elasticsearch index mapping, you have to
  • Delete the index
  • Index all the data again
• But where does the data come from?
• Two options:
  • Using a Kafka sink connector the data is already in the Kafka topic, so just replay it, or,
  • Use the Elasticsearch reindex operation
• The hard part is over, now…
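
For the second option, one common pattern is to create a new index with the new mapping and copy the documents across with the _reindex API. The sketch below only assembles the two requests as data; it does not send them. The destination index name tides-index-v2 and the helper reindex_plan are assumptions for illustration.

```python
import json


def reindex_plan(source_index: str, dest_index: str, mappings: dict) -> list:
    """Return the (method, path, body) requests for a copy-based reindex."""
    return [
        # 1. Create the destination index with the new mapping.
        ("PUT", f"/{dest_index}", {"mappings": mappings}),
        # 2. Copy every document from the old index into the new one.
        ("POST", "/_reindex", {
            "source": {"index": source_index},
            "dest": {"index": dest_index},
        }),
    ]


new_mappings = {
    "properties": {
        "data": {"properties": {
            "t": {"type": "date", "format": "yyyy-MM-dd HH:mm"},
            "v": {"type": "double"},
        }},
        "metadata": {"properties": {"name": {"type": "keyword"}}},
    }
}

plan = reindex_plan("tides-index", "tides-index-v2", new_mappings)
for method, path, body in plan:
    print(method, path, json.dumps(body))
```

With this approach the sink connector (or an alias) would then need to point at the new index; replaying the Kafka topic avoids that step.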

Start Kibana With A Single Click

Visualization Steps
1. Index Pattern (to get data from Elasticsearch)
Settings -> Index Patterns -> Create Index Pattern -> Define -> Configure with “t” as the time filter field
2. Create Visualization (to create a graph type)
Visualizations -> Create Visualization -> New Visualization -> Line -> Choose Source = pattern from 1
3. Configure Graph Settings (to display data correctly)
Select time range, select aggregation for y-axis = average -> data.v -> select Buckets -> Split series metadata.name -> X-axis -> Date Histogram = data.t

Time (x axis) vs. average (over 30m) tide level (relative to
average level) in meters for the 5 sample stations
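
The chart is, in effect, a grouped average: readings are split by station name, placed into 30-minute buckets by timestamp, and averaged within each bucket. The sketch below reproduces that calculation in plain Python as an illustration of what is plotted. It is not the query Kibana issues, and apart from the 0.597 reading from the slides the sample values are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 30-minute bucket."""
    return ts - timedelta(minutes=ts.minute % 30,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def average_by_bucket(readings) -> dict:
    """Average level per (station name, 30-minute bucket)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for name, ts, level in readings:
        key = (name, bucket_start(ts))
        totals[key][0] += level
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# (station, timestamp, level in meters); only the first value is from the slides.
READINGS = [
    ("Key West", datetime(2020, 9, 24, 4, 18), 0.597),
    ("Key West", datetime(2020, 9, 24, 4, 24), 0.603),
    ("Key West", datetime(2020, 9, 24, 4, 36), 0.610),
]

for (name, start), avg in sorted(average_by_bucket(READINGS).items()):
    print(name, start, round(avg, 3))
```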

Showing Lunar Day (24 hours 50 minutes)

Showing Tidal Range (high tide – low tide)
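
Both quantities read off the graphs are simple arithmetic. The sketch below computes a tidal range as the highest minus the lowest level in a window, and the spacing between successive high tides under the assumption of two high tides per lunar day (typical for many coasts, but not all). The sample levels are invented apart from 0.597.

```python
from datetime import timedelta

LUNAR_DAY = timedelta(hours=24, minutes=50)


def tidal_range(levels) -> float:
    """Tidal range: highest level minus lowest level, in the input units."""
    return max(levels) - min(levels)


# Assuming two high tides per lunar day (a semidiurnal tide).
high_to_high = LUNAR_DAY / 2

levels_m = [0.597, 0.400, -0.210]  # illustrative readings in meters
print(round(tidal_range(levels_m), 3))
print(high_to_high)
```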

By R. Ray, NASA Goddard Space Flight Center, Jet Propulsion Laboratory, Scientific Visualization Studio - TOPEX/Poseidon:
Revealing Hidden Tidal Energy, Public Domain
Tide range varies depending on moon, sun, local geography, and weather!

Neah Bay is near here

Australia’s Biggest Tide is here

Tides of over 11 meters are forced through two narrow passes, creating the popular tourist attraction known as the Horizontal Waterfalls in the Kimberley's Talbot Bay. (Photo by Richard Costin)
Next, a map to show the sensor locations to understand tidal ranges

But, there are no geo-points in the data!

Mapping Steps
1. Add geo-point field to index mapping
2. Create Elasticsearch ingest pipeline to construct new field
3. Add as default ingest pipeline to index
Problem: Elasticsearch doesn’t recognize separate lat lon fields as geo-points
Solution: Add an Elasticsearch ingest pipeline to pre-process documents before they are indexed
(Need to reindex again)

curl -u elasticName:elasticPassword "elasticURL:9201/tides-index" -X PUT -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "data" : {
        "properties" : {
          "t" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm" },
          "v" : { "type" : "double" },
          "f" : { "type" : "text" },
          "q" : { "type" : "text" },
          "s" : { "type" : "text" }
        }
      },
      "metadata" : {
        "properties" : {
          "id" : { "type" : "text" },
          "lat" : { "type" : "text" },
          "lon" : { "type" : "text" },
          "location" : { "type" : "geo_point" },
          "name" : { "type" : "keyword" }
        }
      }
    }
  }
}'
1. Add a new “location” field with a geo_point data type to the mapping and index

curl -u elasticName:elasticPassword "elasticURL:9201/_ingest/pipeline/locationPipe" -X PUT -H 'Content-Type: application/json' -d'
{
  "description" : "construct geo-point String field",
  "processors" : [
    {
      "set" : {
        "field": "metadata.location",
        "value": "{{metadata.lat}},{{metadata.lon}}"
      }
    }
  ]
}'
2. Create a new ingest pipeline to construct the new location geo-point String from the existing lat lon fields
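
The effect of the set processor can be mimicked in a few lines: it fills the template "{{metadata.lat}},{{metadata.lon}}" from the document and stores the result in metadata.location, giving a "lat,lon" string that a geo_point field accepts. The sketch below is only a stand-in for what the pipeline does inside Elasticsearch, and add_location is a hypothetical name.

```python
def add_location(doc: dict) -> dict:
    """Return a copy of the document with metadata.location set to 'lat,lon'."""
    metadata = dict(doc["metadata"])  # copy so the input is left untouched
    metadata["location"] = f'{metadata["lat"]},{metadata["lon"]}'
    return {**doc, "metadata": metadata}


doc = {
    "metadata": {"id": "8724580", "name": "Key West",
                 "lat": "24.5508", "lon": "-81.8081"},
    "data": [{"t": "2020-09-24 04:18", "v": "0.597"}],
}

print(add_location(doc)["metadata"]["location"])
```

The order matters: in the string form of a geo-point, latitude comes first.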

3. Add locationPipe as the default pipeline for the index
curl -u elasticName:elasticPassword "elasticURL:9201/tides-index/_settings?pretty" -X PUT -H 'Content-Type: application/json' -d'
{
  "index" : {
    "default_pipeline" : "locationPipe"
  }
}'

Now we have a pipeline transforming the raw data and adding geo-point location data in Elasticsearch
[Diagram: REST call (JSON result) -> REST source connector -> Tides Topic -> LocationPipe ingestor]
Before (JSON result from the REST call):
{"metadata": {
  "id":"8724580",
  "name":"Key West",
  "lat":"24.5508",
  "lon":"-81.8081"},
 "data":[{
  "t":"2020-09-24 04:18",
  "v":"0.597"}]}
After the LocationPipe ingestor:
{"metadata": {
  "id":"8724580",
  "name":"Key West",
  "lat":"24.5508",
  "lon":"-81.8081",
  "location": "24.5508,-81.8081"},
 "data":[{
  "t":"2020-09-24 04:18",
  "v":"0.597"}]}

Mapping Visualization Steps
1. Create Visualization
Visualizations -> Create visualization -> New Coordinate Map -> Select index patterns (reuse existing index pattern) -> Visualization with default map
2. Configure Graph Settings (to display data correctly)
Select Metrics -> Aggregation (min) -> Field -> data.v -> Buckets -> Geo coordinates -> Geohash -> Field -> metadata.location

Map showing sensor locations and min values over last week

Add your own custom Web Map Service (WMS) layers
URL: https://services.nationalmap.gov/arcgis/services/USGSNAIPPlus/MapServer/WMSServer
Layers: 1,2,3,5,6,7,9,10,11,13,14,15,17,18,19,21,22,23,25,26,27,29,30,31,32

What can go wrong? The REST call can return an error message, but the REST source connector doesn’t treat it as an error, so the message is sent to the Tides Topic.
{"error": {"message":"No data was found. This product may not be offered at this station at the requested time."}}
[Diagram: REST call (JSON result) -> REST source connector -> Tides Topic -> LocationPipe ingestor]
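
One way to recognise these messages is by shape: a normal result has "metadata" and "data" keys, while the failure case has a single top-level "error" key. The sketch below shows such a check; where it would run (for example in a consumer or a custom transform) is left open, and is_error_payload is a hypothetical helper rather than part of any connector.

```python
import json


def is_error_payload(raw: str) -> bool:
    """True if a topic message is not a usable tide reading."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return True  # not JSON at all
    if not isinstance(doc, dict):
        return True
    return "error" in doc or "data" not in doc


ERROR_MESSAGE = (
    '{"error": {"message":"No data was found. This product may not be '
    'offered at this station at the requested time."}}'
)
GOOD_MESSAGE = (
    '{"metadata": {"id":"8724580","name":"Key West",'
    '"lat":"24.5508","lon":"-81.8081"},'
    '"data":[{"t":"2020-09-24 04:18","v":"0.597"}]}'
)

print(is_error_payload(ERROR_MESSAGE))
print(is_error_payload(GOOD_MESSAGE))
```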

The Elastic sink connector tries to read the error message and goes into the FAILED state.
Exceptions are viewable in the Kafka Connect logs topic.
[Diagram: Tides Topic -> Elastic sink connector (FAILED) -> Connect logs topic]

The current workaround is to monitor and regularly restart failed connectors.
[Diagram: FAILED? -> Restart! -> RUNNING]
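
The monitoring step can be built on the Kafka Connect REST API, which reports connector and task states at GET /connectors/{name}/status and accepts restarts at POST /connectors/{name}/restart and POST /connectors/{name}/tasks/{id}/restart. The sketch below takes a status document and works out which restart calls are needed; it makes no HTTP requests, and the sample status (including the worker address and trace text) is invented for illustration.

```python
def restart_paths(status: dict) -> list:
    """Return the REST paths to POST to in order to restart anything FAILED."""
    name = status["name"]
    paths = []
    if status["connector"]["state"] == "FAILED":
        paths.append(f"/connectors/{name}/restart")
    for task in status.get("tasks", []):
        if task["state"] == "FAILED":
            paths.append(f"/connectors/{name}/tasks/{task['id']}/restart")
    return paths


# Shape of a GET /connectors/elastic-sink-tides/status response (values made up).
sample_status = {
    "name": "elastic-sink-tides",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "10.0.0.1:8083",
         "trace": "(stack trace omitted)"},
    ],
    "type": "sink",
}

for path in restart_paths(sample_status):
    print("POST", path)
```

A scheduled job could fetch the status for each connector, pass it through this function, and POST to the returned paths with the same credentials used in the curl examples.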

A better solution: if the connectors support KIP-298 “Error Handling in Connect” (not all do), configure them to ignore input errors.
Errors are sent to a “dead letter” topic.
[Diagram: Tides Topic -> sink connector set to ignore errors -> Dead letter topic]
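
In Kafka Connect, KIP-298 added a set of "errors.*" connector properties for this purpose. The sketch below adds the usual ones to an existing sink configuration. The dead letter topic name tides-dead-letter is an assumption, and whether a given connector honours these settings for a particular failure depends on the connector and on where in the pipeline the error occurs.

```python
def with_error_tolerance(config: dict, dead_letter_topic: str) -> dict:
    """Return a copy of a sink connector config with KIP-298 settings added."""
    tolerant = dict(config)
    tolerant.update({
        # Keep running when a record cannot be processed.
        "errors.tolerance": "all",
        # Route the failed records to a separate topic.
        "errors.deadletterqueue.topic.name": dead_letter_topic,
        # Add headers describing the failure to each dead letter record.
        "errors.deadletterqueue.context.headers.enable": "true",
        # Also write the failure to the Connect log.
        "errors.log.enable": "true",
    })
    return tolerant


sink_config = {
    "connector.class": "com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkConnector",
    "topics": "tides-topic",
}

for key, value in with_error_tolerance(sink_config, "tides-dead-letter").items():
    print(f"{key} = {value}")
```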

Thanks to…
• Instaclustr consultants, Kafka and Elasticsearch dev teams, graphic design and marketing teams
• Zeke, Mussa, Michael, Hendra, Rob, Harvey, Jill, Gina and more!
• Try us out! Build the same or your own pipeline with our free trial at Instaclustr.com

www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK YOU!
