Continuous SQL with Kafka and Flink
Tim Spann
Principal Developer Advocate
February 20, 2024
https://www.meetup.com/dba-fundamentals-group/events/296855261/
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC, ex-HPE
https://medium.com/@tspann
https://github.com/tspannhw
FLaNK Stack Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
© 2023 Cloudera, Inc. All rights reserved.
Future of Data - NYC + NJ + Philly + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
AGENDA
Introduction
Overview
Streaming Projects
Streaming Analytics
Demos
Resources
Q&A
FLANK
BUILDING REAL-TIME REQUIRES A TEAM
Already using Spark? Need NiFi? Need Flink?
Want unified Batch/Stream?
Want highest Throughput?
Don’t need low latency?
Large files?
Scheduled batches?
Replacing Sqoop, ETL
Simple JDBC queries?
Transform individual records?
Want easy development?
Lots of small files, events, records, rows?
Continuous stream of rows
Support many different sources
Need Microservices, Batch and Stream?
Want high Throughput?
Want Low Latency?
Want Advanced Windowing and State?
Happy with a New Solution that is
best-in-class?
Spark, NiFi, Flink? Which engine to choose?
Live Q&A
Travel Advisories
Weather Reports
Documents
Social Media
Databases
Transactions
Public Data Feeds
S3 / Files
Logs
ATM Data
Live Chat
…
HYBRID CLOUD
INTERACT
COLLECT STORE
ENRICH, REPORT
Distribute
Collect
Report
REPORT
Visualize
Report, Automate
AI BASED ENHANCEMENTS
Predict, Automate
VECTOR DATABASE
LLM
Machine
Learning
Data
Visualization
Data Flow
Data
Warehouse
SQL
Stream Builder
Data
Visualization
Input Sentences
Generated Text
Timestamp
Input Sentence
Timestamps
Enrichments
Messaging
Broker
Real-time alerting
Real-time alerting
Aggregations
EXAMPLES
Analytics-in-Stream
Data Sources Streaming Storage
Substrate
Cloudera Stream Processing
Kafka + NiFi enables
real-time ingestion into
lakes / analytics services
Data Distribution
Service
Cloudera DataFlow
Warehouses & Operational DB
Data Lakes & Lake Houses
Data-At-Rest Analytics
Data Apps Powered by
Streaming Insights and used
by other Analytics Services
Kafka + Flink
enables streaming
analytics
Cloudera Stream Processing
Streaming
Analytics
Low Latency
Data Products
Data-In-Motion Streaming Analytics
Cloudera Edge Flow
Edge Ingest
Flink
Connectors
Streaming Data Pipelines with Cloudera Data Platform
Edge
devices
Cloud DB’s
SaaS tools
Change Data
Capture
On prem
apps/DB’s
Streams
Any
destination
Custom
Kafka
producers
Custom
Kafka
consumers
Operational
Applications
Analytical
Applications
STREAMING DATA MOVEMENT & PROCESSING STREAMING DATA PROCESSING & ANALYTICS
APACHE KAFKA
What Can You Do With Apache Kafka?
Web site activity: track page views, searches, etc. in real time
Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
Real-time data ingestion: fast processing of a very large volume of messages
STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many patterns.
• Publish-Subscribe semantics.
• Horizontal scalability.
• Efficient implementation to operate at speed with big
data volumes.
• Organized by topic to support several use cases.
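The publish-subscribe bullets above can be sketched with a toy in-memory log. This is only an illustration of the semantics, not the Kafka client API: the `Topic` class, its methods, and the sample messages are hypothetical names made up for this sketch.

```python
from collections import defaultdict


class Topic:
    """Toy in-memory stand-in for one Kafka topic partition (not the Kafka API)."""

    def __init__(self):
        self.log = []                     # append-only list of messages
        self.offsets = defaultdict(int)   # next offset to read, per consumer group

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        """Return messages this group has not seen yet and advance its offset."""
        start = self.offsets[group]
        self.offsets[group] = len(self.log)
        return self.log[start:]


topic = Topic()
topic.publish("page_view:/home")
topic.publish("search:flink")

dashboard = topic.poll("dashboard")        # both messages
alerts = topic.poll("alerts")              # same two messages, read independently
topic.publish("page_view:/docs")
dashboard_next = topic.poll("dashboard")   # only the new message
```

Each consumer group keeps its own offset, which is what decouples producers from consumers and allows many-to-many patterns.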
APACHE FLINK
CONTINUOUS SQL
● SSB is a Continuous SQL engine
● It’s SQL, with a slightly different mental model that has big implications
Traditional Parse/Execute/Fetch model Continuous SQL Model
Hint: The query is boundless and never finishes, and time matters
AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
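A rough way to picture the continuous model is a generator over an unbounded source. The sketch below is plain Python, not Flink; `sensor_stream` and `continuous_select` are hypothetical names. The query object never finishes on its own, so the caller decides how many results to take. A predicate that never matches would simply keep scanning forever.

```python
import itertools


def sensor_stream():
    """Hypothetical unbounded source: yields readings forever."""
    for i in itertools.count():
        yield {"id": i, "temp": 20 + (i % 5)}


def continuous_select(stream, predicate):
    """Emit each matching row as it arrives; never returns on an unbounded stream."""
    for row in stream:
        if predicate(row):
            yield row


hot = continuous_select(sensor_stream(), lambda r: r["temp"] >= 24)
first_three = list(itertools.islice(hot, 3))   # take three results, then stop reading
```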
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;
-- use event time timestamp from Kafka
-- exactly-once compatible
SELECT eventTimestamp FROM sensors;
-- nested structure access
SELECT foo.`bar` FROM `table`; -- must quote nested column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem);
-- aggregations and windows
SELECT card,
MAX(amount) AS theamount,
TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
AND lon IS NOT NULL
GROUP BY card,
TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- >4==fraud
-- try to do this in ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
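To make the tumbling-window fraud query concrete, here is a small batch simulation in Python. It is a sketch only: timestamps are assumed to be epoch seconds, the sample payments are invented, the lat/lon null filter is left out, and a real Flink job would emit each window when the watermark passes its end rather than after reading everything.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def tumbling_fraud_candidates(payments, min_count=4):
    """Group by (card, 5-minute window); keep groups with more than min_count rows."""
    groups = defaultdict(list)
    for p in payments:
        window_start = p["ts"] - p["ts"] % WINDOW_SECONDS
        groups[(p["card"], window_start)].append(p["amount"])
    return [
        {"card": card, "theamount": max(amounts), "window_end": start + WINDOW_SECONDS}
        for (card, start), amounts in sorted(groups.items())
        if len(amounts) > min_count
    ]


# Hypothetical sample data: five payments on card A inside one window.
payments = [{"card": "A", "amount": a, "ts": 10 + i} for i, a in enumerate([5, 9, 3, 7, 2])]
payments += [{"card": "B", "amount": 50, "ts": 20}]
payments += [{"card": "A", "amount": 99, "ts": 400}]   # falls in the next window
result = tumbling_fraud_candidates(payments)
```

Only card A's first window has more than four payments, so it is the only row returned.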
SQL STREAM BUILDER (SSB)
SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
No Java or Scala code development required.
Simplifies access to data in Kafka and Flink, with connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC, and more.
Enrich streaming data with batch data in a single tool.
Democratize access to real-time data with just SQL
SSB MATERIALIZED VIEWS
Key Takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose
ICEBERG INTEGRATION
Robust Next Generation Architecture for Data Driven Business
Unified Processing Engine Massive Open table format
Iceberg Support for Flink APIs through SSB
• Maximally open
• Maximally flexible
• Ultra high performance for MASSIVE data
DEMO AND CODE
Continuous SQL
select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet,
max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet,
max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed,
count(alt_baro) as RowCount,
hex as ICAO, flight as IDENT
from `sr1`.`default_database`.`adsb`
group by flight, hex;
select transcom.title, transcom.description, mta.VehicleRef,
DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING), mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) as miles,
mta.StopPointName, mta.Bearing, mta.DestinationName, mta.ExpectedArrivalTime, mta.VehicleLocationLatitude, mta.VehicleLocationLongitude,
mta.ArrivalProximityText, mta.DistanceFromStop, mta.AimedArrivalTime, mta.`Date`, mta.ts, mta.uuid, mta.EstimatedPassengerCapacity, mta.EstimatedPassengerCount
from `schemareg1`.`default_database`.`mta` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ mta
FULL OUTER JOIN `schemareg1`.`default_database`.`transcom` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ transcom
ON (transcom.latitude >= CAST(mta.VehicleLocationLatitude as float) - 0.3)
AND (transcom.longitude >= CAST(mta.VehicleLocationLongitude as float) - 0.3)
AND (transcom.latitude <= CAST(mta.VehicleLocationLatitude as float) + 0.3)
AND (transcom.longitude <= CAST(mta.VehicleLocationLongitude as float) + 0.3)
WHERE mta.VehicleRef is not null
AND transcom.title is not null
AND DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING), mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) <= 120;
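The join above pairs a cheap bounding-box condition (within 0.3 degrees) with a distance check. The slides do not show how `DISTANCE_BETWEEN` is implemented, so the sketch below assumes it returns great-circle miles from two latitude/longitude pairs and uses the haversine formula. The function names and city coordinates are illustrative only.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def distance_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_bounding_box(lat1, lon1, lat2, lon2, delta=0.3):
    """Cheap prefilter mirroring the +/- 0.3 degree join condition."""
    return abs(lat1 - lat2) <= delta and abs(lon1 - lon2) <= delta


nyc = (40.7128, -74.0060)
newark = (40.7357, -74.1724)
philly = (39.9526, -75.1652)

near = in_bounding_box(*nyc, *newark)   # inside the box, worth computing distance
far = in_bounding_box(*nyc, *philly)    # outside the box, pair is skipped
miles = distance_between(*nyc, *newark)
```

Note the argument order: each point is passed as latitude first, then longitude.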
Real-time observability pipeline
MiNiFi agents
Raw Logs
Cloudera Data Flow
Cloudera Data
Lakehouse
Triaging ML Models
Threat Hunting
Response and
Investigation
UEBA/Fraud
Detection
Reports
Auto Action
Cloudera Streaming
Analytics
Cybersec
Toolkits
Parse, Triage, Profile
Cloudera Streams
Processing
Kafka
SQL Stream
Builder
SPLUNK / SIEM
/EXTERNAL
Cloudera Machine Learning
Collect Route/Filter/
Transform
Prepare/Analyze/
Alert
Data in Motion: Overview and What's New in NiFi, Kafka, and Flink
Presenter: Tim Spann - Principal DIM Specialist and Developer Advocate
https://medium.com/cloudera-inc/transit-in-sao-paulo-brasil-flank-style-eaec6753cc63
CDC ENGINE SELECTION
HOW TO DO IT?
Kafka Connect, NiFi, Flink? Which engine to choose? Or All 3?
Already using Kafka?
• Simple setup for many tables
• Want metadata augmented data
• Don’t need low latency?
• Visual monitoring
• Easy manual scaling
• Easy to combine with NiFi
• Debezium
Already using NiFi?
• Simple JDBC queries?
• Transform individual records?
• Want easy development with UI?
• Lots of small files, events, records, rows?
• Continuous stream of rows
• Support many different sources
• Debezium coming
Need for Fast Flink?
• Strong control of table and joins
• Want high Throughput?
• Want Low Latency?
• Want Advanced Windowing and State?
• Automatic records immediately
• Pure SQL
• Debezium
CDC ARCHITECTURE - Using FLaNK to pull the data out of anything in near-real time
INGEST PREPARE PUBLISH
DATA SOURCES
Internal Users
(After Sales)
External
Systems
ENTERPRISE
LAKEHOUSE
CAPABILITY VIEW
INGESTION
ENTERPRISE DATA
MESSAGE HUB
STORAGE
BATCH
MANAGEMENT
STREAM
CONSUMPTION
Closed Loop
Systems
SQL Stream Builder
Machine Learning
Data Visualization
Workload Manager
watsonx.data
Data Distribution as a Universal, Hybrid, Multi-Cloud Data Service
Universal Data Distribution Service
(Ingest, Transform, Deliver)
Ingest
Processors
Ingest Gateway
Router, Filter &
Transform Processors
Destination
Processors
Cloud Business Process Services*
Log Data Sources
Laptops /
Servers Security
Agents
IOT Devices App Logs
Mobile Apps
Cloud Data Analytics/ Service *
On-Prem Data Sources Cloud Warehouse
(Cloudera DW)
Big Data Cloud Services
Multi-Cloud Data Distribution Service that Solves the First & Last Mile Problem for the Modern Data Stack
CDC with SQL Stream Builder
(Flink SQL)
Streaming CDC with Cloudera SQL Stream Builder (Flink SQL)
https://github.com/tspannhw/FLaNK-CDC/blob/main/flinkcdc.MD
https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html
CDC with Debezium and Flink
SQL Stream Builder with Flink SQL
CREATE TABLE `postgres_cdc_newjerseybus` (
`title` STRING,
`description` STRING,
`link` STRING,
`guid` STRING,
`advisoryAlert` STRING,
`pubDate` STRING,
`ts` STRING,
`companyname` STRING,
`uuid` STRING,
`servicename` STRING
) WITH (
'connector' = 'postgres-cdc',
'database-name' = 'tspann',
'hostname' = '192.168.1.153',
'password' = 'tspann',
'decoding.plugin.name' = 'pgoutput',
'schema-name' = 'public',
'table-name' = 'newjerseybus',
'username' = 'tspann',
'port' = '5432'
);
Flink SQL Tables - Debezium CDC From Database Tables
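What the CDC source delivers can be pictured as a stream of Debezium-style change events with an `op` code plus `before` and `after` row images. The Python sketch below replays such events into an in-memory copy of the table. The helper name, the choice of `uuid` as key, and the sample rows are assumptions for illustration. The connector itself handles this inside Flink.

```python
def apply_change(table, event, key="uuid"):
    """Apply one Debezium-style change event to an in-memory copy of the table."""
    op = event["op"]
    if op in ("c", "r", "u"):            # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":                      # delete
        table.pop(event["before"][key], None)
    return table


# Hypothetical change events for the newjerseybus table.
events = [
    {"op": "c", "before": None, "after": {"uuid": "1", "title": "Route 139 delay"}},
    {"op": "c", "before": None, "after": {"uuid": "2", "title": "Route 64 detour"}},
    {"op": "u", "before": {"uuid": "1", "title": "Route 139 delay"},
     "after": {"uuid": "1", "title": "Route 139 cleared"}},
    {"op": "d", "before": {"uuid": "2", "title": "Route 64 detour"}, "after": None},
]

table = {}
for e in events:
    apply_change(table, e)
```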
Flink SQL Tables - Upsert to Kafka Topics
CREATE TABLE `upsert_kafka_newjerseybus` (
`title` STRING,
`description` STRING,
`link` STRING,
`guid` STRING,
`advisoryAlert` STRING,
`pubDate` STRING,
`ts` STRING,
`companyname` STRING,
`uuid` STRING,
`servicename` STRING,
`eventTimestamp` TIMESTAMP(3),
WATERMARK FOR `eventTimestamp` AS `eventTimestamp` - INTERVAL '5' SECOND,
PRIMARY KEY (uuid) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'kafka_newjerseybus',
'properties.bootstrap.servers' = 'kafka:9092',
'key.format' = 'json',
'value.format' = 'json'
);
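Two ideas in this table definition are worth a quick sketch: upsert semantics on the primary key, and the watermark that trails event time by five seconds. The Python below is a simplified model with invented keys and values, not the connector's implementation. It treats a `None` value as a delete, the way a Kafka tombstone is interpreted by an upsert sink.

```python
from datetime import datetime, timedelta

MAX_OUT_OF_ORDERNESS = timedelta(seconds=5)


def upsert(state, key, value):
    """Last write wins per key; a None value acts as a delete (tombstone)."""
    if value is None:
        state.pop(key, None)
    else:
        state[key] = value


def watermark(max_event_time):
    """Mirror of `eventTimestamp - INTERVAL '5' SECOND`."""
    return max_event_time - MAX_OUT_OF_ORDERNESS


state = {}
upsert(state, "bus-1", {"title": "On time"})
upsert(state, "bus-1", {"title": "Delayed"})   # replaces the earlier row
upsert(state, "bus-2", {"title": "On time"})
upsert(state, "bus-2", None)                   # tombstone removes the key

wm = watermark(datetime(2024, 2, 20, 12, 0, 10))
```

Events with a timestamp older than the watermark are considered late by windowed operators.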
RESOURCES/WRAP-UP
https://github.com/tspannhw/FLaNK-Transit
SELECT n.speed, n.travel_time, n.borough, n.link_name, n.link_points,
n.latitude, n.longitude, DISTANCE_BETWEEN(CAST(t.latitude as STRING),
CAST(t.longitude as STRING),
m.VehicleLocationLatitude, m.VehicleLocationLongitude) as miles,
t.title, t.`description`, t.pubDate, t.latitude, t.longitude,
m.VehicleLocationLatitude, m.VehicleLocationLongitude,
m.StopPointRef, m.VehicleRef,
m.ProgressRate, m.ExpectedDepartureTime, m.StopPoint,
m.VisitNumber, m.DataFrameRef, m.StopPointName,
m.Bearing, m.OriginAimedDepartureTime, m.OperatorRef,
m.DestinationName, m.ExpectedArrivalTime, m.BlockRef,
m.LineRef, m.DirectionRef, m.ArrivalProximityText,
m.DistanceFromStop, m.EstimatedPassengerCapacity,
m.AimedArrivalTime, m.PublishedLineName,
m.ProgressStatus, m.DestinationRef, m.EstimatedPassengerCount,
m.OriginRef, m.NumberOfStopsAway, m.ts
FROM jsonmta /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ m
FULL OUTER JOIN jsontranscom /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ t
ON (t.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3)
AND (t.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3)
AND (t.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3)
AND (t.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3)
FULL OUTER JOIN nytrafficspeed /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ n
ON (n.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3)
AND (n.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3)
AND (n.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3)
AND (n.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3)
WHERE m.VehicleRef is not null
AND t.title is not null
https://medium.com/@tspann/cdc-not-cat-data-capture-e43713879c03
FLaNK for Halifax Canada Transit - NiFi, Kafka, Flink, SQL, GTFS-RT | by Tim Spann | Cloudera | Dec, 2023 | Medium
Never Get Lost in the Stream. NiFi-Kafka-Flink for getting to work… | by Tim Spann | Cloudera | Dec, 2023 | Medium
Iteration 1: Building a System to Consume All the Real-Time Transit Data in the World At Once | by Tim Spann | Cloudera | Medium
Watching Airport Traffic in Real-Time | by Tim Spann | Cloudera | Medium
Resources
https://medium.com/@tspann/llm-pipelines-with-pinecone-and-huggingface-with-python-and-apache-nifi-a96c20be93b7
THANK YOU

More Related Content

What's hot (20)

PPTX
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
PPTX
Flink SQL & TableAPI in Large Scale Production at Alibaba
DataWorks Summit
 
PPTX
Using Camunda on Kubernetes through Operators
camunda services GmbH
 
PDF
Kogito: cloud native business automation
Mario Fusco
 
PDF
천만 사용자를 위한 AWS 아키텍처 보안 모범 사례 (윤석찬, 테크에반젤리스트)
Amazon Web Services Korea
 
PDF
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
PPTX
Apache Flink and what it is used for
Aljoscha Krettek
 
PDF
Apache Kafka Fundamentals for Architects, Admins and Developers
confluent
 
PDF
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
Google Cloud Platform - Japan
 
PPTX
Apache kafka
Kumar Shivam
 
PDF
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
Amazon Web Services Japan
 
PDF
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PDF
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
PDF
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
HostedbyConfluent
 
PDF
Introducing Vault
Ramit Surana
 
PDF
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
HostedbyConfluent
 
PDF
Apache Airflow
Knoldus Inc.
 
PPTX
Airflow - a data flow engine
Walter Liu
 
PPTX
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
PDF
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
DataWorks Summit
 
Using Camunda on Kubernetes through Operators
camunda services GmbH
 
Kogito: cloud native business automation
Mario Fusco
 
천만 사용자를 위한 AWS 아키텍처 보안 모범 사례 (윤석찬, 테크에반젤리스트)
Amazon Web Services Korea
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
DataWorks Summit
 
Apache Flink and what it is used for
Aljoscha Krettek
 
Apache Kafka Fundamentals for Architects, Admins and Developers
confluent
 
[Cloud OnAir] BigQuery へデータを読み込む 2019年3月14日 放送
Google Cloud Platform - Japan
 
Apache kafka
Kumar Shivam
 
[Aurora事例祭り]Amazon Aurora を使いこなすためのベストプラクティス
Amazon Web Services Japan
 
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
HostedbyConfluent
 
Introducing Vault
Ramit Surana
 
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
HostedbyConfluent
 
Apache Airflow
Knoldus Inc.
 
Airflow - a data flow engine
Walter Liu
 
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 

Similar to DBA Fundamentals Group: Continuous SQL with Kafka and Flink (20)

PDF
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
PDF
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
Timothy Spann
 
PDF
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
PDF
big data fest building modern data streaming apps
Timothy Spann
 
PDF
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
PDF
RTAS 2023: Building a Real-Time IoT Application
Timothy Spann
 
PDF
Unconference Round Table Notes
Timothy Spann
 
PDF
BigDataFest Building Modern Data Streaming Apps
ssuser73434e
 
PDF
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
PDF
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
PDF
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Timothy Spann
 
PDF
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
PDF
Introduction to apache kafka, confluent and why they matter
Paolo Castagna
 
PDF
Meetup Streaming Data Pipeline Development
Timothy Spann
 
PDF
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
ssuser73434e
 
PDF
Building Real-Time Travel Alerts
Timothy Spann
 
PDF
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 
PDF
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
PDF
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
PDF
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
Timothy Spann
 
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
big data fest building modern data streaming apps
Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
RTAS 2023: Building a Real-Time IoT Application
Timothy Spann
 
Unconference Round Table Notes
Timothy Spann
 
BigDataFest Building Modern Data Streaming Apps
ssuser73434e
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
PartnerSkillUp_Enable a Streaming CDC Solution
Timothy Spann
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Timothy Spann
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
Introduction to apache kafka, confluent and why they matter
Paolo Castagna
 
Meetup Streaming Data Pipeline Development
Timothy Spann
 
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
ssuser73434e
 
Building Real-Time Travel Alerts
Timothy Spann
 
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
Ad

More from Timothy Spann (20)

PDF
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
PDF
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
PDF
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
PDF
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
PDF
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
PDF
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
PDF
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
PDF
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
PPTX
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
PDF
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
PDF
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
PDF
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
PDF
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
PDF
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
PDF
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
PDF
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
PDF
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
PDF
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
Ad

Recently uploaded (20)

PDF
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
Shane Coughlan
 
PPTX
Wondershare Filmora Crack 14.5.18 + Key Full Download [Latest 2025]
HyperPc soft
 
PDF
Alur Perkembangan Software dan Jaringan Komputer
ssuser754303
 
PDF
IObit Uninstaller Pro 14.3.1.8 Crack for Windows Latest
utfefguu
 
PDF
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
PDF
The Rise of Sustainable Mobile App Solutions by New York Development Firms
ostechnologies16
 
PDF
AI Software Development Process, Strategies and Challenges
Net-Craft.com
 
PPTX
ERP Systems in the UAE: Driving Business Transformation with Smart Solutions
dheeodoo
 
PDF
Building scalbale cloud native apps with .NET 8
GillesMathieu10
 
PDF
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
 
PPTX
IObit Driver Booster Pro 12.4-12.5 license keys 2025-2026
chaudhryakashoo065
 
PPTX
Avast Premium Security crack 25.5.6162 + License Key 2025
HyperPc soft
 
PDF
Rewards and Recognition (2).pdf
ethan Talor
 
PPTX
arctitecture application system design os dsa
za241967
 
DOCX
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
PPTX
IDM Crack with Internet Download Manager 6.42 [Latest 2025]
HyperPc soft
 
PDF
Writing Maintainable Playwright Tests with Ease
Shubham Joshi
 
PDF
Telemedicine App Development_ Key Factors to Consider for Your Healthcare Ven...
Mobilityinfotech
 
PPTX
declaration of Variables and constants.pptx
meemee7378
 
PDF
Mastering VPC Architecture Build for Scale from Day 1.pdf
Devseccops.ai
 
OpenChain Webinar - AboutCode - Practical Compliance in One Stack – Licensing...
Shane Coughlan
 
Wondershare Filmora Crack 14.5.18 + Key Full Download [Latest 2025]
HyperPc soft
 
Alur Perkembangan Software dan Jaringan Komputer
ssuser754303
 
IObit Uninstaller Pro 14.3.1.8 Crack for Windows Latest
utfefguu
 
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
The Rise of Sustainable Mobile App Solutions by New York Development Firms
ostechnologies16
 
AI Software Development Process, Strategies and Challenges
Net-Craft.com
 
ERP Systems in the UAE: Driving Business Transformation with Smart Solutions
dheeodoo
 
Building scalbale cloud native apps with .NET 8
GillesMathieu10
 
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
 
IObit Driver Booster Pro 12.4-12.5 license keys 2025-2026
chaudhryakashoo065
 
Avast Premium Security crack 25.5.6162 + License Key 2025
HyperPc soft
 
Rewards and Recognition (2).pdf
ethan Talor
 
arctitecture application system design os dsa
za241967
 
Best AI-Powered Wearable Tech for Remote Health Monitoring in 2025
SEOLIFT - SEO Company London
 
IDM Crack with Internet Download Manager 6.42 [Latest 2025]
HyperPc soft
 
Writing Maintainable Playwright Tests with Ease
Shubham Joshi
 
Telemedicine App Development_ Key Factors to Consider for Your Healthcare Ven...
Mobilityinfotech
 
declaration of Variables and constants.pptx
meemee7378
 
Mastering VPC Architecture Build for Scale from Day 1.pdf
Devseccops.ai
 

DBA Fundamentals Group: Continuous SQL with Kafka and Flink

  • 1. Continuous SQL with Kafka and Flink Tim Spann Principal Developer Advocate February 20, 2024 https://www.meetup.com/dba-fundamentals-group/events/296855261/
  • 3. 3 Tim Spann Twitter: @PaasDev // Blog: datainmotion.dev Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC, ex-HPE https://medium.com/@tspann https://github.com/tspannhw
  • 4. 4 FLaNK Stack Weekly by Tim Spann This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft https://www.meetup.com/futureofdata- princeton/
  • 5. © 2023 Cloudera, Inc. All rights reserved. 5 Future of Data - NYC + NJ + Philly + Virtual @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 8. © 2023 Cloudera, Inc. All rights reserved. 8 BUILDING REAL-TIME REQUIRES A TEAM
  • 9. © 2023 Cloudera, Inc. All rights reserved. 9 Already using Spark? Need NiFi? Need Flink? Want unified Batch/Stream? Want highest Throughput? Don’t need low latency? Large files? Scheduled batches? Replacing Sqoop, ETL Simple JDBC queries? Transform individual records? Want easy development? Lots of small files, events, records, rows? Continuous stream of rows Support many different sources Need Microservices, Batch and Stream? Want high Throughput? Want Low Latency? Want Advanced Windowing and State? Happy with a New Solution that is best-in-class? Spark, NiFi, Flink? Which engine to choose?
  • 10. 10 Live Q&A Travel Advisories Weather Reports Documents Social Media Databases Transactions Public Data Feeds S3 / Files Logs ATM Data Live Chat … HYBRID CLOUD INTERACT COLLECT STORE ENRICH, REPORT Distribute Collect Report REPORT Visualize Report, Automate AI BASED ENHANCEMENTS Predict, Automate VECTOR DATABASE LLM Machine Learning Data Visualization Data Flow Data Warehouse SQL Stream Builder Data Visualization Input Sentences Generated Text Timestamp Input Sentence Timestamps Enrichments Messaging Broker Real-time alerting Real-time alerting Aggregations
  • 12. 12 Analytics-in-Stream Data Sources Streaming Storage Substrate Cloudera Stream Processing Kafka + NiFi enables real-time ingestion into lakes / analytics services Data Distribution Service Cloudera DataFlow Warehouses & Operational DB Data Lakes & Lake Houses Data-At-Rest Analytics Data Apps Powered by Streaming Insights and used by other Analytics Services Kafka + Flink enables streaming analytics Cloudera Stream Processing Streaming Analytics Low Latency Data Products Data-In-Motion Streaming Analytics Cloudera Edge Flow Edge Ingest
  • 13. 13 Flink Connectors Streaming Data Pipelines with Cloudera Data Platform Edge devices Cloud DB’s SaaS tools Change Data Capture On prem apps/DB’s Streams Any destination Custom Kafka producers Custom Kafka consumers Operational Applications Analytical Applications STREAMING DATA MOVEMENT & PROCESSING STREAMING DATA PROCESSING & ANALYTICS
  • 16. 16
  • 20. © 2023 Cloudera, Inc. All rights reserved. 20 What is Can You Do With Apache Kafka? Web site activity: track page views, searches, etc. in real time Events & log aggregation: particularly in distributed systems where messages come from multiple sources Monitoring and metrics: aggregate statistics from distributed applications and build a dashboard application Stream processing: process raw data, clean it up, and forward it on to another topic or messaging system Real-time data ingestion: fast processing of a very large volume of messages
  • 21. © 2019 Cloudera, Inc. All rights reserved. 21 STREAMS MESSAGING WITH KAFKA • Highly reliable distributed messaging system. • Decouple applications, enables many-to-many patterns. • Publish-Subscribe semantics. • Horizontal scalability. • Efficient implementation to operate at speed with big data volumes. • Organized by topic to support several use cases.
  • 23. 23 CONTINUOUS SQL ● SSB is a Continuous SQL engine ● It’s SQL, but a slightly different mental model, but with big implications Traditional Parse/Execute/Fetch model Continuous SQL Model Hint: The query is boundless and never finishes, and time matters AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
  • 24. 24 Flink SQL -- specify Kafka partition key on output SELECT foo AS _eventKey FROM sensors -- use event time timestamp from kafka -- exactly once compatible SELECT eventTimestamp FROM sensors -- nested structures access SELECT foo.’bar’ FROM table; -- must quote nested column -- timestamps SELECT * FROM payments WHERE eventTimestamp > CURRENT_TIMESTAMP-interval '10' second; -- unnest SELECT b.*, u.* FROM bgp_avro b, UNNEST(b.path) AS u(pathitem) -- aggregations and windows SELECT card, MAX(amount) as theamount, TUMBLE_END(eventTimestamp, interval '5' minute) as ts FROM payments WHERE lat IS NOT NULL AND lon IS NOT NULL GROUP BY card, TUMBLE(eventTimestamp, interval '5' minute) HAVING COUNT(*) > 4 -- >4==fraud -- try to do this ksql! SELECT us_west.user_score+ap_south.user_score FROM kafka_in_zone_us_west us_west FULL OUTER JOIN kafka_in_zone_ap_south ap_south ON us_west.user_id = ap_south.user_id; Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
  • 25. 25 © 2022 Cloudera, Inc. All rights reserved. SQL STREAM BUILDER (SSB) SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more Enrich streaming data with batch data in a single tool Democratize access to real-time data with just SQL
  • 26. 26 SSB MATERIALIZED VIEWS Key Takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose
  • 27. © 2019 Cloudera, Inc. All rights reserved. 27 ICEBERG INTEGRATION Robust Next Generation Architecture for Data Driven Business Unified Processing Engine Massive Open table format Iceberg Support for Flink APIs through SSB • Maximally open • Maximally flexible • Ultra high performance for MASSIVE data
  • 29. 29 Continuous SQL select max(alt_baro) as MaxAltitudeFeet, min(alt_baro) as MinAltitudeFeet, avg(alt_baro) as AvgAltitudeFeet, max(alt_geom) as MaxGAltitudeFeet, min(alt_geom) as MinGAltitudeFeet, avg(alt_geom) as AvgGAltitudeFeet, max(gs) as MaxGroundSpeed, min(gs) as MinGroundSpeed, avg(gs) as AvgGroundSpeed, count(alt_baro) as RowCount, hex as ICAO, flight as IDENT from `sr1`.`default_database`.`adsb` group by flight, hex; select transcom.title, transcom.description, mta.VehicleRef, DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING), mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) as miles, mta.StopPointName, mta.Bearing, mta.DestinationName, mta.ExpectedArrivalTime, mta.VehicleLocationLatitude, mta.VehicleLocationLongitude, mta.ArrivalProximityText, mta.DistanceFromStop, mta.AimedArrivalTime, mta.`Date`, mta.ts, mta.uuid, mta.EstimatedPassengerCapacity, mta.EstimatedPassengerCount from `schemareg1`.`default_database`.`mta` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ mta FULL OUTER JOIN `schemareg1`.`default_database`.`transcom` /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ transcom ON (transcom.latitude >= CAST(mta.VehicleLocationLatitude as float) - 0.3) AND (transcom.longitude >= CAST(mta.VehicleLocationLongitude as float) - 0.3) AND (transcom.latitude <= CAST(mta.VehicleLocationLatitude as float) + 0.3) AND (transcom.longitude <= CAST(mta.VehicleLocationLongitude as float) + 0.3) WHERE mta.VehicleRef is not null AND transcom.title is not null AND DISTANCE_BETWEEN(CAST(transcom.latitude as STRING), CAST(transcom.longitude as STRING), mta.VehicleLocationLatitude, mta.VehicleLocationLongitude) <= 120
  • 30. 30 Real-time observability pipeline MiNiFi agents Raw Logs Cloudera Data Flow Cloudera Data Lakehouse Triaging ML Models Threat Hunting Response and Investigation UEBA/Fraud Detection Reports Auto Action Cloudera Streaming Analytics Cybersec Toolkits Parse, Triage, Profile Cloudera Streams Processing Kafka SQL Stream Builder SPLUNK / SIEM /EXTERNAL Cloudera Machine Learning Collect Route/Filter/ Transform Prepare/Analyze/ Alert
  • 33. 33 Data in Motion: Overview and What's New in NiFi, Kafka, and Flink Presenter: Tim Spann - Principal DIM Specialist and Developer Advocate https://medium.com/cloudera-inc/transit-in-sao-paulo-brasil-flank-style-eaec6753cc63
  • 38. © 2023 Cloudera, Inc. All rights reserved. 38 Already using Kafka? Already using NiFi? Need for Fast Flink? Simple setup for many tables Want metadata augmented data Don’t need low latency? Visual monitoring Easy manual scaling Easy to combine with NiFi Debezium Simple JDBC queries? Transform individual records? Want easy development with UI? Lots of small files, events, records, rows? Continuous stream of rows Support many different sources Debezium coming Strong control of table and joins Want high Throughput? Want Low Latency? Want Advanced Windowing and State? Automatic records immediately Pure SQL Debezium Kafka Connect, NiFi, Flink? Which engine to choose? Or All 3?
  • 39. CDC ARCHITECTURE - Using FLaNK to pull the data out of anything in near-real time INGEST PREPARE PUBLISH DATA SOURCES Internal Users (After Sales) External Systems ENTERPRISE LAKEHOUSE CAPABILITY VIEW INGESTION ENTERPRISE DATA MESSAGE HUB STORAGE BATCH MANAGEMENT STREAM CONSUMPTION Closed Loop Systems SQL Stream Builder Machine Learning Data Visualization Workload Manager watsonx.data
  • 40. 40 Data Distribution as a Universal, Hybrid, Multi-Cloud Data Service Universal Data Distribution Service (Ingest, Transform, Deliver) Ingest Processors Ingest Gateway Router, Filter & Transform Processors Destination Processors Cloud Business Process Services* Log Data Sources Laptops / Servers Security Agents IOT Devices App Logs Mobile Apps Cloud Data Analytics/ Service * On-Prem Data Sources Cloud Warehouse (Cloudera DW) Big Data Cloud Services Multi-Cloud Data Distribution Service that Solves the First & Last Mile Problem for the Modern Data Stack
  • 41. CDC with SQL Stream Builder (Flink SQL)
  • 42. © 2023 Cloudera, Inc. All rights reserved. 42 Streaming CDC with Cloudera SQL Stream Builder (Flink SQL) https://github.com/tspannhw/FLaNK-CDC/blob/main/flinkcdc.MD
  • 43. © 2023 Cloudera, Inc. All rights reserved. 43 https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html CDC with Debezium and Flink SQL Stream Builder with Flink SQL
  • 44. © 2023 Cloudera, Inc. All rights reserved. 44 CDC with Debezium and Flink SQL Stream Builder with Flink SQL
  • 46. © 2023 Cloudera, Inc. All rights reserved. 46 CREATE TABLE `postgres_cdc_newjerseybus` ( `title` STRING, `description` STRING, `link` STRING, `guid` STRING, `advisoryAlert` STRING, `pubDate` STRING, `ts` STRING, `companyname` STRING, `uuid` STRING, `servicename` STRING ) WITH ( 'connector' = 'postgres-cdc', 'database-name' = 'tspann', 'hostname' = '192.168.1.153', 'password' = 'tspann', 'decoding.plugin.name' = 'pgoutput', 'schema-name' = 'public', 'table-name' = 'newjerseybus', 'username' = 'tspann', 'port' = '5432' ); Flink SQL Tables - Debezium CDC From Database Tables
  • 47. © 2023 Cloudera, Inc. All rights reserved. 47 Flink SQL Tables - Upsert to Kafka Topics CREATE TABLE `upsert_kafka_newjerseybus` ( `title` String, `description` String, `link` String, `guid` String, `advisoryAlert` String, `pubDate` String, `ts` String, `companyname` String, `uuid` String, `servicename` String, `eventTimestamp` TIMESTAMP(3), WATERMARK FOR `eventTimestamp` AS `eventTimestamp` - INTERVAL '5' SECOND, PRIMARY KEY (uuid) NOT ENFORCED ) WITH ( 'connector' = 'upsert-kafka', 'topic' = 'kafka_newjerseybus', 'properties.bootstrap.servers' = 'kafka:9092', 'key.format' = 'json', 'value.format' = 'json' );
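The two table definitions above (the postgres-cdc source and the upsert-kafka sink) can be connected with a single continuous INSERT. The statement below is a sketch, not the author's pipeline: the slides do not show how eventTimestamp is populated, so LOCALTIMESTAMP (processing time) is used here as a stand-in assumption because the format of the ts string column is not given.

```sql
-- Sketch: stream change events from the Postgres CDC table into the Kafka upsert topic.
-- Assumption: eventTimestamp is filled with processing time (LOCALTIMESTAMP),
-- since the deck does not say how it is derived from the source row.
INSERT INTO `upsert_kafka_newjerseybus`
SELECT
  `title`,
  `description`,
  `link`,
  `guid`,
  `advisoryAlert`,
  `pubDate`,
  `ts`,
  `companyname`,
  `uuid`,
  `servicename`,
  LOCALTIMESTAMP AS `eventTimestamp`
FROM `postgres_cdc_newjerseybus`;
```

Because the sink declares uuid as its primary key, inserts and updates from the database are written as keyed records on the topic, and deletes are written as tombstones (records with a null value) for that key.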
  • 49. https://github.com/tspannhw/FLaNK-Transit SELECT n.speed, n.travel_time, n.borough, n.link_name, n.link_points, n.latitude, n.longitude, DISTANCE_BETWEEN(CAST(t.latitude as STRING), CAST(t.longitude as STRING), m.VehicleLocationLatitude, m.VehicleLocationLongitude) as miles, t.title, t.`description`, t.pubDate, t.latitude, t.longitude, m.VehicleLocationLatitude, m.VehicleLocationLongitude, m.StopPointRef, m.VehicleRef, m.ProgressRate, m.ExpectedDepartureTime, m.StopPoint, m.VisitNumber, m.DataFrameRef, m.StopPointName, m.Bearing, m.OriginAimedDepartureTime, m.OperatorRef, m.DestinationName, m.ExpectedArrivalTime, m.BlockRef, m.LineRef, m.DirectionRef, m.ArrivalProximityText, m.DistanceFromStop, m.EstimatedPassengerCapacity, m.AimedArrivalTime, m.PublishedLineName, m.ProgressStatus, m.DestinationRef, m.EstimatedPassengerCount, m.OriginRef, m.NumberOfStopsAway, m.ts FROM jsonmta /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ m FULL OUTER JOIN jsontranscom /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ t ON (t.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3) AND (t.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3) AND (t.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3) AND (t.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3) FULL OUTER JOIN nytrafficspeed /*+ OPTIONS('scan.startup.mode' = 'earliest-offset') */ n ON (n.latitude >= CAST(m.VehicleLocationLatitude as float) - 0.3) AND (n.longitude >= CAST(m.VehicleLocationLongitude as float) - 0.3) AND (n.latitude <= CAST(m.VehicleLocationLatitude as float) + 0.3) AND (n.longitude <= CAST(m.VehicleLocationLongitude as float) + 0.3) WHERE m.VehicleRef is not null AND t.title is not null
  • 51. FLaNK for Halifax Canada Transit — NiFi, Kafka, Flink, SQL, GTFS-RT | by Tim Spann | Cloudera | Dec, 2023 | Medium Never Get Lost in the Stream. NiFi-Kafka-Flink for getting to work… | by Tim Spann | Cloudera | Dec, 2023 | Medium Iteration 1: Building a System to Consume All the Real-Time Transit Data in the World At Once | by Tim Spann | Cloudera | Medium Watching Airport Traffic in Real-Time | by Tim Spann | Cloudera | Medium