ETL is Dead.
Long Live Streams with Apache Kafka.
Taras Kloba,
BI Team Lead/Data Architect,
Intellias
Agenda
• About me;
• One problem in data transfer;
• Ways to solve this problem;
• About Apache Kafka;
• Demo of reliable data delivery;
• Questions?
Taras Kloba
• 7 years of experience with databases;
• Certified Data Engineer on Google Cloud;
• Certified Expert Microsoft SQL Server;
• Co-organizer of “SQL Saturday” in Lviv and Krakow;
• Trainer, speaker, consultant;
• Owner of the “SQL” trademark in Ukraine.
SQL.ua,
CEO/Founder
Intellias,
BI Team Lead/Data Architect
Quick facts
(Q62JCJRJGY77)(9DG5NZ4EVP7A) (M2HE6LPRJ6MV)
My current project:
One of the biggest B2B software solutions
for the iGaming industry in the world.
+300 GB of new data every day
Previous legacy system
00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07 00:08 00:09
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00';
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:04:00' AND '2018-11-03 00:08:00';
?
Previous legacy system
00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07 00:08 00:09
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00';
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 00:02:00' AND '2018-11-03 00:06:00';
?
Phantom reads (classical definition)
Phantom reads (in our case)
Tnx 1: 2018-11-03 12:00:00
Tnx 2: 2018-11-03 12:01:00
Tnx 2 commit: 12:03:00
Tnx 1 commit: 12:05:00
SELECT *
FROM fact_transactions
WHERE upd BETWEEN '2018-11-03 11:58:00' AND '2018-11-03 12:04:00';
Trans_id  Upd
2         2018-11-03 12:01:00
?
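The missed row on the slide above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the in-memory `ROWS` list stands in for `fact_transactions`, the `committed_at` field and the `poll` helper are hypothetical, and the second polling window (12:04 to 12:10) is invented for illustration. It assumes each poll runs at the end of its window and sees only rows whose transaction has already committed.

```python
from datetime import datetime

# Hypothetical stand-in for fact_transactions: each row keeps the upd value it
# was written with and the moment its transaction committed (from the slide).
ROWS = [
    {"trans_id": 1, "upd": datetime(2018, 11, 3, 12, 0),
     "committed_at": datetime(2018, 11, 3, 12, 5)},
    {"trans_id": 2, "upd": datetime(2018, 11, 3, 12, 1),
     "committed_at": datetime(2018, 11, 3, 12, 3)},
]


def poll(window_start, window_end, now):
    """Return ids visible at `now`: committed rows whose upd is in the window."""
    return [
        row["trans_id"]
        for row in ROWS
        if row["committed_at"] <= now and window_start <= row["upd"] <= window_end
    ]


# First poll: the query from the slide, executed at 12:04.
first_poll = poll(datetime(2018, 11, 3, 11, 58), datetime(2018, 11, 3, 12, 4),
                  now=datetime(2018, 11, 3, 12, 4))
# Second poll: an assumed follow-up window, executed at 12:10.
second_poll = poll(datetime(2018, 11, 3, 12, 4), datetime(2018, 11, 3, 12, 10),
                   now=datetime(2018, 11, 3, 12, 10))

seen = set(first_poll) | set(second_poll)
missed = [row["trans_id"] for row in ROWS if row["trans_id"] not in seen]
print(first_poll, second_poll, missed)  # [2] [] [1]
```

Tnx 1 is never picked up: it was not yet committed when its window was read, and its `upd` value falls before every later window.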
#1. Isolation levels - Serializable
With a lock-based concurrency control
DBMS implementation, serializability
requires read and write locks (acquired on
selected data) to be released at the end of
the transaction. Also range-locks must be
acquired when a SELECT query uses a
ranged WHERE clause, especially to avoid
the phantom reads phenomenon.
Not the best option for high-load systems.
#2. Triggers
Traditionally, the most common technique
used for capturing events was to use
database or application-level triggers. This
technique is still very widespread because of
its simplicity and familiarity.
Not the best option for high-load systems.
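A minimal sketch of trigger-based capture, using Python's built-in `sqlite3` module so it runs without a server. The `change_log` table and the trigger names are hypothetical, and SQLite is only a stand-in: the slides do not say which database the triggers ran on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_transactions (trans_id INTEGER PRIMARY KEY, upd TEXT);

-- Hypothetical audit table that the triggers below write into.
CREATE TABLE change_log (
    seq      INTEGER PRIMARY KEY AUTOINCREMENT,
    op       TEXT,
    trans_id INTEGER,
    upd      TEXT
);

CREATE TRIGGER trg_fact_insert AFTER INSERT ON fact_transactions
BEGIN
    INSERT INTO change_log (op, trans_id, upd)
    VALUES ('I', NEW.trans_id, NEW.upd);
END;

CREATE TRIGGER trg_fact_update AFTER UPDATE ON fact_transactions
BEGIN
    INSERT INTO change_log (op, trans_id, upd)
    VALUES ('U', NEW.trans_id, NEW.upd);
END;
""")

# Every write to the fact table now also writes one change_log row.
conn.execute("INSERT INTO fact_transactions VALUES (1, '2018-11-03 12:00:00')")
conn.execute(
    "UPDATE fact_transactions SET upd = '2018-11-03 12:05:00' WHERE trans_id = 1"
)
conn.commit()

changes = conn.execute(
    "SELECT seq, op, trans_id, upd FROM change_log ORDER BY seq"
).fetchall()
print(changes)
```

Each insert or update does a second write inside the same transaction, which is where the overhead under high load comes from.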
#3. Change Data Capture
Change data capture (CDC) is a set of software design patterns used to
determine (and track) the data that has
changed so that action can be taken using
the changed data. Also, change data
capture (CDC) is an approach to data
integration that is based on the identification,
capture and delivery of the changes made to
enterprise data sources.
(Wikipedia)
CDC solutions occur most often in data warehouse
environments, since capturing and
preserving the state of data across time is one of
the core functions of a data warehouse.
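One way to picture log-based CDC is an append-only log whose entries are ordered by commit, read by a consumer that remembers its position. The sketch below is a toy model, not a real database log: `ChangeLog`, `Consumer` and their methods are hypothetical names. It replays the two transactions from the phantom-read slide to show that ordering by commit position, rather than by the `upd` column, lets the consumer see both.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Hypothetical append-only log; a position is assigned at commit time."""
    entries: list = field(default_factory=list)

    def commit(self, trans_id, upd):
        self.entries.append(
            {"pos": len(self.entries), "trans_id": trans_id, "upd": upd}
        )

    def read_from(self, pos):
        return self.entries[pos:]


@dataclass
class Consumer:
    """Reads the log by position instead of filtering on the upd column."""
    next_pos: int = 0

    def poll(self, log):
        batch = log.read_from(self.next_pos)
        self.next_pos += len(batch)
        return [entry["trans_id"] for entry in batch]


log = ChangeLog()
consumer = Consumer()

log.commit(2, "2018-11-03 12:01:00")  # Tnx 2 commits first (12:03)
batch_a = consumer.poll(log)
log.commit(1, "2018-11-03 12:00:00")  # Tnx 1 commits later (12:05)
batch_b = consumer.poll(log)

print(batch_a, batch_b)  # [2] [1]
```

Tnx 1 carries the older `upd` value but still arrives in the second batch, because the consumer's position only moves forward through committed entries.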
Apache Kafka
Kafka® is used for building real-time data
pipelines and streaming apps. It is
horizontally scalable, fault-tolerant, wicked
fast, and runs in production in thousands of
companies.
Typical data flow in companies
Streaming platform to coordinate all data flows.
Kafka Connect API (E and L in Streaming ETL)
• Scalability: Leverages Kafka for
scalability
• Fault tolerance: Builds on Kafka’s
fault tolerance model
• Management and monitoring: One
way of monitoring all connectors
• Schemas: Offers an option for
preserving schemas from source to
sink
Kafka Connect. Create new connector.
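Connectors are typically registered by POSTing a JSON document with `name` and `config` to the Kafka Connect REST endpoint `/connectors`. The sketch below only builds that request and does not send it. Everything specific is an assumption: the slides do not name a connector, so the Debezium PostgreSQL connector is used purely as an illustration (the conclusion mentions a logical replication slot), and the host names, credentials, slot name and topic prefix are placeholders. Property names can differ between connector versions, so check them against the documentation for the version in use.

```python
import json
import urllib.request


def build_connector_request(connect_url, name, config):
    """Build (but do not send) a POST /connectors request for Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed example: a Debezium PostgreSQL source connector. All values below
# are placeholders, not settings taken from the talk.
connector_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "change-me",
    "database.dbname": "dwh",
    "plugin.name": "pgoutput",
    "slot.name": "fact_transactions_slot",
    "table.include.list": "public.fact_transactions",
    "topic.prefix": "dwh",
}

request = build_connector_request(
    "http://localhost:8083", "fact-transactions-source", connector_config
)
print(request.get_method(), request.full_url)
print(json.dumps(json.loads(request.data), indent=2))

# To actually register the connector against a running Connect worker:
# urllib.request.urlopen(request)
```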
Kafka’s Streams API (The T in ETL)
• Easiest way to do stream
processing using Kafka;
• True event-at-a-time stream
processing; no microbatching;
• Dataflow-style windowing
based on event-time; handles
late-arriving data
Kafka Streams API. Create new processor.
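The Kafka Streams API itself is a Java library, so the following is not Kafka Streams code. It is a plain Python sketch of the ideas named on the previous slide: events are handled one at a time, grouped into tumbling windows by event time, and late events are still counted if they arrive within a grace period. `WindowedCounter`, the window size, the grace period and the integer-second timestamps are all hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60   # hypothetical tumbling window size
GRACE_SECONDS = 120   # hypothetical allowance for late-arriving events


class WindowedCounter:
    """Counts events per event-time window, one event at a time."""

    def __init__(self):
        self.counts = defaultdict(int)  # window start -> number of events
        self.stream_time = 0            # highest event time seen so far
        self.dropped = 0                # events that arrived too late

    def process(self, event_time):
        """Handle one event; return (window_start, count) or None if dropped."""
        self.stream_time = max(self.stream_time, event_time)
        window_start = event_time - event_time % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_end + GRACE_SECONDS <= self.stream_time:
            # The window closed before this event showed up.
            self.dropped += 1
            return None
        self.counts[window_start] += 1
        return window_start, self.counts[window_start]


counter = WindowedCounter()
# Event times in seconds; 50 arrives late but within grace, 30 arrives too late.
events = [5, 20, 70, 50, 400, 30]
results = [counter.process(t) for t in events]
print(results)
# [(0, 1), (0, 2), (60, 1), (0, 3), (360, 1), None]
```

The event at 50 updates the first window even though a later event (70) was already processed, while the event at 30 is dropped because stream time has moved past the window end plus the grace period.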
Demo
Conclusion
• Apache Kafka is robust
• Triggers will keep your data in sync but can have significant performance
overhead
• Utilizing a logical replication slot can eliminate trigger overhead and transfer the
computation load elsewhere
• Not a panacea: still need to use good architectural patterns
Questions?
Thank you!
Taras Klioba
+38 093 74 876 15
taras@klioba.com
