SlideShare a Scribd company logo
Queue with
asyncio and Kafka
Showcase
Ondřej Veselý
What kind of data we have
Problem:
store JSON to database
Just a few records
per second.
But
● Slow database
● Unreliable database
● Increasing traffic (20x)
def save_data(conn, cur, ts, data):
cur.execute(
"""INSERT INTO data (timestamp, data)
VALUES (%s,%s) """, (ts, ujson.dumps(data)))
conn.commit()
@app.route('/store', method=['PUT', 'POST'])
def logstash_route():
data = ujson.load(request.body)
conn = psycopg2.connect(**config.pg_logs)
t = datetime.now()
with conn.cursor(cursor_factory=DictCursor) as cur:
for d in data:
save_data(conn, cur, t, d)
conn.close()
Old code
Architecture
internet
Kafka producer
/store
Kafka consumer
Kafka queue
Postgres
… time to kill consumer ...
Asyncio, example
import asyncio
async def factorial(name, number):
f = 1
for i in range(2, number+1):
print("Task %s: Compute factorial(%s)..." % (name, i))
await asyncio.sleep(1)
f *= i
print("Task %s: factorial(%s) = %s" % (name, number, f))
loop = asyncio.get_event_loop()
tasks = [
asyncio.ensure_future(factorial("A", 2)),
asyncio.ensure_future(factorial("B", 3)),
asyncio.ensure_future(factorial("C", 4))]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
Task A: Compute factorial(2)...
Task B: Compute factorial(2)...
Task C: Compute factorial(2)...
Task A: factorial(2) = 2
Task B: Compute factorial(3)...
Task C: Compute factorial(3)...
Task B: factorial(3) = 6
Task C: Compute factorial(4)...
Task C: factorial(4) = 24
What we used
Apache Kafka
Not ujson
Concurrency - doing lots of slow things at once.
No processes, no threads.
Producer
from aiohttp import web
import json
Consumer
import asyncio
import json
from aiokafka import AIOKafkaConsumer
import aiopg
Producer #1
async def kafka_send(kafka_producer, data, topic):
message = {
'data': data,
'received': str(arrow.utcnow())
}
message_json_bytes = bytes(json.dumps(message), 'utf-8')
await kafka_producer.send_and_wait(topic, message_json_bytes)
async def handle(request):
post_data = await request.json()
try:
await kafka_send(request.app['kafka_p'], post_data, topic=settings.topic)
except:
slog.exception("Kafka Error")
await destroy_all()
return web.Response(status=200)
app = web.Application()
app.router.add_route('POST', '/store', handle)
app['kafka_p'] = get_kafka_producer()
Destroying the loop
async def destroy_all():
loop = asyncio.get_event_loop()
for task in asyncio.Task.all_tasks():
task.cancel()
await loop.stop()
await loop.close()
slog.debug("Exiting.")
sys.exit()
def get_kafka_producer():
loop = asyncio.get_event_loop()
producer = AIOKafkaProducer(
loop=loop,
bootstrap_servers=settings.queues_urls,
request_timeout_ms=settings.kafka_timeout,
retry_backoff_ms=1000)
loop.run_until_complete(producer.start())
return producer
Getting producer
Producer #2
Consume
… time to resurrect consumer ...
DB
connected
1. Receive data record from Kafka
2. Put it to the queue
start
yesno
Flush
queue full enough
or
data old enough
Store data from queue to DB
yesno
Connect to DB
start
asyncio.Queue()
Consumer #1
def main():
dbs_connected = asyncio.Future()
batch = asyncio.Queue(maxsize=settings.batch_max_size)
asyncio.ensure_future(consume(batch, dbs_connected))
asyncio.ensure_future(start_flushing(batch, dbs_connected))
loop.run_forever()
async def consume(queue, dbs_connected):
await asyncio.wait_for(dbs_connected, timeout=settings.wait_for_databases)
consumer = AIOKafkaConsumer(
settings.topic, loop=loop, bootstrap_servers=settings.queues_urls,
group_id='consumers'
)
await consumer.start()
async for msg in consumer:
message = json.loads(msg.value.decode("utf-8"))
await queue.put((message.get('received'), message.get('data')))
await consumer.stop()
Consumer #2
async def start_flushing(queue, dbs_connected):
db_logg = await aiopg.create_pool(settings.logs_db_url)
while True:
async with db_logg.acquire() as logg_conn, logg_conn.cursor() as logg_cur:
await keep_flushing(dbs_connected, logg_cur, queue)
await asyncio.sleep(2)
async def keep_flushing(dbs_connected, logg_cur, queue):
dbs_connected.set_result(True)
last_stored_time = time.time()
while True:
if not queue.empty() and (queue.qsize() > settings.batch_flush_size or
time.time() - last_stored_time > settings.batch_max_time):
to_store = []
while not queue.empty():
to_store.append(await queue.get())
try:
await store_bulk(logg_cur, to_store)
except:
break # DB down, breaking to reconnect
last_stored_time = time.time()
await asyncio.sleep(settings.batch_sleep)
Consumer #3
Code is public on gitlab
https://gitlab.skypicker.com/ondrej/faqstorer
www.orwen.org
code.kiwi.com
www.kiwi.com/jobs/
Check graphs...
Ad

Recommended

Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Neo4j
 
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp
HostedbyConfluent
 
Introduction to Traefik
Introduction to Traefik
Giovanni Toraldo
 
An overview of Amazon Athena
An overview of Amazon Athena
Julien SIMON
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
ScyllaDB
 
사회인의 휴학, Gap Year 이야기
사회인의 휴학, Gap Year 이야기
Seongyun Byeon
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Social Network Analysis power point presentation
Social Network Analysis power point presentation
Ratnesh Shah
 
Postgresql tutorial
Postgresql tutorial
Ashoka Vanjare
 
Postgresql 12 streaming replication hol
Postgresql 12 streaming replication hol
Vijay Kumar N
 
Clean architectures with fast api pycones
Clean architectures with fast api pycones
Alvaro Del Castillo
 
Neo4j Fundamentals
Neo4j Fundamentals
Max De Marzi
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
ElasticSearch Basic Introduction
ElasticSearch Basic Introduction
Mayur Rathod
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
Spark Summit
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
EDB
 
NDC 2018 디지털 신원 확인 101
NDC 2018 디지털 신원 확인 101
tcaesvk
 
Storing 16 Bytes at Scale
Storing 16 Bytes at Scale
Fabian Reinartz
 
Crash course intro to cassandra
Crash course intro to cassandra
Jon Haddad
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
[Main Session] 카프카, 데이터 플랫폼의 최강자
[Main Session] 카프카, 데이터 플랫폼의 최강자
Oracle Korea
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
Azure Synapse Analytics
Azure Synapse Analytics
WinWire Technologies Inc
 
マイクロサービスに必要な技術要素はすべてSpring Cloudにある #DO07
マイクロサービスに必要な技術要素はすべてSpring Cloudにある #DO07
Toshiaki Maki
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
Guido Schmutz
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
DataStax Academy
 
美团技术沙龙04 - Kv Tair best practise
美团技术沙龙04 - Kv Tair best practise
美团点评技术团队
 

More Related Content

What's hot (20)

Social Network Analysis power point presentation
Social Network Analysis power point presentation
Ratnesh Shah
 
Postgresql tutorial
Postgresql tutorial
Ashoka Vanjare
 
Postgresql 12 streaming replication hol
Postgresql 12 streaming replication hol
Vijay Kumar N
 
Clean architectures with fast api pycones
Clean architectures with fast api pycones
Alvaro Del Castillo
 
Neo4j Fundamentals
Neo4j Fundamentals
Max De Marzi
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
ElasticSearch Basic Introduction
ElasticSearch Basic Introduction
Mayur Rathod
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
Spark Summit
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
EDB
 
NDC 2018 디지털 신원 확인 101
NDC 2018 디지털 신원 확인 101
tcaesvk
 
Storing 16 Bytes at Scale
Storing 16 Bytes at Scale
Fabian Reinartz
 
Crash course intro to cassandra
Crash course intro to cassandra
Jon Haddad
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
[Main Session] 카프카, 데이터 플랫폼의 최강자
[Main Session] 카프카, 데이터 플랫폼의 최강자
Oracle Korea
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
Azure Synapse Analytics
Azure Synapse Analytics
WinWire Technologies Inc
 
マイクロサービスに必要な技術要素はすべてSpring Cloudにある #DO07
マイクロサービスに必要な技術要素はすべてSpring Cloudにある #DO07
Toshiaki Maki
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
Guido Schmutz
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Social Network Analysis power point presentation
Social Network Analysis power point presentation
Ratnesh Shah
 
Postgresql 12 streaming replication hol
Postgresql 12 streaming replication hol
Vijay Kumar N
 
Clean architectures with fast api pycones
Clean architectures with fast api pycones
Alvaro Del Castillo
 
Neo4j Fundamentals
Neo4j Fundamentals
Max De Marzi
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
ElasticSearch Basic Introduction
ElasticSearch Basic Introduction
Mayur Rathod
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
Spark Summit
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
EDB
 
NDC 2018 디지털 신원 확인 101
NDC 2018 디지털 신원 확인 101
tcaesvk
 
Storing 16 Bytes at Scale
Storing 16 Bytes at Scale
Fabian Reinartz
 
Crash course intro to cassandra
Crash course intro to cassandra
Jon Haddad
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
[Main Session] 카프카, 데이터 플랫폼의 최강자
[Main Session] 카프카, 데이터 플랫폼의 최강자
Oracle Korea
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
マイクロサービスに必要な技術要素はすべてSpring Cloudにある #DO07
マイクロサービスに必要な技術要素はすべてSpring Cloudにある #DO07
Toshiaki Maki
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
Guido Schmutz
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 

Viewers also liked (7)

codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
DataStax Academy
 
美团技术沙龙04 - Kv Tair best practise
美团技术沙龙04 - Kv Tair best practise
美团点评技术团队
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
guest61205606
 
Inter-Process Communication in distributed systems
Inter-Process Communication in distributed systems
Aya Mahmoud
 
Synchronization in distributed systems
Synchronization in distributed systems
SHATHAN
 
大数据时代feed架构 (ArchSummit Beijing 2014)
大数据时代feed架构 (ArchSummit Beijing 2014)
Tim Y
 
[email protected]
[email protected]
Tailor Cai
 
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
codecentric AG: CQRS and Event Sourcing Applications with Cassandra
DataStax Academy
 
美团技术沙龙04 - Kv Tair best practise
美团技术沙龙04 - Kv Tair best practise
美团点评技术团队
 
Communication And Synchronization In Distributed Systems
Communication And Synchronization In Distributed Systems
guest61205606
 
Inter-Process Communication in distributed systems
Inter-Process Communication in distributed systems
Aya Mahmoud
 
Synchronization in distributed systems
Synchronization in distributed systems
SHATHAN
 
大数据时代feed架构 (ArchSummit Beijing 2014)
大数据时代feed架构 (ArchSummit Beijing 2014)
Tim Y
 
Ad

Similar to Python queue solution with asyncio and kafka (20)

Introduction to asyncio
Introduction to asyncio
Saúl Ibarra Corretgé
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more Reactive
Jeff Smith
 
Future Decoded - Node.js per sviluppatori .NET
Future Decoded - Node.js per sviluppatori .NET
Gianluca Carucci
 
ZeroMQ: Messaging Made Simple
ZeroMQ: Messaging Made Simple
Ian Barber
 
Asynchronous web apps with the Play Framework 2.0
Asynchronous web apps with the Play Framework 2.0
Oscar Renalias
 
JS Fest 2019 Node.js Antipatterns
JS Fest 2019 Node.js Antipatterns
Timur Shemsedinov
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
Databricks
 
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Sages
 
Websockets talk at Rubyconf Uruguay 2010
Websockets talk at Rubyconf Uruguay 2010
Ismael Celis
 
TDC2018SP | Trilha Go - Processando analise genetica em background com Go
TDC2018SP | Trilha Go - Processando analise genetica em background com Go
tdc-globalcode
 
Refactoring to Macros with Clojure
Refactoring to Macros with Clojure
Dmitry Buzdin
 
Lego: A brick system build by scala
Lego: A brick system build by scala
lunfu zhong
 
Think Async: Asynchronous Patterns in NodeJS
Think Async: Asynchronous Patterns in NodeJS
Adam L Barrett
 
Writing Redis in Python with asyncio
Writing Redis in Python with asyncio
James Saryerwinnie
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
InfluxData
 
Rntb20200805
Rntb20200805
t k
 
Avoiding Callback Hell with Async.js
Avoiding Callback Hell with Async.js
cacois
 
Stream or not to Stream?

Stream or not to Stream?

Lukasz Byczynski
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Futures e abstração - QCon São Paulo 2015
Futures e abstração - QCon São Paulo 2015
Leonardo Borges
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more Reactive
Jeff Smith
 
Future Decoded - Node.js per sviluppatori .NET
Future Decoded - Node.js per sviluppatori .NET
Gianluca Carucci
 
ZeroMQ: Messaging Made Simple
ZeroMQ: Messaging Made Simple
Ian Barber
 
Asynchronous web apps with the Play Framework 2.0
Asynchronous web apps with the Play Framework 2.0
Oscar Renalias
 
JS Fest 2019 Node.js Antipatterns
JS Fest 2019 Node.js Antipatterns
Timur Shemsedinov
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
Databricks
 
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Sages
 
Websockets talk at Rubyconf Uruguay 2010
Websockets talk at Rubyconf Uruguay 2010
Ismael Celis
 
TDC2018SP | Trilha Go - Processando analise genetica em background com Go
TDC2018SP | Trilha Go - Processando analise genetica em background com Go
tdc-globalcode
 
Refactoring to Macros with Clojure
Refactoring to Macros with Clojure
Dmitry Buzdin
 
Lego: A brick system build by scala
Lego: A brick system build by scala
lunfu zhong
 
Think Async: Asynchronous Patterns in NodeJS
Think Async: Asynchronous Patterns in NodeJS
Adam L Barrett
 
Writing Redis in Python with asyncio
Writing Redis in Python with asyncio
James Saryerwinnie
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
InfluxData
 
Rntb20200805
Rntb20200805
t k
 
Avoiding Callback Hell with Async.js
Avoiding Callback Hell with Async.js
cacois
 
Stream or not to Stream?

Stream or not to Stream?

Lukasz Byczynski
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Futures e abstração - QCon São Paulo 2015
Futures e abstração - QCon São Paulo 2015
Leonardo Borges
 
Ad

Recently uploaded (20)

Camuflaje Tipos Características Militar 2025.ppt
Camuflaje Tipos Características Militar 2025.ppt
e58650738
 
MCB Internship report for the year of 2025
MCB Internship report for the year of 2025
PakistanPrinting
 
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
Ameya Patekar
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
Verweven van EM Legacy en OTL-data bij AWV
Verweven van EM Legacy en OTL-data bij AWV
jacoba18
 
BCG-Executive-Perspectives-CEOs-Guide-to-Maximizing-Value-from-AI-EP0-3July20...
BCG-Executive-Perspectives-CEOs-Guide-to-Maximizing-Value-from-AI-EP0-3July20...
benediktnetzer1
 
Untitled presentation xcvxcvxcvxcvx.pptx
Untitled presentation xcvxcvxcvxcvx.pptx
jonathan4241
 
All the DataOps, all the paradigms .
All the DataOps, all the paradigms .
Lars Albertsson
 
Prescriptive Process Monitoring Under Uncertainty and Resource Constraints: A...
Prescriptive Process Monitoring Under Uncertainty and Resource Constraints: A...
Mahmoud Shoush
 
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
taqyea
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
presentation4.pdf Intro to mcmc methodss
presentation4.pdf Intro to mcmc methodss
SergeyTsygankov6
 
Introduction for GenAI for Faculty for University.pdf
Introduction for GenAI for Faculty for University.pdf
Saeed999312
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
YEAP !NOT WHAT YOU THINK aakshdjdncnkenfj
YEAP !NOT WHAT YOU THINK aakshdjdncnkenfj
payalmistryb
 
reporting monthly for genset & Air Compressor.pptx
reporting monthly for genset & Air Compressor.pptx
dacripapanjaitan
 
FME Beyond Data Processing: Creating a Dartboard Accuracy App
FME Beyond Data Processing: Creating a Dartboard Accuracy App
jacoba18
 
Presentation by Tariq & Mohammed (1).pptx
Presentation by Tariq & Mohammed (1).pptx
AbooddSandoqaa
 
Data Warehousing and Analytics IFI Techsolutions .pptx
Data Warehousing and Analytics IFI Techsolutions .pptx
IFI Techsolutions
 
Camuflaje Tipos Características Militar 2025.ppt
Camuflaje Tipos Características Militar 2025.ppt
e58650738
 
MCB Internship report for the year of 2025
MCB Internship report for the year of 2025
PakistanPrinting
 
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
REGRESSION DIAGNOSTIC II: HETEROSCEDASTICITY
Ameya Patekar
 
Indigo_Airlines_Strategy_Presentation.pptx
Indigo_Airlines_Strategy_Presentation.pptx
mukeshpurohit991
 
Verweven van EM Legacy en OTL-data bij AWV
Verweven van EM Legacy en OTL-data bij AWV
jacoba18
 
BCG-Executive-Perspectives-CEOs-Guide-to-Maximizing-Value-from-AI-EP0-3July20...
BCG-Executive-Perspectives-CEOs-Guide-to-Maximizing-Value-from-AI-EP0-3July20...
benediktnetzer1
 
Untitled presentation xcvxcvxcvxcvx.pptx
Untitled presentation xcvxcvxcvxcvx.pptx
jonathan4241
 
All the DataOps, all the paradigms .
All the DataOps, all the paradigms .
Lars Albertsson
 
Prescriptive Process Monitoring Under Uncertainty and Resource Constraints: A...
Prescriptive Process Monitoring Under Uncertainty and Resource Constraints: A...
Mahmoud Shoush
 
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
taqyea
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
presentation4.pdf Intro to mcmc methodss
presentation4.pdf Intro to mcmc methodss
SergeyTsygankov6
 
Introduction for GenAI for Faculty for University.pdf
Introduction for GenAI for Faculty for University.pdf
Saeed999312
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
YEAP !NOT WHAT YOU THINK aakshdjdncnkenfj
YEAP !NOT WHAT YOU THINK aakshdjdncnkenfj
payalmistryb
 
reporting monthly for genset & Air Compressor.pptx
reporting monthly for genset & Air Compressor.pptx
dacripapanjaitan
 
FME Beyond Data Processing: Creating a Dartboard Accuracy App
FME Beyond Data Processing: Creating a Dartboard Accuracy App
jacoba18
 
Presentation by Tariq & Mohammed (1).pptx
Presentation by Tariq & Mohammed (1).pptx
AbooddSandoqaa
 
Data Warehousing and Analytics IFI Techsolutions .pptx
Data Warehousing and Analytics IFI Techsolutions .pptx
IFI Techsolutions
 

Python queue solution with asyncio and kafka

  • 1. Queue with asyncio and Kafka Showcase Ondřej Veselý
  • 2. What kind of data we have
  • 3. Problem: store JSON to database Just a few records per second. But ● Slow database ● Unreliable database ● Increasing traffic (20x)
  • 4. def save_data(conn, cur, ts, data): cur.execute( """INSERT INTO data (timestamp, data) VALUES (%s,%s) """, (ts, ujson.dumps(data))) conn.commit() @app.route('/store', method=['PUT', 'POST']) def logstash_route(): data = ujson.load(request.body) conn = psycopg2.connect(**config.pg_logs) t = datetime.now() with conn.cursor(cursor_factory=DictCursor) as cur: for d in data: save_data(conn, cur, t, d) conn.close() Old code
  • 5. Architecture internet Kafka producer /store Kafka consumer Kafka queue Postgres … time to kill consumer ...
  • 6. Asyncio, example import asyncio async def factorial(name, number): f = 1 for i in range(2, number+1): print("Task %s: Compute factorial(%s)..." % (name, i)) await asyncio.sleep(1) f *= i print("Task %s: factorial(%s) = %s" % (name, number, f)) loop = asyncio.get_event_loop() tasks = [ asyncio.ensure_future(factorial("A", 2)), asyncio.ensure_future(factorial("B", 3)), asyncio.ensure_future(factorial("C", 4))] loop.run_until_complete(asyncio.gather(*tasks)) loop.close() Task A: Compute factorial(2)... Task B: Compute factorial(2)... Task C: Compute factorial(2)... Task A: factorial(2) = 2 Task B: Compute factorial(3)... Task C: Compute factorial(3)... Task B: factorial(3) = 6 Task C: Compute factorial(4)... Task C: factorial(4) = 24
  • 7. What we used Apache Kafka Not ujson Concurrency - doing lots of slow things at once. No processes, no threads. Producer from aiohttp import web import json Consumer import asyncio import json from aiokafka import AIOKafkaConsumer import aiopg
  • 8. Producer #1 async def kafka_send(kafka_producer, data, topic): message = { 'data': data, 'received': str(arrow.utcnow()) } message_json_bytes = bytes(json.dumps(message), 'utf-8') await kafka_producer.send_and_wait(topic, message_json_bytes) async def handle(request): post_data = await request.json() try: await kafka_send(request.app['kafka_p'], post_data, topic=settings.topic) except: slog.exception("Kafka Error") await destroy_all() return web.Response(status=200) app = web.Application() app.router.add_route('POST', '/store', handle) app['kafka_p'] = get_kafka_producer()
  • 9. Destroying the loop async def destroy_all(): loop = asyncio.get_event_loop() for task in asyncio.Task.all_tasks(): task.cancel() await loop.stop() await loop.close() slog.debug("Exiting.") sys.exit() def get_kafka_producer(): loop = asyncio.get_event_loop() producer = AIOKafkaProducer( loop=loop, bootstrap_servers=settings.queues_urls, request_timeout_ms=settings.kafka_timeout, retry_backoff_ms=1000) loop.run_until_complete(producer.start()) return producer Getting producer Producer #2
  • 10. Consume … time to resurrect consumer ... DB connected 1. Receive data record from Kafka 2. Put it to the queue start yesno Flush queue full enough or data old enough Store data from queue to DB yesno Connect to DB start asyncio.Queue() Consumer #1
  • 11. def main(): dbs_connected = asyncio.Future() batch = asyncio.Queue(maxsize=settings.batch_max_size) asyncio.ensure_future(consume(batch, dbs_connected)) asyncio.ensure_future(start_flushing(batch, dbs_connected)) loop.run_forever() async def consume(queue, dbs_connected): await asyncio.wait_for(dbs_connected, timeout=settings.wait_for_databases) consumer = AIOKafkaConsumer( settings.topic, loop=loop, bootstrap_servers=settings.queues_urls, group_id='consumers' ) await consumer.start() async for msg in consumer: message = json.loads(msg.value.decode("utf-8")) await queue.put((message.get('received'), message.get('data'))) await consumer.stop() Consumer #2
  • 12. async def start_flushing(queue, dbs_connected): db_logg = await aiopg.create_pool(settings.logs_db_url) while True: async with db_logg.acquire() as logg_conn, logg_conn.cursor() as logg_cur: await keep_flushing(dbs_connected, logg_cur, queue) await asyncio.sleep(2) async def keep_flushing(dbs_connected, logg_cur, queue): dbs_connected.set_result(True) last_stored_time = time.time() while True: if not queue.empty() and (queue.qsize() > settings.batch_flush_size or time.time() - last_stored_time > settings.batch_max_time): to_store = [] while not queue.empty(): to_store.append(await queue.get()) try: await store_bulk(logg_cur, to_store) except: break # DB down, breaking to reconnect last_stored_time = time.time() await asyncio.sleep(settings.batch_sleep) Consumer #3
  • 13. Code is public on gitlab https://gitlab.skypicker.com/ondrej/faqstorer www.orwen.org code.kiwi.com www.kiwi.com/jobs/ Check graphs...

Editor's Notes

  • #3: Talk more about Kiwi.com Skyscanner, Momondo
  • #4: 5 TB Postgres Database
  • #7: PEP 492 -- Coroutines with async and await syntax, 09-Apr-2015 Python 3.5