Maxym Kharchenko & m@ team
Writing efficient Python code with
pipelines and generators
Agenda
Python is all about streaming (a.k.a. iteration)
Streaming in Python
# Lists
db_list = ['db1', 'db2', 'db3']
for db in db_list:
    print db

# Dictionaries
host_cpu = {'avg': 2.34, 'p99': 98.78, 'min': 0.01}
for stat in host_cpu:
    print "%s = %s" % (stat, host_cpu[stat])

# Files, strings
file = open("/etc/oratab")
for line in file:
    for word in line.split(" "):
        print word

# Whatever is coming out of get_things()
for thing in get_things():
    print thing
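The last loop works because for iterates over anything that follows the iteration protocol. The deck never defines get_things(); as a hedged illustration, a hypothetical body could be any function that yields:

def get_things():
    """ Hypothetical source: any function containing yield is iterable """
    for thing in ('db1', 'db2', 'db3'):
        yield thing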
Quick example: Reading records from a file
def print_databases():
    """ Read /etc/oratab and print database names """
    file = open("/etc/oratab", 'r')
    while True:
        line = file.readline()  # Get next line
        # Check for EOF: readline() returns '' once the file is exhausted
        if len(line) == 0 and not line.endswith('\n'):
            break
        # Parse oratab line into components
        db_line = line.strip()
        db_info_array = db_line.split(':')
        db_name = db_info_array[0]
        print db_name
    file.close()
Reading records from a file: with “streaming”
def print_databases():
    """ Read /etc/oratab and print database names """
    with open("/etc/oratab") as file:
        for line in file:
            print line.strip().split(':')[0]
Style matters!
Ok, let’s do something useful with streaming
• We have a bunch of ORACLE listener logs
• Let’s parse them for “client IPs”:

21-AUG-2015 21:29:56 * (CONNECT_DATA=(SID=orcl)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.107.137.91)(PORT=43105)) * establish * orcl * 0

• And find where the clients are coming from
First attempt at listener log parser
def parse_listener_log(log_name):
    """ Parse listener log and return clients """
    client_hosts = []
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = <regex magic>
            if host_match:
                host = <regex magic>
                client_hosts.append(host)
    return client_hosts
First attempt at listener log parser: the same code has two problems
• MEMORY WASTE! client_hosts stores all results until the return.
• BLOCKING! The function does NOT return until the entire log is processed.
Generators for efficiency
def parse_listener_log(log_name):
    """ Parse listener log and yield clients """
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = <regex magic>
            if host_match:
                host = <regex magic>
                yield host  # <-- Add this!
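For reference, a runnable sketch of the same generator, with the <regex magic> filled in using the HOST pattern compiled later in the deck (the exact escaping is an assumption, since backslashes were lost in the slide export):

import re

HOST_REGEX = re.compile(r'\(HOST=(\S+)\)\(PORT=')  # pattern from the pipelining slides

def parse_listener_log(log_name):
    """ GENERATOR: parse listener log and yield client hosts """
    with open(log_name) as listener_log:
        for line in listener_log:
            host_match = HOST_REGEX.search(line)
            if host_match:
                yield host_match.group(1)  # the captured HOST= value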
Generators in a nutshell
def test_generator():
    """ Test generator """
    print "ENTER()"
    for i in range(5):
        print "yield i=%d" % i
        yield i
    print "EXIT()"

# MAIN
for i in test_generator():
    print "RET=%d" % i

Output:
ENTER()
yield i=0
RET=0
yield i=1
RET=1
yield i=2
RET=2
yield i=3
RET=3
yield i=4
RET=4
EXIT()
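Note that nothing runs until the first value is requested: the for loop is just repeated calls to next(). A hedged illustration of the same mechanics (standard Python behavior, not shown on the slide):

gen = test_generator()  # nothing is printed yet: the body has not started
print next(gen)         # prints ENTER(), then yield i=0, then 0
print next(gen)         # resumes after the yield: prints yield i=1, then 1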
Nongenerators in a nutshell
def test_nongenerator():
    """ Test non-generator """
    result = []
    print "ENTER()"
    for i in range(5):
        print "add i=%d" % i
        result.append(i)
    print "EXIT()"
    return result

# MAIN
for i in test_nongenerator():
    print "RET=%d" % i

Output:
ENTER()
add i=0
add i=1
add i=2
add i=3
add i=4
EXIT()
RET=0
RET=1
RET=2
RET=3
RET=4
Generators to Pipelines
Chaining generators, stage by stage (record counts and first-result times from the slide's diagram):

Generator (extractor), 1 second per record
  → 100,000 records out; 1st record after 1 second
Generator (filter: passes 1/2), 2 seconds per record
  → 50,000 records out; 1st record after 5 seconds
Generator (mapper), 5 seconds per record
  → 50,000 records out; 1st record after 10 seconds
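A hedged sketch of the same idea, with time.sleep() standing in for per-record work (stage costs scaled down from the slide's numbers; names are illustrative):

import time

def extractor(n):
    for i in range(n):
        time.sleep(0.01)  # '1 second' of extraction work, scaled down
        yield i

def halving_filter(records):
    for r in records:
        time.sleep(0.02)  # '2 seconds' of filtering work, scaled down
        if r % 2 == 0:    # pass roughly 1/2 of the records
            yield r

def mapper(records):
    for r in records:
        time.sleep(0.05)  # '5 seconds' of mapping work, scaled down
        yield r * r

start = time.time()
for out in mapper(halving_filter(extractor(100))):
    print "first result after %.2f seconds" % (time.time() - start)
    break  # the first result arrives long before the whole pipeline finishes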
Generator pipelining in Python
file_handles = open_files(LISTENER_LOGS)
log_lines = extract_lines(file_handles)
client_hosts = extract_client_ips(log_lines)

for host in client_hosts:
    print host
File names → [Open files] → File handles → [Extract lines] → File lines → [Extract IPs] → Client IPs
Generators for simplicity
def open_files(file_names):
    """ GENERATOR: file name -> file handle """
    for file in file_names:
        yield open(file)
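One caveat the slide does not raise: handles yielded this way are never explicitly closed. A hedged variant that closes each file once the pipeline moves past it:

def open_files(file_names):
    """ GENERATOR: file name -> file handle, closed after use """
    for name in file_names:
        with open(name) as handle:
            yield handle  # closed when the generator resumes or is closed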
Generators for simplicity
def extract_lines(file_handles):
    """ GENERATOR: File handles -> file lines
        Similar to UNIX: cat file1, file2, …
    """
    for file in file_handles:
        for line in file:
            yield line
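The standard library expresses the same flattening pattern; an equivalent one-liner using itertools (not shown in the deck):

import itertools

def extract_lines(file_handles):
    """ File handles -> file lines, as an itertools iterator """
    return itertools.chain.from_iterable(file_handles)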
Generators for simplicity
def extract_client_ips(lines):
    """ GENERATOR: Extract client host """
    host_regex = re.compile(r'\(HOST=(\S+)\)\(PORT=')  # matches (HOST=...)(PORT=
    for line in lines:
        line_match = host_regex.search(line)
        if line_match:
            yield line_match.groups(0)[0]  # first captured group: the HOST= value
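Feeding the sample listener-log line from earlier through this generator (a quick check; assumes extract_client_ips and import re as above):

sample = ('21-AUG-2015 21:29:56 * (CONNECT_DATA=(SID=orcl)(CID=(PROGRAM=)'
          '(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)'
          '(HOST=10.107.137.91)(PORT=43105)) * establish * orcl * 0')

for host in extract_client_ips([sample]):
    print host  # -> 10.107.137.91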
Developer’s bliss:
simple input, simple output, trivial function body
Then, pipeline the results
But, really …
The real pipeline grows extra stages, and each one is still a trivial generator (a sketch of one filter stage follows below):

[Locate files] → File names → [Open files] → File handles → [Extract lines] → File lines
→ [Filter db=orcl] → db=orcl lines → [Filter proto=TCP] → db=orcl & proto=TCP lines
→ [Extract clients] → Client IPs → [IP -> host name] → Client hosts
→ [Db writer] and [Text writer] (the same Client hosts stream feeds both)
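None of these filter stages appear as code in the deck; as a hedged sketch, one of them is just another generator (the SID= marker is an assumption based on the sample log line):

def filter_db(lines, db_name='orcl'):
    """ GENERATOR: pass only lines for the given database """
    needle = '(SID=%s)' % db_name  # assumed marker, per the sample log line
    for line in lines:
        if needle in line:
            yield line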
Why generators?
• Simple functions that are easy to write and understand
• Non-blocking operations:
  • TOTAL execution time: faster
  • FIRST RESULTS: much faster
• Efficient use of memory
• Potential for parallelization and ASYNC processing
Special thanks to David Beazley …
• For this: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf
Thank you!
Editor’s Notes
• Style matters!: doing things the “pythonian” way
• Generator pipelining in Python: “all” results vs 1st results
• Special thanks to David Beazley: the best “generator” presentation that I’ve seen