Webscraping
with Asyncio
José Manuel Ortega
@jmortegac
Python conferences
https://speakerdeck.com/jmortega
Python conferences
http://jmortega.github.io/
Github repository
https://github.com/jmortega/webscraping_asyncio_2016
Agenda
▶ Webscraping python tools
▶ Requests vs aiohttp
▶ Introduction to asyncio
▶ Async client/server
▶ Building a webcrawler with asyncio
▶ Alternatives to asyncio
Webscraping
Python tools
➢ Requests
➢ Beautiful Soup 4
➢ Pyquery
➢ Webscraping
➢ Scrapy
Python tools
➢ Mechanize
➢ Robobrowser
➢ Selenium
Requests http://docs.python-requests.org/en/latest
Web scraping with Python
1. Download the webpage with an HTTP module (requests, urllib, aiohttp)
2. Parse the page with BeautifulSoup/lxml
3. Select elements with regular expressions, XPath or CSS selectors
4. Store results in a database, CSV or JSON
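A minimal sketch of these four steps with requests and BeautifulSoup (the target URL and output filename are placeholders):

import csv
import requests
from bs4 import BeautifulSoup

# 1. Download the webpage (placeholder URL)
response = requests.get('http://example.com')
# 2. Parse the page
soup = BeautifulSoup(response.text, 'html.parser')
# 3. Select elements (here: every link and its text)
rows = [(a.get_text(strip=True), a.get('href'))
        for a in soup.find_all('a', href=True)]
# 4. Store the results as CSV
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text', 'href'])
    writer.writerows(rows)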
BeautifulSoup
BeautifulSoup
▶ soup = BeautifulSoup(html_doc, 'html.parser')
▶ Print all: print(soup.prettify())
▶ Print text: print(soup.get_text())
from bs4 import BeautifulSoup
BeautifulSoup functions
▪ find_all('a') → Returns all links
▪ find('title') → Returns the first <title> element
▪ get('href') → Returns the value of the href attribute
▪ (element).text → Returns the text inside an element
for link in soup.find_all('a'):
    print(link.get('href'))
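Putting these functions together, a small self-contained example over an inline HTML document (the document itself is illustrative):

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>PyCon Ireland</title></head>
<body>
  <a href="http://python.ie">Python Ireland</a>
  <a href="http://pycon.org">PyCon</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('title').text)        # first <title> element's text
for link in soup.find_all('a'):       # all links
    print(link.get('href'))           # value of each href attribute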
External/internal links
External/internal links
http://python.ie/pycon-2016/
BeautifulSoup PyCon
BeautifulSoup PyCon Output
Parsers Comparison
PyQuery
PyQuery
PyQuery output
Spiders / crawlers
▶ A Web crawler is an Internet bot that
systematically browses the World Wide Web,
typically for the purpose of Web indexing. A
Web crawler may also be called a Web
spider.
https://en.wikipedia.org/wiki/Web_crawler
Spiders / crawlers
scrapinghub.com
Scrapy
https://pypi.python.org/pypi/Scrapy/1.1.2
Scrapy
▶ Uses a mechanism based on XPath
expressions called XPath Selectors
▶ Uses the lxml parser to find elements
▶ Uses Twisted for asynchronous
operations
Scrapy advantages
▶ Faster than mechanize because it
uses Twisted for asynchronous operations.
▶ Scrapy has better support for HTML
parsing.
▶ Scrapy has better support for unicode
characters, redirections, gzipped
responses and encodings.
▶ You can export the extracted data directly
to JSON, XML and CSV.
Export data
▶ $ scrapy crawl <spider_name>
▶ $ scrapy crawl <spider_name> -o items.json -t json
▶ $ scrapy crawl <spider_name> -o items.csv -t csv
▶ $ scrapy crawl <spider_name> -o items.xml -t xml
Scrapy concurrency
The concurrency problem
▶ Different approaches:
▶ Multiple processes
▶ Threads
▶ Separate distributed machines
▶ Asynchronous programming(event
loop)
Requests problems
▶ Requests operations block the
main thread
▶ Execution pauses until the operation completes
▶ We need one thread per request if
we want non-blocking operations
Threads problems
▶ Creation overhead
▶ Stack size
▶ Context switches
▶ Synchronization
Solution
▶ DON'T USE THREADS
▶ USE ONE THREAD
▶ + EVENT LOOP
New concepts
▶ Event loop
▶ Async
▶ Await
▶ Futures
▶ Coroutines
▶ Tasks
▶ Executors
Event loop implementations
▶ Asyncio
▶ https://docs.python.org/3.4/library/asyncio.html
▶ Tornado web server
▶ http://www.tornadoweb.org/en/stable
▶ Twisted
▶ https://twistedmatrix.com
▶ Gevent
▶ http://www.gevent.org
Asyncio def.
Asyncio
▶ Python >= 3.3
▶ Event-loop framework
▶ Asynchronous I/O
▶ Non-blocking approach with sockets
▶ All requests in one thread
▶ Event-driven switching
▶ aiohttp module for making requests
asynchronously
Asyncio
▶ Interoperability with other frameworks
Requests vs aiohttp
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def hello():
    async with ClientSession() as session:
        async with session.get("http://httpbin.org/headers") as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
loop.run_until_complete(hello())

import requests

def hello():
    return requests.get("http://httpbin.org/get")

print(hello())
Event Loop
▶ An event loop allows us to write asynchronous
code using callbacks or coroutines.
▶ An event loop works like a task switcher, just the way
operating systems switch between active tasks on the
CPU.
▶ The idea is that we have an event loop running until all
scheduled tasks are completed.
▶ Futures and tasks are created through the event loop.
Event Loop
▶ An event loop is used to orchestrate the
execution of the coroutines.
▶ loop = asyncio.get_event_loop()
▶ loop.run_until_complete(coroutines, futures)
▶ loop.run_forever()
▶ loop.stop()
Starting Event Loop
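The original slide shows a screenshot; a minimal sketch of starting an event loop, assuming Python 3.5 syntax:

import asyncio

async def greet():
    await asyncio.sleep(1)           # yield control to the loop for one second
    print('Hello from the event loop')

loop = asyncio.get_event_loop()      # obtain the default event loop
loop.run_until_complete(greet())     # run until the coroutine finishes
loop.close()                         # release the loop's resources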
Coroutines
▶ Coroutines are functions that allow for
multitasking without requiring multiple
threads or processes.
▶ Coroutines are like functions, but they can be
suspended and resumed at certain points in the
code.
▶ Coroutines allow writing asynchronous code that
combines the efficiency of callbacks with the
classic good looks of multithreaded code.
Coroutines 3.4 vs 3.5
import asyncio

@asyncio.coroutine
def fetch(self, url):
    response = yield from self.session.get(url)
    body = yield from response.read()

import asyncio

async def fetch(self, url):
    response = await self.session.get(url)
    body = await response.read()
Coroutines in event loop
#!/usr/local/bin/python3.5
import asyncio
import aiohttp

async def get_page(url):
    response = await aiohttp.request('GET', url)
    body = await response.read()
    print(body)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([get_page('http://python.org'),
                                      get_page('http://pycon.org')]))
Requests in event loop
async def getpage_with_requests(url):
    return await loop.run_in_executor(None, requests.get, url)

# equivalent methods
async def getpage_with_aiohttp(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()
Tasks
▶ The asyncio.Task class is a subclass of
asyncio.Future that encapsulates and manages
coroutines.
▶ Tasks allow independently scheduled coroutines to run
concurrently with other tasks on the same event
loop.
▶ When a coroutine is wrapped in a task, the
task is connected to the event loop.
Tasks
Tasks
Tasks
Tasks execution
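The "Tasks" slides above are code screenshots; a minimal sketch of wrapping coroutines in tasks along those lines, assuming Python 3.5:

import asyncio

async def worker(name, seconds):
    await asyncio.sleep(seconds)                 # simulate I/O work
    print('{} finished after {}s'.format(name, seconds))
    return name

loop = asyncio.get_event_loop()
# Wrapping coroutines in tasks schedules them on the event loop
tasks = [asyncio.ensure_future(worker('task1', 2)),
         asyncio.ensure_future(worker('task2', 1))]
loop.run_until_complete(asyncio.wait(tasks))     # both run concurrently: ~2s total
for task in tasks:
    print(task.result())                         # results are kept on the Task/Future
loop.close()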
Futures
▶ To manage a Future object in asyncio, we
must declare the following:
▶ import asyncio
▶ future = asyncio.Future()
▶ https://docs.python.org/3/library/asyncio-task.html#future
▶ https://docs.python.org/3/library/concurrent.futures.html
Futures
▶ The asyncio.Future class is essentially a
promise of a result.
▶ A Future returns its result when it becomes
available and, once it receives a result,
passes it along to all the registered
callbacks.
▶ Each future is a task to be executed in the
event loop.
Futures
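The slide above is a screenshot; a minimal sketch of creating a Future, resolving it from the loop, and awaiting it:

import asyncio

async def consumer(future):
    result = await future                 # suspends until the future is resolved
    print('Got:', result)

loop = asyncio.get_event_loop()
future = asyncio.Future()                 # the promise of a result
# Resolve the future on the next loop iteration
loop.call_soon(future.set_result, 'page content')
loop.run_until_complete(consumer(future))
loop.close()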
Semaphores
▶ Adding synchronization
▶ Limiting the number of concurrent requests.
▶ The argument indicates the number of
simultaneous requests we want to allow.
▶ sem = asyncio.Semaphore(5)
with (await sem):
    page = await get(url, compress=True)
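A fuller sketch of semaphore-limited fetching with aiohttp (the URL list is illustrative; `async with sem:` is the equivalent modern form of the snippet above):

import asyncio
from aiohttp import ClientSession

sem = asyncio.Semaphore(5)                    # at most 5 requests in flight

async def get(url, session):
    async with sem:                           # acquire a slot before requesting
        async with session.get(url) as response:
            return await response.read()

async def main():
    async with ClientSession() as session:
        urls = ['http://httpbin.org/get'] * 20
        pages = await asyncio.gather(*[get(url, session) for url in urls])
        print(len(pages), 'pages downloaded')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())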
Async Client / server
▶ asyncio.start_server
▶ server =
asyncio.start_server(handle_connection, host=HOST, port=PORT)
Async Client / server
▶ asyncio.open_connection
Async Client / server
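These slides show screenshots; a minimal echo client/server sketch along those lines, with HOST and PORT as local assumptions:

import asyncio

HOST, PORT = '127.0.0.1', 8888

async def handle_connection(reader, writer):
    data = await reader.read(100)          # read up to 100 bytes
    writer.write(data)                     # echo them back
    await writer.drain()
    writer.close()

async def client(message):
    reader, writer = await asyncio.open_connection(host=HOST, port=PORT)
    writer.write(message.encode())
    echoed = await reader.read(100)
    print('Received:', echoed.decode())
    writer.close()

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    asyncio.start_server(handle_connection, host=HOST, port=PORT))
loop.run_until_complete(client('hello asyncio'))
server.close()
loop.run_until_complete(server.wait_closed())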
Async Web crawler
Async Web crawler
▶ Send asynchronous requests to all the links
on a web page and add the responses to a
queue to be processed as we go.
▶ Coroutines allow running independent tasks and
processing their results in 3 ways:
▶ Using asyncio.as_completed → by processing
the results as they come in.
▶ Using asyncio.gather → only once they have all
finished loading.
▶ Using asyncio.ensure_future
Async Web crawler
import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_results_as_come_in():
    coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
    for coroutine in asyncio.as_completed(coroutines):
        url, wait_time = yield from coroutine
        print('Coroutine for {} is done'.format(url))

def main():
    loop = asyncio.get_event_loop()
    print("Process results as they come in:")
    loop.run_until_complete(process_results_as_come_in())

if __name__ == '__main__':
    main()
asyncio.as_completed
Async Web crawler execution
Async Web crawler
import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_once_everything_ready():
    coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
    results = yield from asyncio.gather(*coroutines)
    print(results)

def main():
    loop = asyncio.get_event_loop()
    print("Process results once they are all ready:")
    loop.run_until_complete(process_once_everything_ready())

if __name__ == '__main__':
    main()
asyncio.gather
asyncio.gather
From the Python documentation, this is what asyncio.gather does:
asyncio.gather(*coros_or_futures, loop=None, return_exceptions=False)
Return a future aggregating results from the given coroutine objects or futures.
All futures must share the same event loop. If all the tasks are done
successfully, the returned future's result is the list of results (in the
order of the original sequence, not necessarily the order of results arrival).
If return_exceptions is True, exceptions in the tasks are treated the same as
successful results, and gathered in the result list; otherwise, the first
raised exception will be immediately propagated to the returned future.
Async Web crawler
import asyncio
import random

@asyncio.coroutine
def get_url(url):
    wait_time = random.randint(1, 4)
    yield from asyncio.sleep(wait_time)
    print('Done: URL {} took {}s to get!'.format(url, wait_time))
    return url, wait_time

@asyncio.coroutine
def process_ensure_future():
    tasks = [asyncio.ensure_future(get_url(url))
             for url in ['URL1', 'URL2', 'URL3']]
    results = yield from asyncio.wait(tasks)
    print(results)

def main():
    loop = asyncio.get_event_loop()
    print("Process ensure future:")
    loop.run_until_complete(process_ensure_future())

if __name__ == '__main__':
    main()
asyncio.ensure_future
Async Web crawler execution
Async Web downloader
Async Web downloader faster
Async Web downloader
▶ With get_partial_content
▶ With download_coroutine
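These downloader slides are code screenshots; download_coroutine and get_partial_content live in the talk's repository, so this is only a hedged sketch of a streaming download coroutine in that spirit (the filename handling and URL are illustrative):

import asyncio
import os
import aiohttp

async def download_coroutine(session, url):
    # Derive a local filename from the URL (illustrative)
    filename = os.path.basename(url) or 'index.html'
    async with session.get(url) as response:
        with open(filename, 'wb') as f:
            while True:
                chunk = await response.content.read(1024)  # stream in chunks
                if not chunk:
                    break
                f.write(chunk)
    return filename

async def main(urls):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[download_coroutine(session, u) for u in urls])

loop = asyncio.get_event_loop()
loop.run_until_complete(main(['http://httpbin.org/image/png']))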
Async Extracting links with regular expressions
Async Extracting links with bs4
Async Extracting links execution
▶ With bs4
▶ With regex
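The extraction slides are also screenshots; a hedged sketch of both approaches over a downloaded page (the href pattern is deliberately simple, not robust):

import re
import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Simple pattern for absolute href values (illustrative only)
HREF_RE = re.compile(r'href="(https?://[^"]+)"')

def links_with_regex(html):
    return HREF_RE.findall(html)

def links_with_bs4(html):
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get('href') for a in soup.find_all('a', href=True)]

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

loop = asyncio.get_event_loop()
html = loop.run_until_complete(fetch('http://python.org'))
print(len(links_with_regex(html)), 'links via regex')
print(len(links_with_bs4(html)), 'links via bs4')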
Alternatives to asyncio
▶ ThreadPoolExecutor
▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
▶ ProcessPoolExecutor
▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
▶ Parallel python
▶ http://www.parallelpython.com
Parallel python
▶ SMP (symmetric multiprocessing)
architecture with multiple cores in the same
machine
▶ Distribute tasks over multiple machines
▶ Cluster
ProcessPoolExecutor
from multiprocessing import cpu_count
number_of_cpus = cpu_count()
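The rest of this slide is a screenshot; a hedged sketch of combining ProcessPoolExecutor with the event loop via run_in_executor (the workload is illustrative):

import asyncio
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

def cpu_bound(n):
    return sum(i * i for i in range(n))   # CPU-bound work, a bad fit for the loop

async def main(loop, executor):
    # Offload blocking work to worker processes without blocking the loop
    results = await asyncio.gather(
        *[loop.run_in_executor(executor, cpu_bound, n)
          for n in (10**6, 2 * 10**6, 3 * 10**6)])
    print(results)

if __name__ == '__main__':               # required for multiprocessing on some platforms
    executor = ProcessPoolExecutor(max_workers=cpu_count())
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop, executor))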
References
▶ http://www.crummy.com/software/BeautifulSoup
▶ http://scrapy.org
▶ http://docs.webscraping.com
▶ https://github.com/KeepSafe/aiohttp
▶ http://aiohttp.readthedocs.io/en/stable/
▶ https://docs.python.org/3.4/library/asyncio.html
▶ https://github.com/REMitchell/python-scraping
Books
Books
Thank you!
@jmortegac
http://speakerdeck.com/jmortega
http://github.com/jmortega