Web scraping using Python
Paul Schreiber
paul.schreiber@fivethirtyeight.com
paulschreiber@gmail.com
@paulschreiber
Web Scraping with Python
Fetching pages
➜ urllib

➜ urllib2

➜ urllib (Python 3)

➜ requests
import requests
page = requests.get('http://www.ire.org/')
Fetch one page
import requests
base_url = 'http://www.fishing.ocean/p/%s'
for i in range(0, 10):
  url = base_url % i
  page = requests.get(url)
Fetch a set of results
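The loop above builds one URL per page number. A minimal sketch of the same idea that collects the URLs first, so the list can be inspected before any request is sent. The base URL is the placeholder from the slide, not a real site.

```python
# Placeholder address from the slide; it is not a real site.
base_url = 'http://www.fishing.ocean/p/%s'

# Build the ten page URLs up front so they can be checked before fetching.
urls = [base_url % i for i in range(0, 10)]

print(urls[0])
print(urls[-1])
```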
import requests
page = requests.get('http://www.ire.org/')
with open("index.html", "wb") as html:
  html.write(page.content)
Download a file
Parsing data
➜ Regular Expressions

➜ CSS Selectors

➜ XPath

➜ Object Hierarchy

➜ Object Searching
<html>
  <head><title>Green Eggs and Ham</title></head>
  <body>
   <ol>
      <li>Green Eggs</li>
      <li>Ham</li>
   </ol>
  </body>
</html>
import re
item_re = re.compile("<li[^>]*>([^<]+?)</li>")
item_re.findall(html)
Regular Expressions (DON’T DO THIS!)
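One way to see why the regular expression is fragile: a sketch, under assumptions, that runs the same pattern against a slightly richer list and compares it with an event-based parser. The standard library's `html.parser` is swapped in here so the example has no dependencies; the extra `class` attribute and `<b>` tag are hypothetical additions to the sample HTML.

```python
import re
from html.parser import HTMLParser

# Hypothetical variation on the sample list: the second item has nested markup.
html = """<ol>
  <li>Green Eggs</li>
  <li class="meat"><b>Ham</b></li>
</ol>"""

# The pattern from the slide only matches items whose content is plain text.
item_re = re.compile(r"<li[^>]*>([^<]+?)</li>")
regex_items = item_re.findall(html)


class ListItemParser(HTMLParser):
    """Collects the text inside each <li>, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.items[-1] += data


parser = ListItemParser()
parser.feed(html)
parsed_items = [item.strip() for item in parser.items]

print(regex_items)
print(parsed_items)
```

The regular expression silently skips the item with nested markup, while the parser still sees both.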
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
[s.text for s in soup.select("li")]
CSS Selectors
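from lxml import etree
from io import StringIO
# parse with lxml's HTML parser so real-world (non-XML) pages also load
tree = etree.parse(StringIO(html), etree.HTMLParser())
# '//' searches the whole tree; '/ol/li' would only match a root <ol>
[s.text for s in tree.xpath('//ol/li')]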
XPath
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.ire.org/')
soup = BeautifulSoup(page.content)
# soup.title is a Tag; .string gives the text inside it
print("The title is " + soup.title.string)
Object Hierarchy
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
[s.text for s in soup.find_all("li")]
Object Searching
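To compare the approaches side by side, here is a small sketch that runs CSS selection, object searching, and object hierarchy against the sample page. It assumes the `beautifulsoup4` package is installed and names the built-in `html.parser` explicitly, which is a choice made for this example rather than something the slides specify.

```python
from bs4 import BeautifulSoup

html = """<html>
  <head><title>Green Eggs and Ham</title></head>
  <body>
    <ol>
      <li>Green Eggs</li>
      <li>Ham</li>
    </ol>
  </body>
</html>"""

# Naming the parser avoids depending on whichever one happens to be installed.
soup = BeautifulSoup(html, "html.parser")

by_selector = [s.text for s in soup.select("ol > li")]  # CSS selector
by_search = [s.text for s in soup.find_all("li")]       # object searching
title = soup.title.string                               # object hierarchy

print(by_selector, by_search, title)
```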
import csv
# Python 3: open in text mode with newline='' (Python 2 used 'wb')
with open('shoes.csv', 'w', newline='') as csvfile:
    shoe_writer = csv.writer(csvfile)
    for line in shoe_list:
        shoe_writer.writerow(line)
Write CSV
output = open("shoes.txt", "w")
for row in data:
    output.write("\t".join(row) + "\n")
output.close()
Write TSV
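Joining with tabs by hand breaks if a field itself contains a tab. A hedged alternative is to let the `csv` module write tab-separated output, since it quotes such fields. The rows below are hypothetical stand-ins for scraped data, and the text is written to an in-memory buffer instead of `shoes.txt` so the sketch has no side effects.

```python
import csv
import io

# Hypothetical rows standing in for scraped data.
data = [
    ["sneaker", "red", "10"],
    ["boot", "dark\tbrown", "9"],  # a tab inside a field
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(data)
tsv_text = buffer.getvalue()

# Reading the text back with the same delimiter recovers the original rows.
rows_back = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```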
import json
# Python 3: json.dump writes text, so open with 'w' rather than 'wb'
with open('shoes.json', 'w') as outfile:
    json.dump(my_json, outfile)
Write JSON
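A small sketch of two `json` options that are often useful for scraped text: `indent` for readable output and `ensure_ascii=False` to keep accented characters as-is. The record below is a hypothetical example, and `json.dumps` is used so the result can be checked without touching the filesystem.

```python
import json

# Hypothetical scraped record containing a non-ASCII character.
my_json = [{"name": "Cornet, Aliz\u00e9"}]

# indent makes the file readable; ensure_ascii=False keeps the accent intact.
text = json.dumps(my_json, ensure_ascii=False, indent=2)

# Loading the text back gives the same data structure.
restored = json.loads(text)
```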
workon web_scraping
WTA Rankings
EXAMPLE 1
WTA Rankings
➜ fetch page

➜ parse cells

➜ write to file
import csv
import requests
from bs4 import BeautifulSoup
url = 'http://www.wtatennis.com/singles-rankings'
page = requests.get(url)
soup = BeautifulSoup(page.content)
WTA Rankings
soup.select("#myTable td")
WTA Rankings
[s for s in soup.select("#myTable td")]
WTA Rankings
[s.get_text() for s in soup.select("#myTable td")]
WTA Rankings
[s.get_text().strip() for s in soup.select("#myTable td")]
WTA Rankings
cells = [s.get_text().strip() for s in soup.select("#myTable td")]
WTA Rankings
for i in range(0, 3):
  print(cells[i*7:i*7+7])
WTA Rankings
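The slicing above prints the first three rows of seven cells. A minimal sketch that generalizes it to every row, assuming (as the slide does) that each ranking row has exactly seven cells. The `cells` list here is hypothetical filler, not real ranking data.

```python
# Hypothetical flattened cells: two table rows of seven columns each.
cells = [str(n) for n in range(14)]
columns_per_row = 7  # the slide assumes seven cells per ranking row

# Step through the flat list seven cells at a time to rebuild the rows.
rows = [cells[i:i + columns_per_row]
        for i in range(0, len(cells), columns_per_row)]

for row in rows:
    print(row)
```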
# Python 3: open in text mode with newline='' (Python 2 used 'wb')
with open('wta.csv', 'w', newline='') as csvfile:
  wtawriter = csv.writer(csvfile)
  for i in range(0, 3):
    wtawriter.writerow(cells[i*7:i*7+7])
WTA Rankings
NY Election Boards
EXAMPLE 2
NY Election Boards
➜ list counties

➜ loop over counties

➜ fetch county pages

➜ parse county data

➜ write to file
import requests
from bs4 import BeautifulSoup
url = 'http://www.elections.ny.gov/CountyBoards.html'
page = requests.get(url)
soup = BeautifulSoup(page.content)
NY Election Boards
soup.select("area")
NY Election Boards
counties = soup.select("area")
county_urls = [u.get('href') for u in counties]
NY Election Boards
counties = soup.select("area")
county_urls = [u.get('href') for u in counties]
county_urls = county_urls[1:]
county_urls = list(set(county_urls))
NY Election Boards
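`list(set(...))` removes duplicates but does not keep the original order. A sketch of an order-preserving variant that also resolves relative links, in case the page uses them. The `hrefs` values are hypothetical; the real attributes on the page may already be absolute.

```python
from urllib.parse import urljoin

page_url = 'http://www.elections.ny.gov/CountyBoards.html'

# Hypothetical href values, as they might come back from soup.select("area").
hrefs = [
    'county/albany.html',
    'county/bronx.html',
    'county/albany.html',             # duplicate
    'http://example.org/kings.html',  # already absolute
    None,                             # an <area> with no href
]

# Drop missing hrefs and resolve relative links against the page URL.
absolute = [urljoin(page_url, h) for h in hrefs if h]

# dict.fromkeys keeps the first occurrence of each URL, in order.
county_urls = list(dict.fromkeys(absolute))
```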
data = []
for url in county_urls[0:3]:
    print("Fetching %s" % url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content)
    lines = [s for s in soup.select("th")[0].strings]
    data.append(lines)
NY Election Boards
output = open("boards.txt", "w")
for row in data:
    output.write("\t".join(row) + "\n")
output.close()
NY Election Boards
ACEC Members
EXAMPLE 3
ACEC Members
➜ loop over pages

➜ fetch result table

➜ parse name, id, location

➜ write to file
import requests
import json
from bs4 import BeautifulSoup
base_url = 'http://www.acec.ca/about_acec/search_member_firms/business_sector_search.html/search/business/page/%s'
ACEC Members
url = base_url % 1
page = requests.get(url)
soup = BeautifulSoup(page.content)
soup.find(id='resulttable')
ACEC Members
url = base_url % 1
page = requests.get(url)
soup = BeautifulSoup(page.content)
table = soup.find(id='resulttable')
rows = table.find_all('tr')
ACEC Members
url = base_url % 1
page = requests.get(url)
soup = BeautifulSoup(page.content)
table = soup.find(id='resulttable')
rows = table.find_all('tr')
columns = rows[0].find_all('td')
ACEC Members
columns = rows[0].find_all('td')
company_data = {
  'name': columns[1].a.text,
  'id': columns[1].a['href'].split('/')[-1],
  'location': columns[2].text
}
ACEC Members
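The `id` field comes from the last path segment of the link. A tiny sketch of that step on its own, using a hypothetical `href`; the real link format on the ACEC site may differ.

```python
# Hypothetical href; the real link format on the ACEC site may differ.
href = '/about_acec/search_member_firms/member/1234'

# Splitting on '/' and taking the last piece yields the trailing identifier.
firm_id = href.split('/')[-1]

# A trailing slash would leave an empty last piece, so strip it first.
href_with_slash = href + '/'
naive_id = href_with_slash.split('/')[-1]
safe_id = href_with_slash.rstrip('/').split('/')[-1]

print(firm_id, repr(naive_id), safe_id)
```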
start_page = 1
end_page = 2
result = []
ACEC Members
for i in range(start_page, end_page + 1):
    url = base_url % i
    print("Fetching %s" % url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content)
    table = soup.find(id='resulttable')
    rows = table.find_all('tr')
ACEC Members
    # continues inside the page loop from the previous slide
    for r in rows:
        columns = r.find_all('td')
        company_data = {
            'name': columns[1].a.text,
            'id': columns[1].a['href'].split('/')[-1],
            'location': columns[2].text
        }
        result.append(company_data)
ACEC Members
with open('acec.json', 'w') as outfile:
    json.dump(result, outfile)
ACEC Members
Python Tools
➜ lxml

➜ scrapy

➜ MechanicalSoup

➜ RoboBrowser

➜ pyQuery
Ruby Tools
➜ nokogiri

➜ Mechanize
Not coding? Scrape with:
➜ import.io

➜ Kimono

➜ copy & paste

➜ PDFTables

➜ Tabula
page = requests.get(url, auth=('drseuss', 'hamsandwich'))
Basic Authentication
page = requests.get(url, verify=False)
Self-signed certificates
page = requests.get(url, verify='/etc/ssl/certs.pem')
Specify Certificate Bundle
requests.exceptions.SSLError: hostname 'shrub.ca' doesn't match either of 'www.arthurlaw.ca', 'arthurlaw.ca'
$ pip install pyopenssl
$ pip install ndg-httpsclient
$ pip install pyasn1
Server Name Indication (SNI)
UnicodeEncodeError: 'ascii', u'Cornet, Aliz\xe9', 12, 13, 'ordinal not in range(128)'
Fix
myvar.encode("utf-8")
Unicode
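The error above is in Python 2 form (note the `u'...'` literal). A minimal Python 3 sketch of the same failure and fix, using the name from the error message: encoding to ASCII fails on the accented character, while UTF-8 handles it and round-trips cleanly.

```python
name = 'Cornet, Aliz\xe9'

# ASCII cannot represent the accented character at position 12.
try:
    name.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 encodes it as two bytes and decodes back to the same string.
utf8_bytes = name.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')
```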
page = requests.get(url)
if (page.status_code >= 400):
  ...
else:
  ...
Server Errors
import sys
try:
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    print(e)
    sys.exit(1)
Exceptions
headers = {
    'User-Agent': 'Mozilla/3000'
}
response = requests.get(url, headers=headers)
Browser Disallowed
import time
for i in range(0, 10):
  url = base_url % i
  page = requests.get(url)
  time.sleep(1)
Rate Limiting/Slow Servers
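For slow or flaky servers, a fixed pause can be combined with retries. The following is a sketch, not part of the original talk: `fetch_with_retries` and its parameters are hypothetical names, and the fetch function is passed in so the helper works with `requests.get` or with a stand-in.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with a doubling pause."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # a real scraper would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(delay * (2 ** attempt))
    raise last_error
```

With requests this could be called as `fetch_with_retries(requests.get, url)`. Note that `requests.get` raises for connection problems but not for HTTP error codes, so the status check from the Server Errors slide is still needed.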
requests.get("http://greeneggs.ham/", params={'name': 'sam', 'verb': 'are'})
Query String
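To see what the `params` dictionary turns into, here is a sketch that builds the query string with the standard library. It should be close to the URL that requests constructs for the call above, though requests does the encoding itself.

```python
from urllib.parse import urlencode

params = {'name': 'sam', 'verb': 'are'}

# urlencode joins key=value pairs with '&' and escapes special characters.
query = urlencode(params)
url = "http://greeneggs.ham/" + "?" + query

print(url)
```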
requests.post("http://greeneggs.ham/", data={'name': 'sam', 'verb': 'are'})
POST a form
paritcipate
<strong>
  <em>foo</strong>
</em>
github.com/paulschreiber/nicar15
Many graphics from The Noun Project

Binoculars by Stephen West. Broken file by Maxi Koichi.

Broom by Anna Weiss. Chess by Matt Brooks.

Cube by Luis Rodrigues. Firewall by Yazmin Alanis.

Frown by Simple Icons. Lock by Edward Boatman.

Wrench by Tony Gines.
