Web Scraping with Python
• Dr Vatan Sehrawat
• Asst. Professor, Computer Sc. & Engg. Department
• RBS-SIET Zainabad
[email protected] • 8059211113
● What is scraping
● Why we scrape
● How do we do it
● Challenges
● Scrapy
Scraping
converting unstructured documents
into structured information
● Extract data from web pages
● Store the data in structured formats
● Access data not available directly or via APIs
What is Web Scraping?
● Web scraping (web harvesting) is a software
technique of extracting information from
websites
● It focuses on transforming unstructured
data on the web (typically HTML) into
structured data that can be stored and
analyzed
What is Web Scraping?
● Problem:
○ Static websites
○ No access to APIs to extract the data you
need
○ Need to extract data periodically
● Manual solution - go to the website and copy
the required data
● Smarter solution: Web Scraping
Why we scrape?
● Web pages contain a wealth of information (in
text form), designed mostly for human
consumption
● Static websites (legacy systems)
● Interfacing with 3rd parties that offer no API access
● Websites are often maintained with higher priority than APIs
● The data is already available (in the form of
web pages)
● Often no rate limiting
● Anonymous access
Tools for Scraping
● Scrapy
○ Python framework to extract data from webpages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
Getting started!
How do we do it?
Web Scraping in Python
● Download the webpage with urllib.request or requests
● Parse the page with BeautifulSoup/lxml
● Select with XPath or css selectors
Fetching the data
● Involves finding the endpoint - URL or URL’s
● Sending HTTP requests to the server
● Using the requests library:

import requests

# Fetch the page; .content holds the raw bytes, .text the decoded HTML
response = requests.get("http://google.com/")
html = response.text
Use BeautifulSoup for parsing
● Provides simple methods to:
○ search
○ navigate
○ select
● Deals with broken web-pages really well
● Auto-detects encoding
Philosophy:
“You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.”
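The three operations above can be sketched in a few lines. This is a minimal example, assuming bs4 is installed; the page content and CSS classes are made up for illustration:

```python
from bs4 import BeautifulSoup

# A small hypothetical page standing in for a downloaded document
html = """
<html><body>
<h1>Quotes</h1>
<ul class="quotes">
  <li><a href="/q/1">First quote</a></li>
  <li><a href="/q/2">Second quote</a></li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# search: find the first matching tag
title = soup.find("h1").get_text()            # "Quotes"

# navigate: move through the tree from a tag
first_link = soup.find("li").a["href"]        # "/q/1"

# select: CSS selectors over the whole document
links = [a["href"] for a in soup.select("ul.quotes a")]   # ['/q/1', '/q/2']
```

The `"html.parser"` argument picks Python's built-in parser; swapping in `"lxml"` (if installed) gives the same API with faster parsing.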
Export the data
● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
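For the CSV and JSON targets, the standard library is enough. A sketch, with hypothetical rows standing in for scraped items:

```python
import csv
import json

# Hypothetical rows produced earlier in the scraping pipeline
rows = [
    {"title": "Example Domain", "url": "https://example.com/"},
    {"title": "IANA", "url": "https://www.iana.org/"},
]

# JSON: one self-describing document
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

# CSV: header row taken from the dict keys
with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```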
Challenges
● External sites can change without warning
○ Figuring out how often they change is difficult (test, and
keep testing)
○ Changes can easily break scrapers
● Bad HTTP status codes
○ example: using 200 OK to signal an error
○ you cannot always trust your HTTP library's default
behaviour
● Messy HTML markup
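On the status-code point: requests, for example, returns the response object no matter what the server answered, so an error page can silently flow into your parser. A small offline sketch, using a hand-built Response as a stand-in for a real `requests.get()` result:

```python
import requests

# Offline stand-in for a real requests.get() result; pretend the
# server answered 404 (or 200 wrapping an error page in the body)
response = requests.models.Response()
response.status_code = 404

# By default requests hands the response back regardless of status
assert not response.ok        # .ok is False for 4xx/5xx

# Opt in to an exception instead of scraping an error page
try:
    response.raise_for_status()
except requests.HTTPError as err:
    print("request failed:", err)
```

Checking `response.ok` or calling `raise_for_status()` explicitly is cheap insurance; it does not help, though, against servers that signal errors with 200 OK, where you must inspect the body yourself.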
Scrapy - a framework for web scraping
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ define a model to store items
○ create your spider to extract items
○ write a Pipeline to store them
Scrapy - a fast, high-level screen scraping
and web crawling framework
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ Pick a website
○ Define the data you want to scrape
○ Write the spider to extract the data
○ Run the spider
○ Store the data
Why Scrapy
● Simple
● Fast
● Productive / extensible
● Portable
● Good documentation & healthy community
● Commercial support
Advanced Features (built in)
● Interactive shell for trying XPaths (useful for
debugging)
● Selecting and extracting data from HTML
sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading images and files
● Middlewares for cookies, HTTP
compression, caching, user-agent spoofing,
etc.