Web Scraping with Python
Virendra Rajput,
Hacker @Markitty
Agenda
● What is scraping
● Why we scrape
● My experiments with web scraping
● How do we do it
● Tools to use
● Online demo
● Some more tools
● Ethics for scraping
scraping:
converting unstructured documents
into structured information
What is Web Scraping?
● Web scraping (web harvesting) is a software
technique for extracting information from
websites
● It focuses on transforming unstructured
data on the web (typically HTML) into
structured data that can be stored and
analyzed
RSS is metadata, not a
replacement for HTML
Why we scrape?
● Web pages contain a wealth of information (in
text form), designed mostly for human
consumption
● Static websites (legacy systems)
● Interfacing with third parties that offer no API access
● Websites are more important than APIs
● The data is already available (in the form of
web pages)
● No rate limiting
● Anonymous access
How search engines use it
My Experiments with Scraping
and more..!
IMDb API
Did you mean!
Facebook Bot for Brahma
Kumaris
Getting started!
Fetching the data
● Involves finding the endpoint - URL or URLs
● Sending HTTP requests to the server
● Using requests library:
import requests
data = requests.get('http://google.com/')
html = data.content  # response body as bytes
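A slightly more defensive version of the fetch above might look like the following. This is only a sketch: the helper name `fetch_html`, the 10-second timeout, and the User-Agent string are illustrative assumptions, not something the slides prescribe.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Fetch a page and return its body as text.

    The function name, the timeout and the User-Agent value are
    hypothetical choices made for this sketch.
    """
    if session is None:
        session = requests.Session()
    response = session.get(
        url,
        timeout=timeout,  # fail instead of hanging on a slow server
        headers={"User-Agent": "example-scraper/0.1"},
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text  # body decoded to str


# Usage (performs a real HTTP request, so it is left commented out):
# html = fetch_html("http://google.com/")
```

Passing a session in explicitly keeps cookies and connections shared across several requests, and makes the function easy to exercise without touching the network.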
Processing (say no to Reg-ex)
● Avoid using reg-ex
● Reasons why not to use it:
1. It's fragile
2. Really hard to maintain
3. Improper HTML & encoding handling
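To make the fragility point concrete, here is a small sketch comparing a regular expression with a real parser on two renderings of the same link. The markup, the pattern, and the helper names are all made up for illustration, and the standard-library `html.parser` is used only to keep the example dependency-free.

```python
import re
from html.parser import HTMLParser

# Two renderings of the same link; only the attribute order differs.
PAGE_V1 = '<a class="title" href="/movie/1">Movie</a>'
PAGE_V2 = '<a href="/movie/1" class="title">Movie</a>'

# A pattern written against PAGE_V1 (hypothetical example).
LINK_RE = re.compile(r'<a class="title" href="([^"]+)">')


class TitleLinkParser(HTMLParser):
    """Collect href values of <a class="title"> tags in any attribute order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == "title":
            self.links.append(attributes.get("href"))


def links_with_regex(html):
    """Extract links with the regular expression."""
    return LINK_RE.findall(html)


def links_with_parser(html):
    """Extract links with an HTML parser."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links
```

The regular expression finds the link in `PAGE_V1` but silently returns nothing for `PAGE_V2`, while the parser-based version handles both.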
Use BeautifulSoup for parsing
● Provides simple methods to:
○ search
○ navigate
○ select
● Deals with broken web-pages really well
● Auto-detects encoding
Philosophy:
“You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.”
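The three kinds of methods listed above (search, navigate, select) can be sketched as follows. The HTML snippet and the variable names are invented for this example, and it assumes the `bs4` package (Beautiful Soup 4) is installed.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <li> and <ul> tags are never closed.
HTML = """
<html><body>
<h1 id="heading">Top movies</h1>
<ul class="movies">
  <li><a href="/m/1">First</a>
  <li><a href="/m/2">Second</a>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# search: find the first tag matching a name and attributes
heading = soup.find("h1", id="heading").get_text()

# navigate: move from a tag to its parent
first_link = soup.find("a")
parent_name = first_link.parent.name

# select: query with CSS selectors
hrefs = [a["href"] for a in soup.select("ul.movies a")]
```

Even though the list markup is broken, both links are still reachable through the `ul.movies a` selector.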
Export the data
● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
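As a rough illustration of the CSV and JSON options above, the sketch below serializes a list of scraped records with the standard library. The records and field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical scraped records; the field names are illustrative.
ROWS = [
    {"title": "First", "url": "/m/1"},
    {"title": "Second", "url": "/m/2"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize a list of dicts to pretty-printed JSON text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)
```

Writing to a string keeps the sketch self-contained; in practice the same writer can be pointed at a file opened with `newline=""`.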
Live example demo
Challenges
● External sites can change without warning
○ Figuring out how often they change is difficult (test, and test again)
○ Changes can break scrapers easily
● Bad HTTP status codes
○ example: using 200 OK to signal an error
○ You cannot always trust your HTTP library's default
behaviour
● Messy HTML markup
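One way to guard against the "200 OK that is really an error" case, and against silent layout changes, is to check the body as well as the status code. The function below is a sketch under assumptions: its name, the error phrases, and the idea of an "expected marker" are illustrative and would need tuning for a real site.

```python
def looks_like_error_page(status_code, body, expected_marker,
                          error_phrases=("page not found", "access denied")):
    """Return True when a response should not be trusted as a normal page.

    `expected_marker` is a snippet that every good page is assumed to
    contain (for example the opening tag of the results table).
    """
    if status_code != 200:
        return True
    lowered = body.lower()
    # Some sites answer 200 OK but render an error message in the body.
    if any(phrase in lowered for phrase in error_phrases):
        return True
    # A missing marker suggests the page layout changed.
    return expected_marker not in body
```

Running such a check on every response turns a silently broken scraper into one that fails loudly.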
Mechanize
● Stateful web browsing with
mechanize
○ Fill in forms
○ Follow links
○ Handle cookies
○ Browser history
● Modeled after Andy Lester's
WWW::Mechanize (Perl)
Filling forms with Mechanize
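The code shown on this slide is not preserved in the transcript. A minimal sketch of form filling, assuming the usual `mechanize.Browser` workflow (`open`, `select_form`, item assignment, `submit`), could look like this. The helper names, the URL, and the `q` field are hypothetical.

```python
def submit_form(browser, url, fields, form_index=0):
    """Open `url`, fill one form and return the body of the response.

    `browser` is expected to behave like mechanize.Browser.
    """
    browser.open(url)
    browser.select_form(nr=form_index)  # pick a form by its position
    for name, value in fields.items():
        browser[name] = value  # set a control by its name attribute
    response = browser.submit()
    return response.read()


def new_browser():
    """Create a mechanize browser (imported lazily; third-party package)."""
    import mechanize
    return mechanize.Browser()


# Usage (performs real HTTP requests, so it is left commented out):
# body = submit_form(new_browser(), "http://example.com/search", {"q": "python"})
```

Because the browser object keeps cookies and history between calls, the same instance can be reused for a login form followed by the pages behind it.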
Scrapy - a framework for web
scraping
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ define a model to store items
○ create your spider to extract items
○ write a Pipeline to store them
Conclusion
● Scrape wisely
● Do not steal
● Use cloud
● Share your scrapers: scraperwiki.com
The End!
Virendra Rajput
http://virendra.me/
http://twitter.com/bkvirendra