Web Scraping with Python
• Dr Vatan Sehrawat
• Asst. Professor, Computer Sc. & Engg. Department
• RBS-SIET Zainabad
[email protected] • 8059211113
● What is scraping
● Why we scrape
● How do we do it
● Challenges
● Scrapy
Scraping
converting unstructured documents
into structured information
● Extract data from web pages
● Store the data in structured formats
● Access data not available directly or via APIs
What is Web Scraping?
● Web scraping (web harvesting) is a software
technique of extracting information from
websites
● It focuses on transforming unstructured
data on the web (typically HTML) into
structured data that can be stored and
analyzed
What is Web Scraping?
● Problem:
○ Static websites
○ No access to APIs to extract the data you
need
○ Need to extract data periodically
● Manual solution - go to the website and copy
the required data
● Smarter solution: Web Scraping
Why we scrape?
● Web pages contain a wealth of information (in
text form), designed mostly for human
consumption
● Static websites (legacy systems)
● Interfacing with 3rd parties that offer no API access
● Websites are often maintained with higher priority than APIs
● The data is already available (in the form of
web pages)
● Often no rate limiting
● Anonymous access
Tools for Scraping
● Scrapy
○ Python framework to extract data from webpages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
Getting started!
How do we do it?
Web Scraping in Python
● Download the webpage with urllib.request or requests
● Parse the page with BeautifulSoup/lxml
● Select with XPath or css selectors
Fetching the data
● Involves finding the endpoint - URL or URL’s
● Sending HTTP requests to the server
● Using the requests library:

import requests

# Fetch the page; .content holds the raw bytes, .text the decoded HTML
response = requests.get("http://google.com/")
html = response.text
Use BeautifulSoup for parsing
● Provides simple methods to:
○ search
○ navigate
○ select
● Deals with broken web-pages really well
● Auto-detects encoding
Philosophy:
“You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.”
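The three operations above can be sketched in a few lines. This is a minimal example, assuming bs4 is installed; the page content and CSS classes are made up for illustration:

```python
from bs4 import BeautifulSoup

# A small hypothetical page standing in for a downloaded document
html = """
<html><body>
<h1>Quotes</h1>
<ul class="quotes">
  <li><a href="/q/1">First quote</a></li>
  <li><a href="/q/2">Second quote</a></li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# search: find the first matching tag
title = soup.find("h1").get_text()            # "Quotes"

# navigate: move through the tree from a tag
first_link = soup.find("li").a["href"]        # "/q/1"

# select: CSS selectors over the whole document
links = [a["href"] for a in soup.select("ul.quotes a")]   # ['/q/1', '/q/2']
```

The `"html.parser"` argument picks Python's built-in parser; swapping in `"lxml"` (if installed) gives the same API with faster parsing.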
Export the data
● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
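For the CSV and JSON targets, the standard library is enough. A sketch, with hypothetical rows standing in for scraped items:

```python
import csv
import json

# Hypothetical rows produced earlier in the scraping pipeline
rows = [
    {"title": "Example Domain", "url": "https://example.com/"},
    {"title": "IANA", "url": "https://www.iana.org/"},
]

# JSON: one self-describing document
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

# CSV: header row taken from the dict keys
with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```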
Challenges
● External sites can change without warning
○ Figuring out how often they change is difficult (test, and
keep testing)
○ Changes can easily break scrapers
● Bad HTTP status codes
○ example: using 200 OK to signal an error
○ you cannot always trust your HTTP library's default
behaviour
● Messy HTML markup
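On the status-code point: requests, for example, returns the response object no matter what the server answered, so an error page can silently flow into your parser. A small offline sketch, using a hand-built Response as a stand-in for a real `requests.get()` result:

```python
import requests

# Offline stand-in for a real requests.get() result; pretend the
# server answered 404 (or 200 wrapping an error page in the body)
response = requests.models.Response()
response.status_code = 404

# By default requests hands the response back regardless of status
assert not response.ok        # .ok is False for 4xx/5xx

# Opt in to an exception instead of scraping an error page
try:
    response.raise_for_status()
except requests.HTTPError as err:
    print("request failed:", err)
```

Checking `response.ok` or calling `raise_for_status()` explicitly is cheap insurance; it does not help, though, against servers that signal errors with 200 OK, where you must inspect the body yourself.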
Scrapy - a framework for web scraping
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ define a model to store items
○ create your spider to extract items
○ write a Pipeline to store them
Scrapy - a fast, high-level screen scraping
and web crawling framework
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ Pick a website
○ Define the data you want to scrape
○ Write the spider to extract the data
○ Run the spider
○ Store the data
Why Scrapy
● Simple
● Fast
● Productive / extensible
● Portable
● Good documentation & healthy community
● Commercial support
Advanced Features (built in)
● Interactive shell for trying XPaths (useful for
debugging)
● Selecting and extracting data from HTML
sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading images and files
● Middlewares for cookies, HTTP
compression, caching, user-agent spoofing,
etc.