Scraping to the rescue
(Web scraping using Python)
By: Satwik Kansal and Pradhvan Bisht
What is web scraping?
Web scraping is a technique for extracting large amounts of
data from websites and saving it to a local file on your
computer.
The data can then be used for several purposes, such as displaying
it on your own website or application, performing data analysis,
or anything else you need it for.
Getting started with Web Scraping in Python
Why should you scrape?
- API may not provide what you need
- No rate limit
- Take what you really want!
- Reduces manual effort
- Swag!
Things that might come in handy
- HTML
- CSS
- XPath
- Regular expressions
How it's done?
Broadly a three-step process
1. Getting the content (in most cases HTML).
2. Parsing the response.
3. Optimizing/improving the performance and preserving the data.
GETTING THE CONTENT
● Using modules like urllib, urllib2 (Python 2 only; urllib.request in Python 3), requests, mechanize and selenium.
● Involves a GET/POST request to the server.
● The response contains the information to be extracted.
● Sometimes not as easy as it may seem.
Extracting the data
1. Using regular expressions and basic Python
Tricky, complex and kind of fragile.
2. Using parsing libraries
- Two different approaches possible -- simple parsing and search-tree parsing.
- Some popular libraries are BeautifulSoup, lxml, and html5lib.
- Each module has its own techniques and thus its own pros and trade-offs.
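To make the "fragile" point concrete, here is a small sketch that extracts links from the same HTML twice: once with a regular expression and once with a parser. It uses the standard library's html.parser as a stand-in for the libraries named above, so it is not BeautifulSoup or lxml code. The sample HTML and the class name are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up sample markup: the second link has an extra attribute
# and uses single quotes around the href value.
HTML = """<ul>
  <li><a href="/a">First</a></li>
  <li><a class="x" href='/b'>Second</a></li>
</ul>"""

# Regex approach: only matches one exact way of writing the tag.
LINK_RE = re.compile(r'<a href="([^"]+)">')
regex_links = LINK_RE.findall(HTML)


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, however it is written."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


parser = LinkParser()
parser.feed(HTML)

print(regex_links)   # the regex misses the second link
print(parser.links)  # the parser finds both
```

The regex could be widened to cover the second case, but every new variation in the markup needs another patch, which is the trade-off the slide describes.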
Comparing parsers
BEAUTIFUL SOUP
LXML
SCRAPY
HTML5LIB
Preserving the data
1. Writing to a file.
2. Exporting as a CSV or Excel file.
3. Storing in a database.
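A short sketch of two of these options with the standard library: exporting to CSV and storing in an SQLite database. The sample rows, the column names and the `posts` table are assumptions for illustration only.

```python
import csv
import io
import sqlite3

# Made-up scraped records.
ROWS = [
    {"title": "First post", "votes": 12},
    {"title": "Second post", "votes": 7},
]


def to_csv(rows):
    """Return the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "votes"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_sqlite(rows, conn):
    """Insert the rows into a 'posts' table, creating it if needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts (title TEXT, votes INTEGER)")
    conn.executemany(
        "INSERT INTO posts (title, votes) VALUES (:title, :votes)", rows
    )
    conn.commit()
```

To write a real file, pass `open("posts.csv", "w", newline="")` to `csv.DictWriter` instead of the in-memory buffer, and use `sqlite3.connect("posts.db")` instead of an in-memory database.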
Examples
Example 1: Scraping tweets from Twitter using BeautifulSoup
and Python's Requests module
Code
Example 2: Scraping top Stack Overflow posts using Scrapy
Code
Example 3: Using Selenium to log in and fetch library
details from a university library site which uses dynamic
HTML.
WHAT TO USE WHERE
1. Handling dynamically generated HTML
Solutions: Selenium or SpiderMonkey
2. Cookie-based authentication
Solution: Requests module.
3. Simple scraping
Solutions: BeautifulSoup+Requests, Scrapy, Selenium
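For cookie-based authentication the slide recommends the Requests module. The sketch below shows the same idea with the standard library instead (http.cookiejar plus urllib.request), so it runs without extra installs: a cookie jar is attached to an opener, the login POST stores the session cookie, and later requests through the same opener send it back. The function names and the form field names `username` and `password` are assumptions; a real site will use its own.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return an opener that remembers cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, username, password):
    """POST the login form; any cookies set by the site land in the jar."""
    # Hypothetical field names; inspect the real login form to find them.
    form = urlencode({"username": username, "password": password})
    with opener.open(login_url, data=form.encode("utf-8"), timeout=10) as resp:
        return resp.status


def fetch_private(opener, url):
    """Fetch a page that requires the session cookie."""
    with opener.open(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```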
Scraping hacks
1. Overcoming captchas
Lookup tables, one-time manual entry, Death By Captcha (paid service)
2. Per-IP-address query limit
Using tsocks, ssh -D and socks monkey.
3. Improving performance
Multiprocessing, gevent and the requests.async module (since split out into the separate grequests library).
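As a rough illustration of the performance point, pages can be fetched concurrently, because most of the time is spent waiting on the network. This sketch uses the standard library's concurrent.futures thread pool rather than the multiprocessing or gevent options named above. The `fetch_all` helper is hypothetical, and the `fetch` argument stands for whatever function downloads one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL in parallel; return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Keep `max_workers` modest: more parallel requests also means more load on the site being scraped.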
Example 3
Automating my college library
Problems:
1. Authentication
2. Dynamically generated <iframe> tag
Solution
Selenium with a headless browser like PhantomJS
Alternative: Mechanize
Code
Ethics of scraping
Exceeding authorized use of the site
Means doing anything that is prohibited in the Terms of Use
(See CFAA, breach of contract, unjust enrichment, trespass
to chattels, and various state laws similar to CFAA)
Copyright Issues
If the material you are scraping is not factual, but
something that required some amount of creativity to create,
you have copyright to worry about.
Quick tip -- Conform to the robots.txt file.
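One way to follow this tip in code is the standard library's urllib.robotparser, which reads a robots.txt file and answers whether a given user agent may fetch a given URL. A minimal sketch; the `allowed` helper, the sample rules and the user-agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


print(allowed(SAMPLE_ROBOTS, "example-scraper", "https://example.com/index.html"))
print(allowed(SAMPLE_ROBOTS, "example-scraper", "https://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string.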
● The brute-force way to get the information required.
● Absolutely legal
● Not always that easy.