Scraping Data from the Web using
Scrapy & Beautiful Soup
Nithish Raghunandanan
nithishr@gmail.com
PyData Munich | 8th November 2017
About Me
● MSc. Informatics Student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
■ nithishr1
@nithishr
What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
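As a rough illustration of "extract, then store in a structured format", the sketch below uses only the Python standard library to turn an HTML table into CSV. The sample HTML, the class name `TableRows`, and the helper `html_table_to_csv` are made up for this example and are not from the talk.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def html_table_to_csv(html):
    """Parse the HTML and return its table rows as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample page content.
SAMPLE = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Acme Bakery</td><td>Munich</td></tr>
  <tr><td>Blue Cafe</td><td>Berlin</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE))
```

In practice a library such as Beautiful Soup replaces the hand-written parser; the point here is only the shape of the pipeline: HTML in, rows out.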
Use Cases
Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
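To give a feel for Beautiful Soup as an HTML parser, here is a minimal sketch that pulls a heading and a list of links out of an inline HTML string. It assumes the `beautifulsoup4` package is installed; the HTML snippet and the `#shops` selector are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up page content; a real scraper would get this from an HTTP response.
HTML = """
<html><body>
  <h1>Example Directory</h1>
  <ul id="shops">
    <li><a href="/shops/1">Acme Bakery</a></li>
    <li><a href="/shops/2">Blue Cafe</a></li>
  </ul>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra dependency.
soup = BeautifulSoup(HTML, "html.parser")

# Text of the first <h1> element.
title = soup.h1.get_text(strip=True)

# CSS selector: every <a> inside the element with id="shops".
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("#shops a")]

print(title)
print(links)
```

Beautiful Soup only parses; downloading the page is left to another tool such as Requests or Scrapy.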
Tutorial on Web Scraping in Python
Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File present on the server specifying access limits to bots
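A spider can check robots.txt before fetching a page. The sketch below uses the standard library's `urllib.robotparser` on an inline, made-up robots.txt; the bot name `example-bot` and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; normally it is fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-bot"):
    """Return True if robots.txt lets the given user agent fetch the URL."""
    return rp.can_fetch(agent, url)


print(allowed("https://example.com/shops/1"))       # public path
print(allowed("https://example.com/private/data"))  # disallowed path
print(rp.crawl_delay("example-bot"))                # requested delay in seconds
```

Scrapy can do this check itself when the `ROBOTSTXT_OBEY` setting is enabled.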
Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings not too friendly to website owners
○ Built-in AutoThrottle extension
● CAPTCHAs
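The politeness point can be sketched as a fragment of a Scrapy project's `settings.py`. The setting names are Scrapy's own; the values and the user agent string are illustrative assumptions, not the settings used in the talk.

```python
# settings.py (fragment): illustrative values, tune per site.

# Respect the access limits declared in robots.txt.
ROBOTSTXT_OBEY = True

# Identify the bot; the name and contact URL are placeholders.
USER_AGENT = "example-bot (+https://example.com/bot-info)"

# Baseline delay between requests and a cap on parallel requests per domain.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# AutoThrottle adjusts the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as the lower bound for the delay that the extension computes.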
Why Yellow Pages?
Email Marketing for Customer Acquisition
Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non-transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap
nithishr1
@nithishr
nithishr@gmail.com
Connect
Nithish Raghunandanan
www.ki-labs.com
Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping