Scraping Data from
Documents and the Web
Tommy Tavenner
National Wildlife Federation
What is it?
© 2014 Tommy Tavenner
What is Scraping?
• Converting data from human readable into machine readable
• This data is sometimes referred to as ‘unstructured’ but is really
just not structured properly for systematic parsing
• The data is often embedded in layers of formatting metadata.
Think HTML or PDF formatting like font colors and tables.
• The job of the scraper is to separate the data from the
formatting, in some cases even using the formatting to interpret
the data.
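As a minimal sketch of separating data from formatting, the snippet below uses Python's standard-library `html.parser` to discard the markup and keep only the text. The class name `TextExtractor`, the helper `strip_formatting`, and the sample markup are illustrative assumptions, not part of the original talk.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of a page and discards the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip pure whitespace.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_formatting(html):
    """Return the text fragments of `html` in document order."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.chunks


# Hypothetical sample: two data values wrapped in presentational markup.
SAMPLE = '<p><font color="red">Site A</font> <b>12</b></p>'
print(strip_formatting(SAMPLE))
```

The font color and bold tags carry no data here, so they are simply dropped; a scraper that uses formatting to interpret the data would inspect the tags in `handle_starttag` instead.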
Is it Legal?
Maybe!
Is Scraping Legal?
• It depends
• Most publicly available data in the US falls within the sphere
of copyright protection.
> Creativity in producing the source data
> The manner in which the data is presented
> Fair Use on the web
• What is the purpose of the scraping?
Is Scraping Legal?
• Terms of Service
> Does it explicitly prohibit scraping?
> Does it prohibit storing information privately?
Is Scraping Legal?
• Feist v. Rural Telephone (1991)
> Feist, a phone book compiler in Kansas, copied the contents of
Rural Telephone’s directory after Rural refused to license the
information.
> Rural sued Feist for copyright infringement. Because of the nature
of the information, the case eventually made it to the Supreme
Court.
> The case centered on originality and whether compiling facts
constitutes an original work.
> The court ruled that the phone directory did not constitute an
original compilation because no discretion was exercised in
deciding on contents.
Is Scraping Legal?
• LinkedIn case (2014)
> LinkedIn is suing a group of unknown defendants in California.
> LinkedIn alleges that this group used a series of bots and fake
profiles on the site to scrape content from other member profiles.
> The case is based on the Digital Millennium Copyright Act.
Jargon
• Spider – Searches for links within content and follows them,
building up a site map or web of content.
• Crawler – Synonym for Spider
• Training Data – As in supervised machine learning, training
data is used to teach a spider how to interpret the content it
will be processing.
• IP Proxy/Switching – Regular switching of IP address used to
bypass restrictions on the number of connections per client set
by web servers. May be a sign of less than legal or honorable
intent in scraping.
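To make the spider/crawler term concrete, here is a rough sketch of one that follows links breadth-first and builds a site map. To stay self-contained it crawls a hypothetical three-page site held in a dictionary rather than the network; `LinkCollector`, `crawl`, and `PAGES` are assumed names for illustration only.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start, fetch):
    """Breadth-first crawl from `start`; returns {page: [links found]}.

    `fetch` is any callable that maps a URL to its HTML, or None if the
    page is unavailable.
    """
    site_map = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in site_map:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # page not available, nothing to follow
        collector = LinkCollector()
        collector.feed(html)
        site_map[url] = collector.links
        queue.extend(collector.links)
    return site_map


# Hypothetical three-page site held in memory, so no network is needed.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/data">Data</a>',
    "/about": '<a href="/">Home</a>',
    "/data": '<a href="/about">About</a> <a href="/missing">Gone</a>',
}

site_map = crawl("/", PAGES.get)
```

Swapping `PAGES.get` for a function that performs a real HTTP request would turn this into a live crawler, at which point the legal and connection-limit concerns above apply.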
Anatomy of a Scraper
Document Load
• Pull in the complete web page, PDF, XML, etc.
Parsing
• Parse the HTML, XML, or PDF metadata into something the script can understand
Extraction
• Use the results of parsing to extract the data we are looking for
Transformation
• Convert the data into useful formats, e.g. currency, dates, etc.
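The four stages can be sketched as four small functions chained together. This is only an illustration under stated assumptions: the input is a hypothetical, well-formed table, the function names are invented, and the standard-library `xml.etree.ElementTree` stands in for a real HTML parser.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def load(source):
    # Stage 1, document load: here the "document" is already a string.
    return source


def parse(text):
    # Stage 2, parsing: turn the markup into a searchable element tree.
    return ET.fromstring(text)


def extract(tree):
    # Stage 3, extraction: pull the raw cell text out of every table row.
    return [[cell.text for cell in row.findall("td")] for row in tree.iter("tr")]


def transform(rows):
    # Stage 4, transformation: convert raw strings into typed values.
    return [
        (name, datetime.strptime(day, "%m/%d/%Y").date(), float(amount.lstrip("$")))
        for name, day, amount in rows
    ]


# Hypothetical, well-formed sample table.
DOCUMENT = (
    "<table>"
    "<tr><td>Permit A</td><td>03/14/2014</td><td>$25.50</td></tr>"
    "<tr><td>Permit B</td><td>04/01/2014</td><td>$100.00</td></tr>"
    "</table>"
)

records = transform(extract(parse(load(DOCUMENT))))
```

Keeping the stages separate means a change in the source layout usually touches only `parse` or `extract`, while output requirements touch only `transform`.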
Anatomy of a Scraper
Document Load
• Load the entire document or HTML
page, generally as a string of
characters.
• For larger documents this may involve
splitting it into multiple pages.
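A small sketch of the load step: read the whole document as one string, then split it into pages. It assumes pages are separated by a form feed character, which is what text extractors such as pdftotext emit between PDF pages; the function names and the temporary sample file are hypothetical.

```python
import os
import tempfile


def load_document(path, encoding="utf-8"):
    """Read the whole document into memory as one string."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def split_pages(text, separator="\f"):
    """Split a document into pages, dropping empty ones.

    Assumption: pages are separated by a form feed character.
    """
    return [page.strip() for page in text.split(separator) if page.strip()]


# Hypothetical two-page document written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Page one text\n\fPage two text\n")

document = load_document(tmp.name)
pages = split_pages(document)
os.remove(tmp.name)
```

Processing one page at a time keeps memory use and parsing errors contained to a single page of a large document.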
Anatomy of a Scraper
Parsing
• Interpret the document to make searching
possible.
• Biggest potential failure point
• Specific to the source data.
• HTML Document Object Model
• PDF Grid Model
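The sketch below parses markup into a Document Object Model with the standard-library `xml.dom.minidom` and shows why parsing is a common failure point: a single unclosed tag makes a strict parser give up. The helper name `parse_markup` and both samples are assumptions for illustration; real-world HTML usually calls for a more forgiving parser.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_markup(text):
    """Parse well-formed markup into a DOM tree, or return None on failure."""
    try:
        return minidom.parseString(text)
    except ExpatError:
        # Real-world HTML is often not well formed; a production scraper
        # would fall back to a more forgiving HTML parser here.
        return None


# Hypothetical samples: one well formed, one with an unclosed <li> tag.
GOOD = "<ul><li>report.pdf</li><li>summary.pdf</li></ul>"
BAD = "<ul><li>report.pdf<li>summary.pdf</li></ul>"

dom = parse_markup(GOOD)
items = [li.firstChild.data for li in dom.getElementsByTagName("li")]
broken = parse_markup(BAD)
```

Once the tree exists, searching it (`getElementsByTagName` here) is straightforward; the hard part is getting a tree at all from messy source data.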
Anatomy of a Scraper
Extraction
• Search parsed data for particular
pieces of information
• e.g. file name, link, or table
• Separate data into individual pieces for
later processing
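As an illustration of extraction, this sketch searches an already parsed page for links to PDF files and separates each hit into a title, a link, and a file name. The page content and the function name `extract_pdf_links` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed page: a list of links, two of which are PDFs.
PAGE = ET.fromstring(
    "<div>"
    '<a href="/files/2013-report.pdf">2013 report</a>'
    '<a href="/about">About us</a>'
    '<a href="/files/2014-report.pdf">2014 report</a>'
    "</div>"
)


def extract_pdf_links(tree):
    """Return one dict per link that points at a PDF file."""
    found = []
    for anchor in tree.iter("a"):
        href = anchor.get("href", "")
        if href.lower().endswith(".pdf"):
            found.append({
                "title": anchor.text,
                "link": href,
                "file_name": href.rsplit("/", 1)[-1],
            })
    return found


links = extract_pdf_links(PAGE)
```

Each record is still raw text at this point; converting values into proper types is left to the transformation stage.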
Anatomy of a Scraper
Transformation
• Convert data into proper output
• Apply standards
• Change type
• e.g. date string → date
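A minimal sketch of the transformation step, converting a date string into a date and a currency string into a number. The input formats, the helper names `to_date` and `to_amount`, and the sample values are assumptions; the month-name format also assumes the default English locale.

```python
from datetime import datetime
from decimal import Decimal


def to_date(text, fmt="%B %d, %Y"):
    """Convert a date string such as 'March 14, 2014' into a date object."""
    return datetime.strptime(text.strip(), fmt).date()


def to_amount(text):
    """Convert a currency string such as '$1,250.00' into a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


# Hypothetical raw values as they might come out of the extraction stage.
raw = {"published": " March 14, 2014 ", "budget": "$1,250.00"}
clean = {
    "published": to_date(raw["published"]).isoformat(),  # ISO 8601 date
    "budget": to_amount(raw["budget"]),
}
```

`Decimal` is used instead of `float` so that currency values are not subject to binary rounding, and the ISO 8601 date is one example of applying a standard to the output.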
Visual Scraping tools
• Require no programming knowledge
• Primarily web-based
• Allow quick access to data
• Because they are not bespoke, they may require more scrubbing of
the data after scraping
ScraperWiki
• Paid Service with very basic free plan
• Focused on table extraction and Twitter data
• Takes a single page or document as its source
ScraperWiki
• Allows you to quickly access the data or summarize it.
• Works well with PDFs of tables but struggles with mixed data.
Import.io
• In early stages, currently free with professional accounts
• Downloadable Java app – multi-platform
• Focused more on crawling sites to build up data sources
• Offers limited training or refining abilities to make sure it
extracts data correctly.
• Enables access to the data source either as a downloadable
file or as an API.
Import.io
• Data can be extracted either for a single page or a full site
Scrapinghub
• Designed for much larger scraping jobs, including multi-site
Scrapinghub
• Sits somewhere between a visual scraper and a scraping
library.
• Custom scrapers may be developed in Python and hosted by
Scrapinghub
• The autoscraper allows annotating pages and training the
scraper
• The crawler starts with a single page and works out from there,
following links on the pages it finds and quickly building large
databases.
Scraping with a scripting language
• Libraries are available in most languages.
• Primarily make it easier to understand a certain format, e.g.
HTML or PDF.
• Require strong knowledge of the language
• Require more fine tuning but result in much higher quality data
R
• scrapeR – for parsing HTML/XML
• XML package – for parsing HTML/XML
• tm – for parsing PDFs using Xpdf or Poppler engines
Python
• ScraperWiki
• Scrapy
• BeautifulSoup – for parsing HTML
• XPath
• PDFMiner – for parsing PDFs
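XPath is a query language rather than a library. As a rough sketch that needs nothing installed, the standard-library `xml.etree.ElementTree` supports a limited XPath subset, which is enough to pick cells out of one table and ignore another. The sample document and variable names are hypothetical; the libraries listed above offer fuller support for messy real-world pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with two tables.
DOC = ET.fromstring(
    "<html><body>"
    '<table id="counts">'
    "<tr><td>Site A</td><td>12</td></tr>"
    "<tr><td>Site B</td><td>7</td></tr>"
    "</table>"
    '<table id="notes"><tr><td>ignore me</td></tr></table>'
    "</body></html>"
)

# Select the first cell of every row in the table whose id is "counts".
names = [cell.text for cell in DOC.findall(".//table[@id='counts']/tr/td[1]")]

# Select the second cell of each of those rows and convert it to an integer.
counts = [int(cell.text) for cell in DOC.findall(".//table[@id='counts']/tr/td[2]")]
```

The attribute predicate `[@id='counts']` restricts the search to one table, and the positional predicates `td[1]` and `td[2]` pick columns by position.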
PHP
• Simple HTML DOM
• PDF Parser
JavaScript
• Node.js using Request and Cheerio
• jsPDF
• pdf2json
