Web Scraping And Analytics With Python
For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : sales@edureka.co
View Mastering Python course details at http://www.edureka.co/python
Slide 2 www.edureka.co/python
At the end of this module, you will be able to
Objectives
 What is Web Scraping
 BeautifulSoup Scraping Package
 Scraping IMDB WebPage
 PyDoop Package for Analytics
Slide 3 www.edureka.in/python
Web Scraping
Web scraping (web harvesting or web data extraction) is a computer software technique for extracting information from websites
» Web scraping is a method for pulling data from the structured (or not so structured) HTML that makes up a web page
» Most websites do not offer any functionality to save a copy of the data they display to local storage; we can only view it on the web
» If we have to store the data, we need to manually copy and paste what the browser displays into a local file – a very tedious job that can take many hours or sometimes days to complete
» Imagine getting data from the Google Finance page to gather historical information about multiple companies. We can automate the process with a few lines of code and get the desired result; if the information changes, we simply rerun the code instead of doing the work manually again
Slide 4 www.edureka.in/python
Web Scrape - Why ?
Many websites do not provide an API (unlike Facebook, Twitter, etc.), so web scraping is often the only way to get data from them
A quick, easy and free way to gather large amounts of data with considerably less effort
Saves the manual effort of copying and storing data
If done manually, the only way to keep the data is in text format, which again needs to be converted to JSON or XML for further processing
Slide 5 www.edureka.in/python
Web Scraping
Python has numerous libraries for approaching this type of problem, many of which are incredibly powerful
Popular web scraping Python packages:
» Pattern
» Requests
» Scrapy
» BeautifulSoup
» Mechanize
In this course we cover Beautiful Soup, which is the most popular of the lot
These packages can also work together
Slide 6 www.edureka.in/python
Typical HTML structure
Note: Save this file with a .html extension
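The HTML file itself appears on the slide as an image and is not reproduced here. As an illustrative stand-in (the contents and file name are assumptions, not the original slide's), the snippet below writes a minimal page with the typical head/body structure to simple.html, which the parsing examples later in the deck can reuse:

# Illustrative stand-in for the slide's HTML example (not the original figure).
# Saving it as simple.html lets the later BeautifulSoup(open("simple.html"))
# example run end to end.
html_doc = """<html>
  <head>
    <title>My First Web Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p class="price">New Rate</p>
    <a href="http://www.edureka.co/python">Mastering Python</a>
  </body>
</html>
"""

with open("simple.html", "w") as f:
    f.write(html_doc)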
Slide 7 www.edureka.in/python
BeautifulSoup Installation
If you run Debian or Ubuntu, you can install Beautiful Soup with the system package manager:
» sudo apt-get install python-bs4
To install from PyPI:
» easy_install beautifulsoup4
or
pip install beautifulsoup4
If you have downloaded the source tarball and want to install manually:
» python setup.py install
Refer to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup to avoid installation-related errors and to install other useful packages such as the lxml parser
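A quick sanity check after installing (not from the slides, just a convenient way to confirm the package is importable):

# Confirm the package imports and report its version.
import bs4
print(bs4.__version__)

from bs4 import BeautifulSoup
print(BeautifulSoup("<p>ok</p>", "html.parser").p.get_text())  # prints: ok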
Slide 8 www.edureka.in/python
BeautifulSoup for Parsing a Doc
To parse a document, pass it into the BeautifulSoup constructor
We can pass in a string or an open filehandle
Example:
from bs4 import BeautifulSoup
# Using a stored HTML file
soup = BeautifulSoup(open("simple.html"))
# An entire HTML doc can be passed as a string
soup = BeautifulSoup("<html>data</html>")
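The constructor also accepts an explicit parser name; a small sketch (assuming the simple.html file from earlier, and lxml only if it is installed):

from bs4 import BeautifulSoup

# Naming the parser explicitly makes the behaviour predictable when several
# parsers are installed.
soup = BeautifulSoup(open("simple.html"), "html.parser")
# soup = BeautifulSoup(open("simple.html"), "lxml")  # if lxml is installed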
Slide 9 www.edureka.in/python
Different Objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
Example: soup = BeautifulSoup('<b class="price">New Rate </b>')
Tag Object:
» A Tag object corresponds to an HTML tag in the original document.
Attributes:
» A tag may have any number of attributes.
» The tag <p class="price"> has an attribute "class" whose value is "price".
» You can access a tag's attributes by treating the tag like a dictionary:
» tag['class']
» All of a tag's attributes can be accessed at once via .attrs
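A short sketch of the Tag and attribute access described above, using the example markup from this slide:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="price">New Rate </b>', "html.parser")
tag = soup.b                  # Tag object for the <b> element

print(type(tag))              # <class 'bs4.element.Tag'>
print(tag.name)               # b
print(tag['class'])           # ['price'] -- dictionary-style access
print(tag.attrs)              # {'class': ['price']}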
Slide 10 www.edureka.in/python
Different Objects (Contd.)
NavigableString Object:
» Beautiful Soup uses the NavigableString class to contain bits of text within a tag.
Comment Object:
» A Comment is a special type of NavigableString.
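A short sketch showing both object types (the comment markup is an assumption added for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="price">New Rate </b>', "html.parser")
print(type(soup.b.string))           # <class 'bs4.element.NavigableString'>
print(soup.b.string)                 # New Rate

comment_soup = BeautifulSoup("<p><!-- promo ends soon --></p>", "html.parser")
print(type(comment_soup.p.string))   # <class 'bs4.element.Comment'>
print(comment_soup.p.string)         # promo ends soon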
Slide 11 www.edureka.in/python
All Supported Operations on TAG Object
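The operations table on this slide is an image; as a hedged substitute, the sketch below runs a few of the commonly used Tag operations (searching, navigation and attribute access) against the simple.html file created earlier:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("simple.html"), "html.parser")

first_p = soup.find("p")               # first matching tag
all_links = soup.find_all("a")         # list of every matching tag

print(first_p.get_text())              # text inside the tag
print(first_p.parent.name)             # navigate upwards: body
print([c.name for c in soup.body.children if c.name])  # direct child tags
print(all_links[0]["href"])            # attribute access on a found tag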
Slide 12 www.edureka.in/python
soup.prettify()
Used when the HTML doc looks jumbled and we want to see it structured, with each tag parsed onto its own line
It helps to visualize the HTML tags better, so parent, child and sibling tags can be spotted easily
Example HTML doc:
Slide 13 www.edureka.in/python
Example HTML Doc For Reference
Slide 14 www.edureka.in/python
Prettifying the Example
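The prettified document on these slides is shown as an image; a minimal sketch of what soup.prettify() produces (using a small inline document rather than the slide's original):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p class='price'>New Rate</p></body></html>",
                     "html.parser")
print(soup.prettify())
# <html>
#  <body>
#   <p class="price">
#    New Rate
#   </p>
#  </body>
# </html>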
Slide 15 www.edureka.in/python
Scraping IMDB Webpage
Let's scrape IMDb to find the top movies released during 2005 to 2014, sorted by number of user votes. We need to pull
the title, year, genres, runtime, rating and image source info for each movie
Step 1: Go to the base IMDb search URL: http://www.imdb.com/search/title
Step 2: We will use the full URL that the IMDb website builds when we enter our criteria:
http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=2005,2014
Slide 16 www.edureka.in/python
Target Page
Slide 17 www.edureka.in/python
Find the Required Fields in the Source
 Step 3:
» Right click on the webpage and choose "Inspect element" (in Chrome)
» Hover your mouse over the source pane at the bottom to highlight the corresponding location in the web page
» Example: see the title selected
Slide 18 www.edureka.in/python
Finding Other Fields
To find "genre" we will have to inspect further down the source
Since a movie can have multiple genres, we will have to read them in a loop (handled in the scraping sketch below)
Slide 19 www.edureka.in/python
Using the Full URL
Step 4:
» Access the URL
Direct URL passing approach:
Step 5: Pass the response to BeautifulSoup
» Build the main logic
Step 6: Format the output
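The code on these slides appears as screenshots; below is a minimal end-to-end sketch of Steps 4-6 using Requests and BeautifulSoup. The CSS class names are assumptions (IMDb's markup changes over time), so inspect the live page as in Step 3 and adjust them to whatever the current source actually uses:

import requests
from bs4 import BeautifulSoup

url = ("http://www.imdb.com/search/title"
       "?sort=num_votes,desc&start=1&title_type=feature&year=2005,2014")

# Step 4: access the URL
response = requests.get(url)

# Step 5: pass the downloaded HTML to BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Step 6: pull the required fields for each movie and format the output.
# "lister-item", "lister-item-year", "genre" and "runtime" are assumed class
# names -- verify them against the page source before relying on them.
for movie in soup.find_all("div", class_="lister-item"):
    title = movie.h3.a.get_text(strip=True)
    year = movie.h3.find("span", class_="lister-item-year")
    genre = movie.find("span", class_="genre")
    runtime = movie.find("span", class_="runtime")
    rating = movie.find("strong")
    image = movie.find("img")
    # A movie can have several genres, so split the comma-separated string
    genres = [g.strip() for g in genre.get_text().split(",")] if genre else []

    print("{} | {} | {} | {} | {} | {}".format(
        title,
        year.get_text(strip=True) if year else "N/A",
        ", ".join(genres) or "N/A",
        runtime.get_text(strip=True) if runtime else "N/A",
        rating.get_text(strip=True) if rating else "N/A",
        image["src"] if image else "N/A"))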
Slide 20 www.edureka.in/python
Formatting the Output - Sample Shown
Slide 21 www.edureka.co/python
PyDoop – Hadoop with Python
 The PyDoop package provides a Python API for Hadoop MapReduce and HDFS
 PyDoop has several advantages over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython
 One of the biggest advantages of PyDoop is its HDFS API, which allows you to connect to an HDFS installation, read and write files, and get information on files, directories and global file system properties
 PyDoop's MapReduce API allows you to solve many complex problems with minimal programming effort. Advanced MapReduce concepts such as 'Counters' and 'Record Readers' can be implemented in Python using PyDoop
Python can be used to write Hadoop MapReduce programs and applications that access the HDFS API for Hadoop via the PyDoop package
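A small sketch of the HDFS side of the API (the cluster, directory names and file names below are placeholders, and the exact function signatures should be checked against the PyDoop documentation for your version):

import pydoop.hdfs as hdfs

# List an HDFS directory and read a file straight from HDFS
print(hdfs.ls("/user/edureka"))
with hdfs.open("/user/edureka/input/sample.txt") as f:
    print(f.read())

# Copy files between the local file system and HDFS
hdfs.put("local_report.txt", "/user/edureka/reports/report.txt")
hdfs.get("/user/edureka/reports/report.txt", "local_copy.txt")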
Slide 22 www.edureka.co/python
Demo: Python NLTK on Hadoop
Leveraging the analytical power of Python on a big data set (MapReduce + NLTK)
Perform stop word removal using MapReduce (sketched below)
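The demo code itself is not shown on the slide; one way to sketch its core idea is a streaming-style mapper that drops NLTK's English stop words (file names, paths and the job wiring below are assumptions for illustration):

#!/usr/bin/env python
# mapper.py -- emit only the words that are not NLTK English stop words.
# Intended to run as the -mapper of a Hadoop Streaming job; a reducer can
# then sum the counts per word.
import sys
from nltk.corpus import stopwords   # requires: nltk.download('stopwords')

STOP_WORDS = set(stopwords.words("english"))

for line in sys.stdin:
    for word in line.strip().split():
        token = word.lower().strip(".,!?;:\"'()")
        if token and token not in STOP_WORDS:
            # key<TAB>value pairs, the format Hadoop Streaming expects
            print("{}\t1".format(token))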
Slide 23 www.edureka.co/python
Questions
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Slide 24 Course URL