Web Scraping And Analytics With Python
For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : sales@edureka.co
View Mastering Python course details at http://www.edureka.co/python
Slide 2 www.edureka.co/python
At the end of this module, you will be able to
Objectives
 What is Web Scraping
 BeautifulSoup Scraping Package
 Scraping IMDB WebPage
 PyDoop Package for Analytics
Slide 3 www.edureka.in/python
Web Scraping
Web scraping (web harvesting or web data extraction) is a computer software technique for extracting information from websites
» Web scraping is a method for pulling data from the structured (or not so structured) HTML that makes up a web page
» Most websites do not offer any functionality to save a copy of the data they display to local storage; we can only view it on the web
» If we have to store the data, we need to manually copy and paste what the browser displays into a local file – a very tedious job that can take many hours or sometimes days to complete
» Imagine getting data from the Google Finance page to gather historical information about multiple companies. We can automate the process with a few lines of code and get the desired result; if the information changes, we simply rerun the code instead of doing the work manually again
Slide 4 www.edureka.in/python
Web Scrape - Why ?
Many websites do not provide an API (unlike Facebook, Twitter, etc.), so web scraping is often the only way to get data from them
A quick, easy and free way to gather large amounts of data with considerably less effort
Saves the manual effort of copying and storing data
If done manually, the only way to keep the data is in text format, which again needs to be converted to JSON or XML for further processing
Slide 5 www.edureka.in/python
Web Scraping
Python has numerous libraries for approaching this type of problem, many of which are incredibly powerful
Popular web scraping Python packages:
» Pattern
» Requests
» Scrapy
» BeautifulSoup
» Mechanize
In this course we cover Beautiful Soup, which is the most popular of the lot
These packages can also work together
Slide 6 www.edureka.in/python
Typical HTML structure
Note: Save this file with a .html extension
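The HTML file itself appears on the slide as an image and is not reproduced here. As an illustrative stand-in (the contents and file name are assumptions, not the original slide's), the snippet below writes a minimal page with the typical head/body structure to simple.html, which the parsing examples later in the deck can reuse:

# Illustrative stand-in for the slide's HTML example (not the original figure).
# Saving it as simple.html lets the later BeautifulSoup(open("simple.html"))
# example run end to end.
html_doc = """<html>
  <head>
    <title>My First Web Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <p class="price">New Rate</p>
    <a href="http://www.edureka.co/python">Mastering Python</a>
  </body>
</html>
"""

with open("simple.html", "w") as f:
    f.write(html_doc)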
Slide 7 www.edureka.in/python
BeautifulSoup Installation
If you run Debian or Ubuntu, you can install Beautiful Soup with the system package manager:
» sudo apt-get install python-bs4
To install from PyPI:
» easy_install beautifulsoup4
or
pip install beautifulsoup4
If you have downloaded the source tarball and want to install manually:
» python setup.py install
Refer to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup to avoid installation-related errors and to install other useful packages such as the lxml parser
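A quick sanity check after installing (not from the slides, just a convenient way to confirm the package is importable):

# Confirm the package imports and report its version.
import bs4
print(bs4.__version__)

from bs4 import BeautifulSoup
print(BeautifulSoup("<p>ok</p>", "html.parser").p.get_text())  # prints: ok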
Slide 8 www.edureka.in/python
BeautifulSoup for Parsing a Doc
To parse a document, pass it into the BeautifulSoup constructor
We can pass in a string or an open filehandle
Example:
from bs4 import BeautifulSoup
# Using a stored HTML file
soup = BeautifulSoup(open("simple.html"))
# An entire HTML doc can be passed as a string
soup = BeautifulSoup("<html>data</html>")
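The constructor also accepts an explicit parser name; a small sketch (assuming the simple.html file from earlier, and lxml only if it is installed):

from bs4 import BeautifulSoup

# Naming the parser explicitly makes the behaviour predictable when several
# parsers are installed.
soup = BeautifulSoup(open("simple.html"), "html.parser")
# soup = BeautifulSoup(open("simple.html"), "lxml")  # if lxml is installed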
Slide 9 www.edureka.in/python
Different Objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
Example: soup = BeautifulSoup('<b class="price">New Rate </b>')
Tag Object:
» A Tag object corresponds to an HTML tag in the original document.
Attributes:
» A tag may have any number of attributes.
» The tag <p class="price"> has an attribute "class" whose value is "price".
» You can access a tag's attributes by treating the tag like a dictionary:
» tag['class']
» All of a tag's attributes can be accessed at once via .attrs
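A short sketch of the Tag and attribute access described above, using the example markup from this slide:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="price">New Rate </b>', "html.parser")
tag = soup.b                  # Tag object for the <b> element

print(type(tag))              # <class 'bs4.element.Tag'>
print(tag.name)               # b
print(tag['class'])           # ['price'] -- dictionary-style access
print(tag.attrs)              # {'class': ['price']}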
Slide 10 www.edureka.in/python
Different Objects (Contd.)
NavigableString Object:
» Beautiful Soup uses the NavigableString class to contain bits of text within a tag.
Comment Object:
» A Comment is a special type of NavigableString.
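A short sketch showing both object types (the comment markup is an assumption added for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="price">New Rate </b>', "html.parser")
print(type(soup.b.string))           # <class 'bs4.element.NavigableString'>
print(soup.b.string)                 # New Rate

comment_soup = BeautifulSoup("<p><!-- promo ends soon --></p>", "html.parser")
print(type(comment_soup.p.string))   # <class 'bs4.element.Comment'>
print(comment_soup.p.string)         # promo ends soon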
Slide 11 www.edureka.in/python
All Supported Operations on TAG Object
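The operations table on this slide is an image; as a hedged substitute, the sketch below runs a few of the commonly used Tag operations (searching, navigation and attribute access) against the simple.html file created earlier:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("simple.html"), "html.parser")

first_p = soup.find("p")               # first matching tag
all_links = soup.find_all("a")         # list of every matching tag

print(first_p.get_text())              # text inside the tag
print(first_p.parent.name)             # navigate upwards: body
print([c.name for c in soup.body.children if c.name])  # direct child tags
print(all_links[0]["href"])            # attribute access on a found tag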
Slide 12 www.edureka.in/python
soup.prettify()
Used when the HTML doc looks jumbled and we want to see it structured, with each tag parsed onto its own line
It helps to visualize the HTML tags better, so parent, child and sibling tags can be spotted easily
Example HTML doc:
Slide 13 www.edureka.in/python
Example HTML Doc For Reference
Slide 14 www.edureka.in/python
Prettifying the Example
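The prettified document on these slides is shown as an image; a minimal sketch of what soup.prettify() produces (using a small inline document rather than the slide's original):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p class='price'>New Rate</p></body></html>",
                     "html.parser")
print(soup.prettify())
# <html>
#  <body>
#   <p class="price">
#    New Rate
#   </p>
#  </body>
# </html>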
Slide 15 www.edureka.in/python
Scraping IMDB Webpage
Let's scrape IMDb to find the top movies released during 2005 to 2014, sorted by number of user votes. We need to pull
the title, year, genres, runtime, rating and image source info for each movie
Step 1: Go to the base IMDb search URL: http://www.imdb.com/search/title
Step 2: We will use the full URL that the IMDb website builds when we enter our criteria:
http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=2005,2014
Slide 16 www.edureka.in/python
Target Page
Slide 17 www.edureka.in/python
Find the Required Fields in the Source
 Step 3:
» Right click on the webpage and choose "Inspect element" (in Chrome)
» Hover your mouse over the source pane at the bottom to highlight the corresponding location in the web page
» Example: see the title selected
Slide 18 www.edureka.in/python
Finding Other Fields
To find "genre" we will have to inspect further down the source
Since a movie can have multiple genres, we will have to read them in a loop (handled in the scraping sketch below)
Slide 19 www.edureka.in/python
Using the Full URL
Step 4:
» Access the URL
Direct URL passing approach:
Step 5: Pass the response to BeautifulSoup
» Build the main logic
Step 6: Format the output
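The code on these slides appears as screenshots; below is a minimal end-to-end sketch of Steps 4-6 using Requests and BeautifulSoup. The CSS class names are assumptions (IMDb's markup changes over time), so inspect the live page as in Step 3 and adjust them to whatever the current source actually uses:

import requests
from bs4 import BeautifulSoup

url = ("http://www.imdb.com/search/title"
       "?sort=num_votes,desc&start=1&title_type=feature&year=2005,2014")

# Step 4: access the URL
response = requests.get(url)

# Step 5: pass the downloaded HTML to BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Step 6: pull the required fields for each movie and format the output.
# "lister-item", "lister-item-year", "genre" and "runtime" are assumed class
# names -- verify them against the page source before relying on them.
for movie in soup.find_all("div", class_="lister-item"):
    title = movie.h3.a.get_text(strip=True)
    year = movie.h3.find("span", class_="lister-item-year")
    genre = movie.find("span", class_="genre")
    runtime = movie.find("span", class_="runtime")
    rating = movie.find("strong")
    image = movie.find("img")
    # A movie can have several genres, so split the comma-separated string
    genres = [g.strip() for g in genre.get_text().split(",")] if genre else []

    print("{} | {} | {} | {} | {} | {}".format(
        title,
        year.get_text(strip=True) if year else "N/A",
        ", ".join(genres) or "N/A",
        runtime.get_text(strip=True) if runtime else "N/A",
        rating.get_text(strip=True) if rating else "N/A",
        image["src"] if image else "N/A"))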
Slide 20 www.edureka.in/python
Formatting the Output - Sample Shown
Slide 21 www.edureka.co/python
PyDoop – Hadoop with Python
 The PyDoop package provides a Python API for Hadoop MapReduce and HDFS
 PyDoop has several advantages over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython
 One of the biggest advantages of PyDoop is its HDFS API, which allows you to connect to an HDFS installation, read and write files, and get information on files, directories and global file system properties
 PyDoop's MapReduce API allows you to solve many complex problems with minimal programming effort. Advanced MapReduce concepts such as 'Counters' and 'Record Readers' can be implemented in Python using PyDoop
Python can be used to write Hadoop MapReduce programs and applications that access the HDFS API for Hadoop via the PyDoop package
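A small sketch of the HDFS side of the API (the cluster, directory names and file names below are placeholders, and the exact function signatures should be checked against the PyDoop documentation for your version):

import pydoop.hdfs as hdfs

# List an HDFS directory and read a file straight from HDFS
print(hdfs.ls("/user/edureka"))
with hdfs.open("/user/edureka/input/sample.txt") as f:
    print(f.read())

# Copy files between the local file system and HDFS
hdfs.put("local_report.txt", "/user/edureka/reports/report.txt")
hdfs.get("/user/edureka/reports/report.txt", "local_copy.txt")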
Slide 22 www.edureka.co/python
Demo: Python NLTK on Hadoop
Leveraging the analytical power of Python on a big data set (MapReduce + NLTK)
Perform stop word removal using MapReduce (sketched below)
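The demo code itself is not shown on the slide; one way to sketch its core idea is a streaming-style mapper that drops NLTK's English stop words (file names, paths and the job wiring below are assumptions for illustration):

#!/usr/bin/env python
# mapper.py -- emit only the words that are not NLTK English stop words.
# Intended to run as the -mapper of a Hadoop Streaming job; a reducer can
# then sum the counts per word.
import sys
from nltk.corpus import stopwords   # requires: nltk.download('stopwords')

STOP_WORDS = set(stopwords.words("english"))

for line in sys.stdin:
    for word in line.strip().split():
        token = word.lower().strip(".,!?;:\"'()")
        if token and token not in STOP_WORDS:
            # key<TAB>value pairs, the format Hadoop Streaming expects
            print("{}\t1".format(token))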
Slide 23 www.edureka.co/python
Questions
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Slide 24 Course URL