Introduction to Scraping in Python

By:
    Mayank Jain (firesofmay@gmail.com)
    Gaurav Jain (grvmjain@gmail.com)

Code is available at
https://github.com/firesofmay/Null-Pune-Intro-to-Scraping-Talk-March-2012
Overview of the "Presentation"

    What is Scraping?

    So what is this HTTP?

    Tools of Trade

    User Agents

    Firebug

    Using BeautifulSoup and Regular Expressions

    Using Google Translator to post on Facebook in
    Hindi

    Shodan

    Robots.txt
What is Scraping?

    Web scraping/Web harvesting/Web data
    extraction is a computer software
    technique of extracting information from
    websites.
So what is this HTTP thing?

    If you go to this page -
    http://en.wikipedia.org/wiki/Python_%28programming_language%29


    To view the HTTP requests being made,
    we use a Firefox plugin called
    LiveHTTPHeaders
----------Request From Client to Server----------
GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://en.wikipedia.org/wiki/Python
Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O;
  mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore;
  mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow
----------End of Request From Client to Server----------
----------Response From Server to Client----------
HTTP/1.0 200 OK
Date: Mon, 10 Oct 2011 12:44:46 GMT
Server: Apache
X-Content-Type-Options: nosniff
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT
Content-Encoding: gzip
Content-Length: 47407
Content-Type: text/html; charset=UTF-8
Age: 10932
X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org
X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80
Connection: keep-alive
----------End of Response From Server to Client----------
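The captured exchange above is plain text: a first line followed by "Name: value" header lines. A minimal sketch of reading that shape programmatically, assuming Python 3 and only the standard library, might look like the following. The helper names parse_http_head and status_code are made up for this illustration, and the sketch ignores duplicate headers and folded lines.

```python
def parse_http_head(raw):
    """Split a raw HTTP message head into its first line and a header dict."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    first_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        # Split on the first colon only, so values such as host:port survive.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return first_line, headers


def status_code(first_line):
    """Return the numeric status from a response line such as 'HTTP/1.0 200 OK'."""
    return int(first_line.split()[1])


# A shortened copy of the response captured with LiveHTTPHeaders.
raw_response = """
HTTP/1.0 200 OK
Server: Apache
Content-Encoding: gzip
Content-Length: 47407
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
"""

first_line, headers = parse_http_head(raw_response)
print(status_code(first_line))
print(headers["content-type"])
```

In practice a library such as requests does this parsing for you; the sketch only shows that nothing more than text is travelling over the wire.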
Tools of Trade

    Linux is preferred (install commands are for
    Ubuntu)

    DreamPie IDE (for quick prototyping)

            $ sudo apt-get install dreampie

    Python 2.x (preferably 2.6+)

    pip installer for Python packages

            $ sudo apt-get install python-pip

    Python requests: HTTP for Humans

            $ pip install requests

    Python re library for regular expressions
    (built in)

    LiveHTTPHeaders Firefox plugin

            https://addons.mozilla.org/en-US/firefox/addon/live-http-headers/

    Firebug Firefox plugin

            https://addons.mozilla.org/en-US/firefox/addon/firebug/?src=search

    User Agent Switcher Firefox plugin

            https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/?src=search

    BeautifulSoup Python library

            http://www.crummy.com/software/BeautifulSoup/#Download
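Fetching HTML Page (fetch.py)
import requests

# Page to download (the URL must stay on a single line)
url = 'http://en.wikipedia.org/wiki/Python_%28programming_language%29'

# .content holds the raw response body as bytes
data = requests.get(url).content

# Write in binary mode; the file is closed automatically
with open("debug.html", 'wb') as f:
    f.write(data)


#To Run

    $ python fetch.py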
Why Does User Agent Matter?

    When a software agent operates in a
    network protocol, it often identifies itself,
    its application type, operating system,
    software vendor, or software revision, by
    submitting a characteristic identification
    string to its operating peer.

    In HTTP, SIP, and SMTP/NNTP protocols,
    this identification is transmitted in a
    header field User-Agent. Bots, such as
    Web crawlers, often also include a URL
    and/or e-mail address so that the
    Webmaster can contact the operator of
    the bot.
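To make the idea concrete, here is a small sketch of a scraper that identifies itself and gives the webmaster a way to reach its operator. It assumes Python 3 and uses urllib.request from the standard library rather than requests, purely so that it runs without extra packages. The bot name, contact URL and e-mail address are placeholders, and bot_user_agent is a helper invented for this example.

```python
import urllib.request


def bot_user_agent(name, version, contact_url=None, contact_email=None):
    """Build a User-Agent string that names the bot and how to reach its operator."""
    contacts = [c for c in (contact_url, contact_email) if c]
    ua = "%s/%s" % (name, version)
    if contacts:
        ua += " (+%s)" % "; ".join(contacts)
    return ua


# Placeholder identity: replace with your own bot name and contact details.
ua = bot_user_agent(
    "ExampleScraper", "0.1",
    contact_url="http://example.com/bot",
    contact_email="bot@example.com",
)

# The request is only built here, not sent, so no network access is needed.
req = urllib.request.Request("http://example.com/", headers={"User-Agent": ua})

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

With requests, the same header can be passed as requests.get(url, headers={"User-Agent": ua}).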
Demo of How Sites Behave
Differently With Different UAs - I

      https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/

      Visit the above site with the UA (User Agent)
      set to Firefox
Demo of How Sites Behave
Differently With Different UAs - I (continued)

      https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/

      Now visit the above site with the UA set to IE

      To switch your user agent, use the User Agent
      Switcher add-on.

      Notice the new banner asking you to
      install Firefox even though you are already using
      Firefox: the site decides based on the user agent
      you selected.
Demo of How Sites Behave
Differently With Different UAs - II

     https://developers.facebook.com/docs/reference/api/permissions/

     Now visit the above site with the UA set to IE

             Asked for Login? But I don't want to
             Login!!!

     Let's try a Google bot as UA

             Yayyy!!

     Let's try a blank UA

             Yayy Again! :D
Inspecting Elements with Firebug

    We want to fetch the given sale price
    (19.99)


    Go to this link - http://www.payless.com/store/product/detail.jsp?catId=cat10243&subCatId=cat10243&skuId=091151050&productId=68423&lotId=091151&category=


    Right-click on $19.99 > Inspect Element
    with Firebug
Demo Payless_Parser.py

    Run the code

    $ python Payless_Parser.py

    Price of this item is 19.99

    Modify the url variable to -
    http://www.payless.com/store/product/detail.jsp?catId=cat10088&subCatId=cat10243&skuId=094079050&productId=70984&lotId=094079&category=&catdisplayName=Womens

    Why does this still work? Try to understand.
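One plausible answer is that every product page is generated from the same template, so the price always sits inside the same kind of element. The sketch below illustrates that idea with the re module, assuming Python 3. The markup and the "sale-price" class name are invented for the example (the real page structure is not shown in these slides), and this is not the code from Payless_Parser.py.

```python
import re

# Hypothetical markup: the real Payless page structure is not shown in the slides.
SAMPLE_HTML = '<div class="product"><span class="sale-price">$19.99</span></div>'

# Look for the (assumed) sale-price element and capture the number after the "$".
PRICE_RE = re.compile(r'class="sale-price"[^>]*>\s*\$(\d+\.\d{2})')


def extract_price(html):
    """Return the first sale price found in html as a string, or None."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


print("Price of this item is %s" % extract_price(SAMPLE_HTML))
```

Because the pattern is tied to the element and not to one product, the same function works on any page that shares the template, and returns None when the element is missing.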
How about Extracting all the
Permissions from this page?
Demo
Extract_Facebook_Permissions.py

    URL to extract from:
    https://developers.facebook.com/docs/reference/api/permissions/

    How to run the code, and the expected output:

    $ python Extract_Facebook_Permissions.py

    ['user_about_me', 'friends_about_me', 'about', 'user_activities', 'friends_activities',
    'activities', 'user_birthday', 'friends_birthday', 'birthday', 'user_checkins',
    'friends_checkins', 'user_education_history', 'friends_education_history',
    'education', 'user_events', 'friends_events', 'events', 'user_groups',
    'friends_groups', 'groups', 'user_hometown', 'friends_hometown', 'hometown',
    'user_interests', 'friends_interests', 'interests', 'user_likes', 'friends_likes', 'likes',
    'user_location', 'friends_location', 'location', 'user_notes', 'friends_notes', 'notes',
    'user_photos', 'friends_photos', 'user_questions', 'friends_questions',
    'user_relationships', 'friends_relationships', 'user_relationship_details',
    'friends_relationship_details', 'user_religion_politics', 'friends_religion_politics',
    'user_status', 'friends_status', 'user_videos', 'friends_videos', 'user_website',
    'friends_website', 'user_work_history', 'friends_work_history', 'work', 'email',
    'email', 'read_friendlists', 'read_insights', 'read_mailbox', 'read_requests',
    'read_stream', 'xmpp_login', 'ads_management', 'create_event',
    'manage_friendlists', 'manage_notifications', 'user_online_presence',
    'friends_online_presence', 'publish_checkins', 'publish_stream', 'publish_stream',
    'rsvp_event']
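The output above contains a few repeats ('email' and 'publish_stream' each appear twice), which suggests every match on the page is collected as-is. The following sketch shows one way such an extraction could be done and then de-duplicated. It assumes Python 3 and uses html.parser from the standard library as a stand-in for BeautifulSoup. The sample markup, and the assumption that permission names sit inside <code> elements, are invented for illustration and are not taken from Extract_Facebook_Permissions.py.

```python
from html.parser import HTMLParser


class CodeTextCollector(HTMLParser):
    """Collect the text found inside <code> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <code> elements we are currently inside
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.names.append(text)


def extract_permissions(html):
    """Return every <code> text in document order, repeats included."""
    parser = CodeTextCollector()
    parser.feed(html)
    parser.close()
    return parser.names


def unique_in_order(items):
    """Drop repeats while keeping the first occurrence of each item."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Hypothetical markup standing in for the permissions page.
SAMPLE_HTML = """
<table>
  <tr><td><code>user_about_me</code></td><td><code>friends_about_me</code></td></tr>
  <tr><td><code>email</code></td><td>N/A</td></tr>
  <tr><td><code>email</code></td><td>N/A</td></tr>
</table>
"""

permissions = extract_permissions(SAMPLE_HTML)
print(permissions)
print(unique_in_order(permissions))
```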
How about writing our own version
of the Google Translate API?

    Important: Google Translate API v2 is
    now available as a paid service only,
    and the number of requests your
    application can make per day is limited. As
    of December 1, 2011, Google Translate
    API v1 is no longer available; it was
    officially deprecated on May 26, 2011.
    These decisions were made due to the
    substantial economic burden caused by
    extensive abuse. For website translations,
    we encourage you to use the Google
    Website Translator gadget.
Let's understand how it works
in the background.

    Use LiveHTTPHeaders to understand this

    Important parameters that are passed

    sl = en (Source Language = English)

    tl = hi (Target Language = Hindi)

    text = hello world


    http://translate.google.com/?sl=en&tl=hi&text=hello+world#
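The three parameters above are ordinary query-string fields, so the URL can be assembled with the standard library instead of by hand. This is a minimal sketch assuming Python 3; build_translate_url is a name made up for the example, and the page it points at may no longer respond the way it did when the talk was given.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_translate_url(text, source="en", target="hi",
                        base="http://translate.google.com/"):
    """Build the query URL from source language, target language and text."""
    # A list of pairs keeps the parameters in the order shown on the slide.
    query = urlencode([("sl", source), ("tl", target), ("text", text)])
    return base + "?" + query


url = build_translate_url("hello world")
print(url)

# Reading the parameters back shows the space was encoded as "+".
print(parse_qs(urlsplit(url).query))
```

urlencode also takes care of escaping characters that are not safe in a URL, which matters as soon as the text contains punctuation or non-ASCII letters.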
How about we post this
converted text to our Facebook
wall? :)

    fbconsole

           Facebook Python API

           Simplifies things

           Very easy to install

           https://github.com/facebook/fbconsole

           $ sudo pip install fbconsole


    In this script we'll use the permissions we
    extracted earlier :)
Demo
Google_Translator_With_FB_API.py
$ python Google_Translator_With_FB_API.py
Language to Convert from : en
Language to Convert to : hi
Text to Convert : wow
Converted Text : वाह


    Check your Facebook wall :)
Translated Text Posted on my Facebook Wall
What is Shodan?

    Web search engines, such as Google and
    Bing, are great for finding websites. But
    what if you're interested in finding
    computers running a certain piece of
    software (such as Apache)? Or if you want
    to know which version of Microsoft IIS is
    the most popular? Or you want to see how
    many anonymous FTP servers there are?
    Maybe a new vulnerability came out and
    you want to see how many hosts it could
    infect? Traditional web search engines
    don't let you answer those questions.
What is Shodan?

    SHODAN is a search engine that lets you
    find specific computers (routers, servers,
    etc.) using a variety of filters.

    In effect, it is a public port-scan directory, or a
    search engine for service banners.
Scraping Shodan Data Preview

    http://www.shodanhq.com/

    A Python API is available -
    http://docs.shodanhq.com/

    But you have to pay for the advanced
    features. :-/

    By default, the following search filters for
    Shodan are disabled: net, country, before,
    after. To unlock those filters buy the
    Unlocked API Add-On. No subscription
    required!

    http://www.shodanhq.com/data/addons
Demo shodanparser_New.py
$ python shodanparser_New.py
Query : country:IN HTTP/1.0 200 OK
3
98.146.42.77       United States
178.33.70.221      France
96.217.60.25       United States
115.133.223.66     Malaysia
218.250.60.122     Hong Kong
180.177.12.132     Taiwan
178.63.104.140     Germany
76.85.55.178       United States
67.159.200.99      United States
75.188.142.2       United States
robots.txt

    The Robot Exclusion Standard, also
    known as the Robots Exclusion Protocol
    or robots.txt protocol, is a convention to
    prevent cooperating web crawlers and
    other web robots from accessing all or part
    of a website which is otherwise publicly
    viewable. Robots are often used by
    search engines to categorize and archive
    web sites, or by webmasters to proofread
    source code. The standard is different
    from, but can be used in conjunction with,
    Sitemaps, a robot inclusion standard for
    websites.
robots.txt

    Despite the use of the terms "allow" and
    "disallow", the protocol is purely advisory.
    It relies on the cooperation of the web
    robot, so that marking an area of a site out
    of bounds with robots.txt does not
    guarantee exclusion of all web robots. In
    particular, malicious web robots are
    unlikely to honor robots.txt
facebook.com/robots.txt
User-agent: Googlebot
Disallow: /ac.php
Disallow: /ae.php
Disallow: /album.php
Disallow: /ap.php
Disallow: /autologin.php
Disallow: /checkpoint/
…............
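A cooperating scraper can check rules like these before fetching a page. The sketch below, assuming Python 3, feeds only the lines shown in the excerpt above to urllib.robotparser from the standard library; the rest of the real file is elided on the slide and is not reproduced here. The paths being tested are examples chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the rules visible in the excerpt; the real file continues beyond them.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /ac.php
Disallow: /ae.php
Disallow: /album.php
Disallow: /ap.php
Disallow: /autologin.php
Disallow: /checkpoint/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# Listed exactly in the excerpt, so a cooperating Googlebot must skip it.
print(rules.can_fetch("Googlebot", "/album.php"))

# Rules are prefix matches, so anything under /checkpoint/ is covered too.
print(rules.can_fetch("Googlebot", "/checkpoint/start"))

# Not covered by any rule in this excerpt.
print(rules.can_fetch("Googlebot", "/some/other/page"))
```

In a live scraper, RobotFileParser can download the file itself through set_url() followed by read(). Either way the check is voluntary: nothing stops a client that never asks.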
Conclusion

    Scraping has many use cases.

    It is most useful for writing your own API when a
    website does not provide one or limits
    the one it has.

    It is very useful for combining existing APIs with
    websites that do not provide APIs.

    Be careful how hard you hit a server.

    Follow robots.txt or ask for permission.
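One simple way to avoid hammering a server is to enforce a minimum gap between requests. The sketch below assumes Python 3; the Throttle class and the two-second interval are illustrative choices, not something taken from the talk's code. The clock and sleep functions can be swapped out, which makes the behaviour easy to check without actually waiting.

```python
import time


class Throttle:
    """Make sure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock so the example finishes instantly.
fake_now = [0.0]
slept = []


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(2.0, clock=fake_clock, sleep=fake_sleep)
throttle.wait()          # first request: nothing to wait for
fake_now[0] += 0.5       # pretend the request itself took half a second
throttle.wait()          # second request: pauses for the remaining 1.5 seconds
print(slept)
```

In a real scraper you would create Throttle(2.0) with the default clock and call throttle.wait() just before each request.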
References

    Advanced Scraping Video -

           http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data

    Google Python Class Intermediate

           http://code.google.com/edu/languages/google-python-class/set-up.html

           http://www.youtube.com/watch?v=tKTZoB2Vjuk&feature=plcp&context=C42cb319VDvjVQa1PpcFMzwqYlYKVxDoyEu1ISDDTjmz370vY8Xg4%3D

    Python Absolute Beginner

           http://www.youtube.com/watch?v=4Mf0h3HphEA&feature=channel_video_title

    Siddhant Sanyam's PyCon 11 Slides

           https://github.com/siddhant3s/PyCon11-Talk/tree/master/talk1_webscrapping

    http://firesofmay.blogspot.in/2011/10/http-web-scrapping-and-python-part-1.html
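from BeautifulSoup import BeautifulSoup
import requests

# Ask Google Translate to render the closing message in Hindi
url = 'http://translate.google.com/?sl=en&tl=hi&text=Thank+you+Any+Questions?'

# Parse the page, converting HTML entities to characters
soup = BeautifulSoup(requests.get(url).content,
                     convertEntities=BeautifulSoup.HTML_ENTITIES)

# The translation sits in span#result_box inside div#gt-res-content
result = soup.find('div', {'id': 'gt-res-content'}).find('span', {'id': 'result_box'})
print result.text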
Executing...
शुक्रिया

कोई प्रश्न?

More Related Content

What's hot (15)

PPT
Justmeans power point
justmeanscsr
 
PPT
Justmeans power point
justmeanscsr
 
PPT
Php intro
Rajesh Jha
 
PPTX
Web backends development using Python
Ayun Park
 
PDF
The Loop
Gary Barber
 
PDF
Composer The Right Way - 010PHP
Rafael Dohms
 
PDF
Composer the right way - SunshinePHP
Rafael Dohms
 
PPTX
Inside a Digital Collection: Historic Clothing in Omeka
Arden Kirkland
 
PPT
ReST-ful Resource Management
Joe Davis
 
PDF
Composer the Right Way - PHPBNL16
Rafael Dohms
 
PDF
Various Ways of Using WordPress
Nick La
 
PDF
Introduction to php web programming - get and post
baabtra.com - No. 1 supplier of quality freshers
 
PPT
Introduction to Google API - Focusky
Focusky Presentation
 
PDF
Building a Dynamic Website Using Django
Nathan Eror
 
PPT
Short Intro to PHP and MySQL
Jussi Pohjolainen
 
Justmeans power point
justmeanscsr
 
Justmeans power point
justmeanscsr
 
Php intro
Rajesh Jha
 
Web backends development using Python
Ayun Park
 
The Loop
Gary Barber
 
Composer The Right Way - 010PHP
Rafael Dohms
 
Composer the right way - SunshinePHP
Rafael Dohms
 
Inside a Digital Collection: Historic Clothing in Omeka
Arden Kirkland
 
ReST-ful Resource Management
Joe Davis
 
Composer the Right Way - PHPBNL16
Rafael Dohms
 
Various Ways of Using WordPress
Nick La
 
Introduction to php web programming - get and post
baabtra.com - No. 1 supplier of quality freshers
 
Introduction to Google API - Focusky
Focusky Presentation
 
Building a Dynamic Website Using Django
Nathan Eror
 
Short Intro to PHP and MySQL
Jussi Pohjolainen
 

Viewers also liked (7)

PDF
Pydata-Python tools for webscraping
Jose Manuel Ortega Candel
 
PPTX
Web Scraping With Python
Robert Dempsey
 
PPT
Premier pas de web scrapping avec R
Cdiscount
 
PDF
Introduction à la cartographie avec R
Cdiscount
 
PDF
Rapport PFE : Développement D'une application de gestion des cartes de fidéli...
Riadh K.
 
PPTX
Le b.a.-ba du web scraping
Alexandre Gindre
 
PDF
Rapport Projet De Fin D'étude Développent d'une application web avec Symfony2
Sofien Benrhouma
 
Pydata-Python tools for webscraping
Jose Manuel Ortega Candel
 
Web Scraping With Python
Robert Dempsey
 
Premier pas de web scrapping avec R
Cdiscount
 
Introduction à la cartographie avec R
Cdiscount
 
Rapport PFE : Développement D'une application de gestion des cartes de fidéli...
Riadh K.
 
Le b.a.-ba du web scraping
Alexandre Gindre
 
Rapport Projet De Fin D'étude Développent d'une application web avec Symfony2
Sofien Benrhouma
 
Ad

Similar to Introduction to python scrapping (20)

PDF
An Introduction to Tornado
Gavin Roy
 
PDF
Crawler
hackstuff
 
PDF
Web Development with Python and Django
Michael Pirnat
 
PDF
Python Load Testing - Pygotham 2012
Dan Kuebrich
 
PDF
Web Scrapping with Python
Miguel Miranda de Mattos
 
PPTX
Controlling the browser through python and selenium
Patrick Viafore
 
PDF
release_python_day3_slides_201606.pdf
Paul Yang
 
PPTX
Browser
Shweta Oza
 
PPTX
Module-5 Ppt.pptx
ssuser44f56b1
 
PDF
Python Crawler
Cheng-Yi Yu
 
PDF
Web Security - Introduction v.1.3
Oles Seheda
 
PDF
Web Security - Introduction
SQALab
 
PDF
Python Web Interaction
Robert Sanderson
 
PPTX
Django course
Nagi Annapureddy
 
PDF
Python and the Web
pycontw
 
POT
Web Techology and google code sh (2014_10_10 08_57_30 utc)
Suyash Gupta
 
PDF
Wfuzz para Penetration Testers
Source Conference
 
PDF
python full stack course in hyderabad...
sowmyavibhin
 
PPTX
python full stack course in hyderabad...
sowmyavibhin
 
An Introduction to Tornado
Gavin Roy
 
Crawler
hackstuff
 
Web Development with Python and Django
Michael Pirnat
 
Python Load Testing - Pygotham 2012
Dan Kuebrich
 
Web Scrapping with Python
Miguel Miranda de Mattos
 
Controlling the browser through python and selenium
Patrick Viafore
 
release_python_day3_slides_201606.pdf
Paul Yang
 
Browser
Shweta Oza
 
Module-5 Ppt.pptx
ssuser44f56b1
 
Python Crawler
Cheng-Yi Yu
 
Web Security - Introduction v.1.3
Oles Seheda
 
Web Security - Introduction
SQALab
 
Python Web Interaction
Robert Sanderson
 
Django course
Nagi Annapureddy
 
Python and the Web
pycontw
 
Web Techology and google code sh (2014_10_10 08_57_30 utc)
Suyash Gupta
 
Wfuzz para Penetration Testers
Source Conference
 
python full stack course in hyderabad...
sowmyavibhin
 
python full stack course in hyderabad...
sowmyavibhin
 
Ad

More from n|u - The Open Security Community (20)

PDF
Hardware security testing 101 (Null - Delhi Chapter)
n|u - The Open Security Community
 
PPTX
SSRF exploit the trust relationship
n|u - The Open Security Community
 
PDF
Metasploit primary
n|u - The Open Security Community
 
PDF
Api security-testing
n|u - The Open Security Community
 
PDF
Introduction to TLS 1.3
n|u - The Open Security Community
 
PDF
Gibson 101 -quick_introduction_to_hacking_mainframes_in_2020_null_infosec_gir...
n|u - The Open Security Community
 
PDF
Talking About SSRF,CRLF
n|u - The Open Security Community
 
PPTX
Building active directory lab for red teaming
n|u - The Open Security Community
 
PPTX
Owning a company through their logs
n|u - The Open Security Community
 
PPTX
Introduction to shodan
n|u - The Open Security Community
 
PDF
Detecting persistence in windows
n|u - The Open Security Community
 
PPTX
Frida - Objection Tool Usage
n|u - The Open Security Community
 
PDF
OSQuery - Monitoring System Process
n|u - The Open Security Community
 
PDF
DevSecOps Jenkins Pipeline -Security
n|u - The Open Security Community
 
PDF
Extensible markup language attacks
n|u - The Open Security Community
 
PPTX
Linux for hackers
n|u - The Open Security Community
 
PDF
Android Pentesting
n|u - The Open Security Community
 
Hardware security testing 101 (Null - Delhi Chapter)
n|u - The Open Security Community
 
SSRF exploit the trust relationship
n|u - The Open Security Community
 
Api security-testing
n|u - The Open Security Community
 
Introduction to TLS 1.3
n|u - The Open Security Community
 
Gibson 101 -quick_introduction_to_hacking_mainframes_in_2020_null_infosec_gir...
n|u - The Open Security Community
 
Talking About SSRF,CRLF
n|u - The Open Security Community
 
Building active directory lab for red teaming
n|u - The Open Security Community
 
Owning a company through their logs
n|u - The Open Security Community
 
Introduction to shodan
n|u - The Open Security Community
 
Detecting persistence in windows
n|u - The Open Security Community
 
Frida - Objection Tool Usage
n|u - The Open Security Community
 
OSQuery - Monitoring System Process
n|u - The Open Security Community
 
DevSecOps Jenkins Pipeline -Security
n|u - The Open Security Community
 
Extensible markup language attacks
n|u - The Open Security Community
 

Recently uploaded (20)

PDF
I3PM Case study smart parking 2025 with uptoIP® and ABP
MIPLM
 
PPTX
Exploring Linear and Angular Quantities and Ergonomic Design.pptx
AngeliqueTolentinoDe
 
PDF
WATERSHED MANAGEMENT CASE STUDIES - ULUGURU MOUNTAINS AND ARVARI RIVERpdf
Ar.Asna
 
PDF
CAD25 Gbadago and Fafa Presentation Revised-Aston Business School, UK.pdf
Kweku Zurek
 
PPTX
How to Configure Refusal of Applicants in Odoo 18 Recruitment
Celine George
 
PDF
Quiz Night Live May 2025 - Intra Pragya Online General Quiz
Pragya - UEM Kolkata Quiz Club
 
PPTX
Natural Language processing using nltk.pptx
Ramakrishna Reddy Bijjam
 
PDF
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
PPTX
grade 8 week 2 ict.pptx. matatag grade 7
VanessaTaberlo
 
PDF
Wikinomics How Mass Collaboration Changes Everything Don Tapscott
wcsqyzf5909
 
PDF
The Power of Compound Interest (Stanford Initiative for Financial Decision-Ma...
Stanford IFDM
 
PDF
COM and NET Component Services 1st Edition Juval Löwy
kboqcyuw976
 
PPTX
Life and Career Skills Lesson 2.pptxProtective and Risk Factors of Late Adole...
ryangabrielcatalon40
 
PPTX
ENGLISH 8 REVISED K-12 CURRICULUM QUARTER 1 WEEK 1
LeomarrYsraelArzadon
 
PDF
Genomics Proteomics and Vaccines 1st Edition Guido Grandi (Editor)
kboqcyuw976
 
PPTX
Connecting Linear and Angular Quantities in Human Movement.pptx
AngeliqueTolentinoDe
 
PPTX
Marketing Management PPT Unit 1 and Unit 2.pptx
Sri Ramakrishna College of Arts and science
 
PDF
AI-assisted IP-Design lecture from the MIPLM 2025
MIPLM
 
PPTX
Building Powerful Agentic AI with Google ADK, MCP, RAG, and Ollama.pptx
Tamanna36
 
PPTX
How to Configure Taxes in Company Currency in Odoo 18 Accounting
Celine George
 
I3PM Case study smart parking 2025 with uptoIP® and ABP
MIPLM
 
Exploring Linear and Angular Quantities and Ergonomic Design.pptx
AngeliqueTolentinoDe
 
WATERSHED MANAGEMENT CASE STUDIES - ULUGURU MOUNTAINS AND ARVARI RIVERpdf
Ar.Asna
 
CAD25 Gbadago and Fafa Presentation Revised-Aston Business School, UK.pdf
Kweku Zurek
 
How to Configure Refusal of Applicants in Odoo 18 Recruitment
Celine George
 
Quiz Night Live May 2025 - Intra Pragya Online General Quiz
Pragya - UEM Kolkata Quiz Club
 
Natural Language processing using nltk.pptx
Ramakrishna Reddy Bijjam
 
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
grade 8 week 2 ict.pptx. matatag grade 7
VanessaTaberlo
 
Wikinomics How Mass Collaboration Changes Everything Don Tapscott
wcsqyzf5909
 
The Power of Compound Interest (Stanford Initiative for Financial Decision-Ma...
Stanford IFDM
 
COM and NET Component Services 1st Edition Juval Löwy
kboqcyuw976
 
Life and Career Skills Lesson 2.pptxProtective and Risk Factors of Late Adole...
ryangabrielcatalon40
 
ENGLISH 8 REVISED K-12 CURRICULUM QUARTER 1 WEEK 1
LeomarrYsraelArzadon
 
Genomics Proteomics and Vaccines 1st Edition Guido Grandi (Editor)
kboqcyuw976
 
Connecting Linear and Angular Quantities in Human Movement.pptx
AngeliqueTolentinoDe
 
Marketing Management PPT Unit 1 and Unit 2.pptx
Sri Ramakrishna College of Arts and science
 
AI-assisted IP-Design lecture from the MIPLM 2025
MIPLM
 
Building Powerful Agentic AI with Google ADK, MCP, RAG, and Ollama.pptx
Tamanna36
 
How to Configure Taxes in Company Currency in Odoo 18 Accounting
Celine George
 

Introduction to python scrapping

  • 1. Introduction to Scraping in Python By :-  Mayank Jain ([email protected])  Gaurav Jain ([email protected]) Code is available at https://github.com/firesofmay/Null-Pune- Intro-to-Scraping-Talk-March-2012
  • 2. Overview of the ”Presentation”  What is Scraping?  So what is this HTTP?  Tools of Trade  User Agents  Firebug  Using BeautfulSoup and Regular Expressions  Using Google Translator to post on Facebook in hindi  Shodan  Robots.txt
  • 3. What is Scraping?  Web scraping/Web harvesting/Web data extraction is a computer software technique of extracting information from websites.
  • 4. So what is this HTTP thing?  If you goto this page - http://en.wikipedia.org/wiki/Python_%28programming_language%29  To view the HTTP Requests being made we use a firefox Pluging called as LiveHTTPHeaders
  • 5. ----------Request From Client to Server---------- GET /wiki/Python_(programming_language) HTTP/1.1 Host: en.wikipedia.org User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-us,en;q=0.5 Accept-Encoding: gzip, deflate Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Connection: keep-alive Referer: http://en.wikipedia.org/wiki/Python Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O; mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore; mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow ----------End of Request From Client to Server----------
  • 6. ----------Response From Server to Client----------  HTTP/1.0 200 OK  Date: Mon, 10 Oct 2011 12:44:46 GMT  Server: Apache  X-Content-Type-Options: nosniff  Cache-Control: private, s-maxage=0, max-age=0, must-revalidate  Content-Language: en  Vary: Accept-Encoding,Cookie  Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT  Content-Encoding: gzip  Content-Length: 47407  Content-Type: text/html; charset=UTF-8  Age: 10932  X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org  X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80  Connection: keep-alive  ----------End of Response From Server to Client----------
  • 7. Tools of Trade  Linux OS is prefered (Installations Command for Ubuntu Distro)  Dreampie IDE (For Quick Prototyping)  $ sudo apt-get install dreampie  Python 2.x (Preferably 2.6+)  pip installter for python packages  $ sudo apt-get install python-pip  Python requests: HTTP for Humans  $ pip install requests  Python re Library for regular Expressions (Inbuilt)
  • 8. LiveHTTPHeader Firefox Plugin  https://addons.mozilla.org/en-US/firefox/ addon/live-http-headers/  Firebug Firefox Plugin  https://addons.mozilla.org/en-US/firefox/ addon/firebug/?src=search  User Agent Switcher Firefox Plugin  https://addons.mozilla.org/en-US/firefox/ addon/user-agent-switcher/?src=search  BeautifulSoup Python Library  http://www.crummy.com/software/Beautif ulSoup/#Download
  • 9. Fetching HTML Page (fetch.py) import requests url = 'http://en.wikipedia.org/wiki/Python_ %28programming_language%29' data = requests.get(url).content f = open("debug.html", 'w') f.write(data) f.close() #To Run  $ python fetch.py
  • 10. Why Does User Agent Matter?  When software agent operates in a network protocol, it often identifies itself, its application type, operating system, software vendor, or software revision, by submitting a characteristic identification string to its operating peer.  In HTTP, SIP, and SMTP/NNTP protocols, this identification is transmitted in a header field User-Agent. Bots, such as Web crawlers, often also include a URL and/or e-mail address so that the Webmaster can contact the operator of the bot.
  • 11. Demo of How Sites Behave Differently With Different UAs - I  https://addons.mozilla.org/en- US/firefox/addon/user-agent-switcher/  Visit the above site with UA (User Agent) as firefox
  • 13. Demo of How Sites Behave Differently With Different UAs - I  https://addons.mozilla.org/en- US/firefox/addon/user-agent-switcher/  Now visit the above site with UA as IE  To switch your User Agent Use User Agent Switcher Addon.  Notice the new banner, asking you to install firefox even though you are using firefox (based on your user agent selected).
  • 15. Demo of How Sites Behave Differently With Different UAs - II  https://developers.facebook.com/docs/refe rence/api/permissions/  Now visit the above site with UA as IE  Asked for Login? But I don't want to Login!!!  Let's try a Google bot as UA  Yayyy!!  Let's try a blank UA  Yayy Again! :D
  • 17. Inspecting Elements with Firebug  We want to fetch the Given Sale Price (19.99)  Goto this link - http://www.payless.com/store/product/detail.jsp? catId=cat10243&subCatId=cat10243&skuId=091151050&productId=68423&lotId=091 151&category=  Right Click on $19.99 > Inspect Element with firebug
  • 19. Demo Payless_Parser.py  Run the code  $ python Payless_Parser.py  Price of this item is 19.99  Modifiy The url variable to - http://www.payless.com/store/product/deta il.jsp? catId=cat10088&subCatId=cat10243&skuI d=094079050&productId=70984&lotId=09 4079&category=&catdisplayName=Wome ns Why does this work? Try to understand.
  • 20. How about Extracting all the Permissions from this page?
  • 21. Demo Extract_Facebook_Permission s.py  Url to extract from : https://developers.facebook.com/docs/refe rence/api/permissions/  Check the next slide for Expected output and how to run the code
  • 22. $ python Extract_Facebook_Permissions.py  ['user_about_me', 'friends_about_me', 'about', 'user_activities', 'friends_activities', 'activities', 'user_birthday', 'friends_birthday', 'birthday', 'user_checkins', 'friends_checkins', 'user_education_history', 'friends_education_history', 'education', 'user_events', 'friends_events', 'events', 'user_groups', 'friends_groups', 'groups', 'user_hometown', 'friends_hometown', 'hometown', 'user_interests', 'friends_interests', 'interests', 'user_likes', 'friends_likes', 'likes', 'user_location', 'friends_location', 'location', 'user_notes', 'friends_notes', 'notes', 'user_photos', 'friends_photos', 'user_questions', 'friends_questions', 'user_relationships', 'friends_relationships', 'user_relationship_details', 'friends_relationship_details', 'user_religion_politics', 'friends_religion_politics', 'user_status', 'friends_status', 'user_videos', 'friends_videos', 'user_website', 'friends_website', 'user_work_history', 'friends_work_history', 'work', 'email', 'email', 'read_friendlists', 'read_insights', 'read_mailbox', 'read_requests', 'read_stream', 'xmpp_login', 'ads_management', 'create_event', 'manage_friendlists', 'manage_notifications', 'user_online_presence', 'friends_online_presence', 'publish_checkins', 'publish_stream', 'publish_stream', 'rsvp_event']
  • 23. How about writing our version of Google Translate API?  Important: Google Translate API v2 is now available as a paid service only, and the number of requests your application can make per day is limited. As of December 1, 2011, Google Translate API v1 is no longer available; it was officially deprecated on May 26, 2011. These decisions were made due to the substantial economic burden caused by extensive abuse. For website translations, we encourage you to use the Google Website Translator gadget.
  • 24. Let's understand how it works in background.  Use LiveHTTPHeaders To Understand this  Important Parameters that are passed  sl = en (Source Language = English)  tl = hi (Target Language = Hindi)  text = hello world  http://translate.google.com/? sl=en&tl=hi&text=hello+world#
  • 25. How about we post this converted text to our Facebook wall? :)  fbconsole  Facebook Python API  Simplifies things  Very easy to install  https://github.com/facebook/fbconsole  $ sudo pip install fbconsole  We'll use the permissions we extracted in this script :)
  • 26. Demo Google_Translator_With_FB_API.py  $ python Google_Translator_With_FB_API.py  Language to Convert from : en  Language to Convert to : hi  Text to Convert : wow  Converted Text : वाह  Check your Facebook wall :)
  • 27. Translated Text Posted on my Facebook Wall
  • 28. What is Shodan?  Web search engines, such as Google and Bing, are great for finding websites. But what if you're interested in finding computers running a certain piece of software (such as Apache)? Or if you want to know which version of Microsoft IIS is the most popular? Or you want to see how many anonymous FTP servers there are? Maybe a new vulnerability came out and you want to see how many hosts it could infect? Traditional web search engines don't let you answer those questions.
  • 29. What is Shodan?  SHODAN is a search engine that lets you find specific computers (routers, servers, etc.) using a variety of filters.  Public port scan directory or a search engine of banners.
  • 30. Scraping Shodan Data Preview  http://www.shodanhq.com/  A Python API is available - http://docs.shodanhq.com/  But you have to pay for the advanced features. :-/  By default, the following search filters for Shodan are disabled: net, country, before, after. To unlock those filters buy the Unlocked API Add-On. No subscription required!  http://www.shodanhq.com/data/addons
  • 31. Demo shodanparser_New.py  $ python shodanparser_New.py  Query : country:IN  HTTP/1.0 200 OK  3  98.146.42.77 United States  178.33.70.221 France  96.217.60.25 United States  115.133.223.66 Malaysia  218.250.60.122 Hong Kong  180.177.12.132 Taiwan  178.63.104.140 Germany  76.85.55.178 United States  67.159.200.99 United States  75.188.142.2 United States
  • 32. robots.txt  The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
  • 33. robots.txt  Despite the use of the terms "allow" and "disallow", the protocol is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee exclusion of all web robots. In particular, malicious web robots are unlikely to honor robots.txt
  • 34. facebook.com/robots.txt User-agent: Googlebot Disallow: /ac.php Disallow: /ae.php Disallow: /album.php Disallow: /ap.php Disallow: /autologin.php Disallow: /checkpoint/ …............
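A scraper can check rules like these before fetching a page. The sketch below assumes Python 3 and the standard library's `urllib.robotparser`; it parses only the truncated excerpt shown on the slide rather than downloading the live file, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the slide's excerpt of facebook.com/robots.txt.
# The slide truncates the file, so this is NOT the complete rule set.
ROBOTS_EXCERPT = """\
User-agent: Googlebot
Disallow: /ac.php
Disallow: /ae.php
Disallow: /album.php
Disallow: /ap.php
Disallow: /autologin.php
Disallow: /checkpoint/
"""


def is_allowed(user_agent, url, robots_text=ROBOTS_EXCERPT):
    """Return True if robots_text permits user_agent to fetch url.

    is_allowed is a hypothetical helper used for illustration.
    """
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file that is already in memory.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/album.php", "/checkpoint/start", "/help"):
        url = "http://www.facebook.com" + path
        print(path, is_allowed("Googlebot", url))
```

Because the excerpt only contains a Googlebot section, any other user agent appears unrestricted here; that reflects the truncated excerpt, not the real file. Against a live site, the parser's `set_url()` and `read()` methods fetch the current rules instead.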
  • 35. Conclusion  Scraping has many use cases.  Most useful for writing your own API if the website does not provide one or its API has limitations.  Very useful for combining existing APIs with websites that do not provide APIs.  Be careful about how hard you hit a server.  Follow robots.txt or ask for permission.
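One simple way to avoid hitting a server too hard is to enforce a minimum delay between requests. This is a rough sketch assuming Python 3; the `Throttle` class and its parameters are hypothetical names introduced for illustration, and the right delay depends on the site being scraped.

```python
import time


class Throttle:
    """Enforce a minimum interval, in seconds, between consecutive requests.

    Throttle is a hypothetical helper used for illustration. The clock and
    sleep functions are injectable so the behaviour can be tested without
    actually waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)
    for page in range(3):
        throttle.wait()  # call this right before each HTTP request
        print("fetching page", page)
```

Calling `wait()` immediately before every request keeps the scraper at or below one request per interval, no matter how quickly the rest of the loop runs.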
  • 36. References  Advanced Scraping Video -  http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data  Google Python Class Intermediate  http://code.google.com/edu/languages/google-python-class/set-up.html  http://www.youtube.com/watch?v=tKTZoB2Vjuk&feature=plcp&context=C42cb319VDvjVQa1PpcFMzwqYlYKVxDoyEu1ISDDTjmz370vY8Xg4%3D
  • 37. References  Python Absolute Beginner  http://www.youtube.com/watch?v=4Mf0h3HphEA&feature=channel_video_title  Siddhant Sanyam's PyCon 11 Slides  https://github.com/siddhant3s/PyCon11-Talk/tree/master/talk1_webscrapping
  • 38. References  http://firesofmay.blogspot.in/2011/10/http-web-scrapping-and-python-part-1.html
  • 39.
      # Python 2 with BeautifulSoup 3 and requests
      from BeautifulSoup import BeautifulSoup
      import requests
      url = 'http://translate.google.com/?sl=en&tl=hi&text=Thank+you+Any+Questions?'
      # Decode HTML entities so the translated text prints as readable Unicode
      soup = BeautifulSoup(requests.get(url).content,
                           convertEntities=BeautifulSoup.HTML_ENTITIES)
      # The translation sits in the span with id "result_box"
      print soup.find('div', {'id': 'gt-res-content'}).find('span', {'id': 'result_box'}).text