Scraping recalcitrant web sites with Python & Selenium
Roger Barnes
SyPy July 2012
Some sites suck
Some sites suck - "for your own good"




For security reasons, each button is
an image, dynamically generated by
a hash wrapped in a mess of
javascript, randomly placed
...but they work in a web browser!




  Let's use the web browser to scrape them
Enter Selenium



      Selenium automates browsers

                 That's it
Selenium can...
●   navigate (windows, frames, links)
●   find elements and parse attributes
●   interact and trigger events (click, type, ...)
●   capture screenshots
●   run javascript
●   let the browser take care of the hard stuff
    (cookies, javascript, sessions, profiles,
    DOM)

Comes with various components and bindings
                         ... including python
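A minimal sketch of several of these capabilities with the Python bindings (navigating, finding elements, reading attributes, taking a screenshot, running javascript). It assumes Selenium 4 and a local Firefox. The names `summarise_page` and `main`, the example URL and the screenshot file name are made up for illustration. Locators are passed as plain strings; "tag name" is the value Selenium's `By.TAG_NAME` constant holds.

```python
def summarise_page(driver, url, screenshot_path="page.png"):
    """Visit url and collect a few facts, letting the browser do the hard work."""
    driver.get(url)                                          # navigate
    links = driver.find_elements("tag name", "a")            # find elements
    hrefs = [link.get_attribute("href") for link in links]   # parse attributes
    driver.save_screenshot(screenshot_path)                  # capture a screenshot
    # run javascript inside the page and get the result back in Python
    ready_state = driver.execute_script("return document.readyState")
    return {"title": driver.title, "hrefs": hrefs, "ready_state": ready_state}


def main():
    # Imported here so the helper above can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        print(summarise_page(driver, "https://example.com"))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `main()` opens a real Firefox window; `summarise_page` only needs an object that behaves like a WebDriver, which keeps it easy to try out in isolation.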
General Recipe
Ingredients:
● firefox (or chrome)
● firebug (or chrome dev tools)
● Selenium IDE
    ○ record a session, write less code
●   python and its batteries
●   python-selenium
●   xvfb and pyvirtualdisplay (optional)
●   other libraries to taste
    ○ eg image manipulation, database access, DOM
      parsing, OCR
General Recipe
Method:
● Install requirements (apt-get, pip etc)
   ○ sudo apt-get install xvfb firefox
   ○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through site
   ○ Add in some assertions along the way
● Export test as Python script
● Hack from there
   ○ Loops
   ○ Image/data extraction
   ○ Wrangling data into a database
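The last step, wrangling scraped data into a database, needs nothing beyond Python's batteries. A minimal sketch using `sqlite3`; the table layout, the function names `save_balances` and `latest_balances`, and the sample values are assumptions for illustration, not taken from the talk's code.

```python
import sqlite3


def save_balances(conn, rows):
    """Store (date, account, amount) tuples, replacing an earlier value for the same day."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS balance (
               scraped_on TEXT NOT NULL,
               account    TEXT NOT NULL,
               amount     REAL NOT NULL,
               PRIMARY KEY (scraped_on, account)
           )"""
    )
    # INSERT OR REPLACE keeps re-runs of the scraper idempotent
    conn.executemany(
        "INSERT OR REPLACE INTO balance (scraped_on, account, amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def latest_balances(conn):
    """Return {account: amount} for the most recent scrape date."""
    cur = conn.execute(
        "SELECT account, amount FROM balance "
        "WHERE scraped_on = (SELECT MAX(scraped_on) FROM balance)"
    )
    return dict(cur.fetchall())
```

A scraper would call this once per run with the values it read from the page, for example `save_balances(sqlite3.connect("balances.db"), rows)`.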
Example from Selenium IDE
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Ingdirect2(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://www.ingdirect.com.au"
        self.verificationErrors = []

    def test_ingdirect2(self):
        driver = self.driver
        driver.get(self.base_url + "/client/index.aspx")
        driver.switch_to.frame("body")  # Had to add this
        driver.find_element(By.ID, "txtCIF").clear()
        driver.find_element(By.ID, "txtCIF").send_keys("12345678")
        driver.find_element(By.ID, "objKeypad_B1").click()
        driver.find_element(By.ID, "objKeypad_B2").click()
        driver.find_element(By.ID, "objKeypad_B3").click()
        driver.find_element(By.ID, "objKeypad_B4").click()
        driver.find_element(By.ID, "btnLogin").click()
        # is_element_present is a helper method generated by the IDE export
        self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))

But what about that dang keypad? ...
PIL saves the day
import base64
import hashlib
import io

from PIL import Image
from selenium.webdriver.common.by import By

# Get screenshot for extraction of button images
screenshot = driver.get_screenshot_as_base64()
im = Image.open(io.BytesIO(base64.b64decode(screenshot)))

table = driver.find_element(By.XPATH, '//*[@id="objKeypad_divShowAll"]/table')
all_buttons = table.find_elements(By.TAG_NAME, "input")

# Determine md5sum of each button by cropping based on element positions
# (getcropbox is a helper that is not shown on the slide)
button_mapping = {}
for button in all_buttons:
    button_image = im.crop(getcropbox(button))
    hexid = hashlib.md5(button_image.tobytes()).hexdigest()
    button_mapping[hexid] = button.get_attribute("id")

# Now we know which button is which (based on previous lookup), enter the PIN
# hex_mapping maps each PIN character to the md5 recorded in that lookup
for char in self.pin:
    driver.find_element(By.ID, button_mapping[hex_mapping[char]]).click()

driver.find_element(By.ID, "btnLogin").click()

# We're in!!!11one
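The slide does not show `getcropbox` or how the two lookup tables fit together. A minimal sketch of both, under stated assumptions: the crop box is derived from the element's `location` and `size` dictionaries (which Selenium's `WebElement` provides), the screenshot is not scaled relative to page coordinates, and plain byte strings stand in for cropped image data. `image_hash` and `pin_to_button_ids` are hypothetical names; the gist linked at the end of the talk may do this differently.

```python
import hashlib


def getcropbox(element):
    """Return PIL's (left, upper, right, lower) box for a Selenium element."""
    left = int(element.location["x"])
    upper = int(element.location["y"])
    right = left + int(element.size["width"])
    lower = upper + int(element.size["height"])
    return (left, upper, right, lower)


def image_hash(image_bytes):
    """Hash raw pixel data; identical button images give identical digests."""
    return hashlib.md5(image_bytes).hexdigest()


def pin_to_button_ids(pin, hex_mapping, button_mapping):
    """Translate PIN characters into the element ids to click this session.

    hex_mapping:    PIN character -> md5 of that character's button image
                    (recorded once, in an earlier lookup)
    button_mapping: md5 of a button image -> element id
                    (rebuilt on every page load, since placement is random)
    """
    return [button_mapping[hex_mapping[char]] for char in pin]
```

The point of the two-step lookup is that the image for a given digit stays the same between visits while its position and element id do not, so the hash is the only stable key.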
But why do all this?
It's my data! ... and I'll graph if I want to

* Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
That's all folks
Slides
● http://bit.ly/scrapium

Code
● https://gist.github.com/3015852

Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
● roger@mindsocket.com.au