Python BeautifulSoup - Parse HTML, XML Documents in Python

This document provides an overview of the Python BeautifulSoup library for parsing HTML and XML documents. It discusses how to install BeautifulSoup and lxml, create BeautifulSoup objects, extract tags and attributes, traverse the document tree, find elements by id or other criteria, and use BeautifulSoup for basic web scraping tasks. Examples are provided for common operations like getting tag names and contents, finding children and descendants, and searching for elements.

Uploaded by

Juan Cuartas

ZetCode

Python BeautifulSoup
last modified July 27, 2020

This is an introductory tutorial to the BeautifulSoup Python library. The examples find tags, traverse the document tree, modify the document, and scrape web pages.

BeautifulSoup
BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web
scraping. BeautifulSoup transforms a complex HTML document into a complex tree of Python
objects, such as tag, navigable string, or comment.
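
To make the three object types concrete, here is a minimal sketch. The HTML string and the kind_of helper are made up for illustration, and the built-in 'html.parser' is used instead of lxml (an assumption, chosen only so that the snippet runs without an extra parser).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def kind_of(node):
    """Classify a node of the parsed tree by its Python type."""
    # Comment is a subclass of NavigableString, so it is tested first.
    if isinstance(node, Comment):
        return 'comment'
    if isinstance(node, NavigableString):
        return 'navigable string'
    if isinstance(node, Tag):
        return 'tag'
    return 'other'


# Hypothetical markup, not part of the tutorial's index.html.
soup = BeautifulSoup('<p>Hello<!-- a note --><b>world</b></p>', 'html.parser')

kinds = [kind_of(node) for node in soup.p.children]
print(kinds)
```

The p tag has three children here: a string, a comment, and a nested tag.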

Installing BeautifulSoup
We use the pip3 command to install the necessary modules.

$ sudo pip3 install lxml

We install the lxml module, which the examples use as the parser for BeautifulSoup.

$ sudo pip3 install bs4

BeautifulSoup is installed with the above command.

The HTML file

In the examples, we will use the following HTML file:

index.html
<!DOCTYPE html>
<html>
<head>
<title>Header</title>
<meta charset="utf-8">
</head>

<body>
<h2>Operating systems</h2>

<ul id="mylist" style="width:150px">

<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>

<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>

<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>

</body>
</html>

Python BeautifulSoup simple example
In the first example, we use the BeautifulSoup module to get three tags.

simple.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.h2)
print(soup.head)
print(soup.li)

The code example prints HTML code of three tags.

from bs4 import BeautifulSoup

We import the BeautifulSoup class from the bs4 module. BeautifulSoup is the main class
for doing work.

with open('index.html', 'r') as f:
    contents = f.read()

We open the index.html file and read its contents with the read method.

soup = BeautifulSoup(contents, 'lxml')

A BeautifulSoup object is created; the HTML data is passed to the constructor. The second argument
specifies the parser.

print(soup.h2)
print(soup.head)

Here we print the HTML code of two tags: h2 and head.

print(soup.li)

There are multiple li elements; the line prints the first one.

$ ./simple.py
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>

This is the output.
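
Accessing a tag as an attribute, such as soup.li, is a shortcut for finding the first matching tag. The following sketch illustrates this on a made-up fragment; it uses the built-in 'html.parser' (an assumption, so that lxml is not required) and also shows what happens when a tag is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not the tutorial's index.html.
html = '<ul><li>Solaris</li><li>FreeBSD</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.li          # attribute access returns the first li tag
same = soup.find('li')   # find returns the same object
missing = soup.table     # there is no table tag in the fragment

print(first)
print(first is same)
print(missing)
```

Since a missing tag gives None, it is worth checking the result before reading attributes such as text from it.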

BeautifulSoup tags, name, text

The name attribute of a tag gives its name and the text attribute its text content.

tags_names.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(f'HTML: {soup.h2}, name: {soup.h2.name}, text: {soup.h2.text}')

The code example prints the HTML code, name, and text of the h2 tag.

$ ./tags_names.py
HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems

This is the output.
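
Besides name and text, a tag also carries its HTML attributes. The sketch below reads the attributes of a ul tag modelled on the one in index.html; the inline string and the 'html.parser' choice are assumptions made to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup

# A shortened copy of the ul element from index.html.
html = '<ul id="mylist" style="width:150px"><li>Solaris</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

ul = soup.ul

print(ul.name)           # the tag name
print(ul['id'])          # dictionary-style access to one attribute
print(dict(ul.attrs))    # all attributes as a dictionary
print(ul.get('class'))   # get returns None for a missing attribute
print(ul.text)           # text content of the tag
```

Dictionary-style access raises KeyError for a missing attribute, so get is the safer choice when an attribute is optional.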

BeautifulSoup traverse tags

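With the recursiveChildGenerator method we traverse the HTML document. This is a legacy method name; current Beautiful Soup releases provide the same traversal through the descendants attribute, which is covered below.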

traverse_tree.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for child in soup.recursiveChildGenerator():
    if child.name:
        print(child.name)

The example goes through the document tree and prints the names of all HTML tags.

$ ./traverse_tree.py
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p

In the HTML document we have these tags.
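
The flat list above hides the nesting of the tags. As a minimal sketch, the traversal can be combined with the parents attribute to indent each tag name by its depth. The sample markup, the two-space indent, and the 'html.parser' choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened document.
html = ('<html><head><title>Header</title></head>'
        '<body><h2>Operating systems</h2></body></html>')
soup = BeautifulSoup(html, 'html.parser')

lines = []

for node in soup.descendants:
    # descendants also yields strings; we keep only tags
    if isinstance(node, Tag):
        # parents ends with the BeautifulSoup object itself, hence the -1
        depth = len(list(node.parents)) - 1
        lines.append('  ' * depth + node.name)

print('\n'.join(lines))
```

Here descendants is used instead of recursiveChildGenerator; both walk the tree in document order.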

BeautifulSoup element children

With the children attribute, we can get the children of a tag.

get_children.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html
root_childs = [e.name for e in root.children if e.name is not None]
print(root_childs)

The example retrieves children of the html tag, places them into a Python list and prints them to
the console. Since the children attribute also returns spaces between the tags, we add a condition
to include only the tag names.

$ ./get_children.py
['head', 'body']

The html tag has two children: head and body.
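
The effect of the whitespace between tags is easy to see on a small list. This is a sketch with a made-up fragment and the built-in 'html.parser' (both assumptions); it prints the children with and without the name filter.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a newline between the tags.
html = '<ul>\n<li>Solaris</li>\n<li>FreeBSD</li>\n</ul>'
soup = BeautifulSoup(html, 'html.parser')

children = list(soup.ul.children)

# strings have no tag name, so their name is None
all_names = [child.name for child in children]
tag_names = [child.name for child in children if child.name is not None]

print(len(children))
print(all_names)
print(tag_names)
```

The ul tag has five children in total, but only two of them are tags.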

BeautifulSoup element descendants

With the descendants attribute we get all descendants (children of all levels) of a tag.

get_descendants.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.body

root_childs = [e.name for e in root.descendants if e.name is not None]

print(root_childs)

The example retrieves all descendants of the body tag.

$ ./get_descendants.py
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']

These are all the descendants of the body tag.

BeautifulSoup web scraping

Requests is a simple Python HTTP library. It provides methods for accessing Web resources via
HTTP.

scraping.py
#!/usr/bin/python

from bs4 import BeautifulSoup

import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

The example retrieves the title of a simple web page. It also prints its parent.

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

We get the HTML data of the page.

print(soup.title)
print(soup.title.text)
print(soup.title.parent)

We retrieve the HTML code of the title, its text, and the HTML code of its parent.

$ ./scraping.py
<title>My html page</title>
My html page
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>My html page</title>
</head>

This is the output.
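
Real pages may be slow, missing, or lack a title. The following is a minimal sketch of a more defensive variant, not the tutorial's own code: the function names title_from_html and fetch_title, the ten-second timeout, and the 'html.parser' choice are assumptions. The parsing step is checked offline with made-up HTML, so the snippet itself makes no network request.

```python
import requests
from bs4 import BeautifulSoup


def title_from_html(html):
    """Return the text of the title tag, or None if there is no title."""
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None:
        return None
    return soup.title.text.strip()


def fetch_title(url, timeout=10):
    """Download a page and return its title (hypothetical helper)."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise an exception for 4xx and 5xx responses
    return title_from_html(resp.text)


# Offline check with made-up pages; fetch_title is not called here.
with_title = title_from_html('<html><head><title>My html page</title></head></html>')
without_title = title_from_html('<p>no title here</p>')

print(with_title)
print(without_title)
```

A call such as fetch_title('http://webcode.me') would then perform the actual download.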

BeautifulSoup prettify code

With the prettify method, we can make the HTML code look better.

prettify.py
#!/usr/bin/python

from bs4 import BeautifulSoup

import requests as req

resp = req.get('http://webcode.me')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.prettify())

We prettify the HTML code of a simple web page.

$ ./prettify.py
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>
My html page
</title>
</head>
<body>
<p>
Today is a beautiful day. We go swimming and fishing.
</p>
<p>
Hello there. How are you?
</p>
</body>
</html>

This is the output.

BeautifulSoup scraping with built-in web server

We can also serve HTML pages with a simple built-in HTTP server.

$ mkdir public
$ cp index.html public/

We create a public directory and copy the index.html there.

$ python -m http.server --directory public

Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Then we start the Python HTTP server.

scraping2.py
#!/usr/bin/python

from bs4 import BeautifulSoup

import requests as req

resp = req.get('http://localhost:8000/')

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.body)

Now we get the document from the locally running server.

BeautifulSoup find elements by Id

With the find method we can find elements by various means including element id.

find_by_id.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

#print(soup.find('ul', attrs={ 'id' : 'mylist'}))

print(soup.find('ul', id='mylist'))

The code example finds the ul tag that has the mylist id. The commented line is an alternative way of
doing the same task.
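
The two forms can be compared directly. The sketch below runs them on a made-up fragment with the built-in 'html.parser' (both assumptions) and also shows the result of a search that matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two lists.
html = '<ul id="mylist"><li>Solaris</li></ul><ul id="other"></ul>'
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find('ul', id='mylist')
by_attrs = soup.find('ul', attrs={'id': 'mylist'})
by_id_only = soup.find(id='mylist')     # the tag name is optional
missing = soup.find('ul', id='nope')    # no such id in the fragment

print(by_keyword is by_attrs)
print(by_keyword is by_id_only)
print(missing)
```

All three searches return the same tag, while the unsuccessful search returns None.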

BeautifulSoup find all tags

With the find_all method we can find all elements that meet some criteria.

find_all.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for tag in soup.find_all('li'):
    print(f'{tag.name}: {tag.text}')

The code example finds and prints all li tags.

$ ./find_all.py
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows

This is the output.
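
The result of find_all behaves like a list. As a small sketch on a made-up fragment (parsed with the built-in 'html.parser', an assumption), we can also cap the number of results with the limit argument.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment.
html = '<ul><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

all_items = [li.text for li in soup.find_all('li')]
first_two = [li.text for li in soup.find_all('li', limit=2)]
no_tables = soup.find_all('table')   # nothing matches

print(all_items)
print(first_two)
print(len(no_tables))
```

Unlike find, which returns None, find_all returns an empty list when nothing matches, so it is safe to loop over the result directly.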

The find_all method can take a list of elements to search for.

find_all2.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(['h2', 'p'])

for tag in tags:
    print(' '.join(tag.text.split()))

The example finds all h2 and p elements and prints their text.

The find_all method can also take a function which determines what elements should be
returned.

find_by_fun.py
#!/usr/bin/python

from bs4 import BeautifulSoup

def myfun(tag):
    return tag.is_empty_element

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(myfun)
print(tags)

The example prints empty elements.

$ ./find_by_fun.py
[<meta charset="utf-8"/>]

The only empty element in the document is meta.
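
Any function that takes a tag and returns a boolean value can serve as a filter. As a sketch, the hypothetical has_id function below selects tags that carry an id attribute; the fragment and the 'html.parser' choice are assumptions as well.

```python
from bs4 import BeautifulSoup


def has_id(tag):
    """Hypothetical filter: keep tags that define an id attribute."""
    return tag.has_attr('id')


# Made-up fragment.
html = ('<h2>Operating systems</h2>'
        '<ul id="mylist"><li>Solaris</li></ul>'
        '<p id="note">Text</p>')
soup = BeautifulSoup(html, 'html.parser')

found = [tag.name for tag in soup.find_all(has_id)]

print(found)
```

The function is called once for each tag in the document; strings are not passed to it.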

It is also possible to find elements by using regular expressions.

regex.py
#!/usr/bin/python

import re

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

strings = soup.find_all(string=re.compile('BSD'))

for txt in strings:
    print(' '.join(txt.split()))

The example prints the text of the strings that contain 'BSD'.

$ ./regex.py
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded plat

This is the output.
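
Note that a search with the string argument returns the matching strings, not the tags that contain them. The following sketch, run on a made-up fragment with the built-in 'html.parser' (both assumptions), reaches the enclosing tag through the parent attribute.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment.
html = '<ul><li>FreeBSD</li><li>Debian</li><li>NetBSD</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

strings = soup.find_all(string=re.compile('BSD'))

# pair each matching string with the name of its enclosing tag
pairs = [(s.parent.name, str(s)) for s in strings]

print(pairs)
```

The pattern is applied as a search, so it matches anywhere inside the string.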

BeautifulSoup CSS selectors

With the select and select_one methods, we can use some CSS selectors to find elements.

select_nth_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select('li:nth-of-type(3)'))

This example uses a CSS selector to print the HTML code of the third li element.

$ ./select_nth_tag.py
[<li>Debian</li>]

The select method returns a list of matches; here it contains the third li element.

The # character is used in CSS to select tags by their id attributes.

select_by_id.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select_one('#mylist'))

The example prints the element that has mylist id.
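
The difference between the two methods is in what they return. The sketch below compares them on a made-up fragment parsed with the built-in 'html.parser' (both assumptions).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the list in index.html.
html = '<ul id="mylist"><li>Solaris</li><li>FreeBSD</li><li>Debian</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

matches = soup.select('li:nth-of-type(3)')    # a list of all matches
one = soup.select_one('li:nth-of-type(3)')    # the first match only
by_id = soup.select_one('#mylist')
nothing = soup.select_one('#missing')         # no element has this id

print(matches)
print(one)
print(by_id.name)
print(nothing)
```

The select method gives a list even for a single match, whereas select_one gives the tag itself or None.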

BeautifulSoup append element

The append method appends a new tag to the HTML document.

append_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')
newtag.string='OpenBSD'

ultag = soup.ul

ultag.append(newtag)

print(ultag.prettify())

The example appends a new li tag.

newtag = soup.new_tag('li')
newtag.string='OpenBSD'

First, we create a new tag with the new_tag method.

ultag = soup.ul

We get the reference to the ul tag.

ultag.append(newtag)

We append the newly created tag to the ul tag.

print(ultag.prettify())

We print the ul tag in a neat format.

BeautifulSoup insert element

The insert method inserts a tag at the specified location.

insert_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')
newtag.string='OpenBSD'

ultag = soup.ul

ultag.insert(2, newtag)

print(ultag.prettify())

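The example inserts a li tag at index 2 of the ul tag's contents, that is, at the third position. The whitespace strings between the tags count as positions too, so the new tag should land right after the first li element.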
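
The position passed to insert is an index into the contents of the tag, which includes the strings between the tags. As a sketch under that reading, the following snippet uses a made-up fragment and the built-in 'html.parser' (both assumptions) and also shows insert_after, which places a tag relative to another element instead of by index.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a newline between the tags.
html = '<ul>\n<li>Solaris</li>\n<li>FreeBSD</li>\n</ul>'
soup = BeautifulSoup(html, 'html.parser')

ul = soup.ul

# contents is ['\n', <li>Solaris</li>, '\n', <li>FreeBSD</li>, '\n']
newtag = soup.new_tag('li')
newtag.string = 'OpenBSD'
ul.insert(2, newtag)

# insert_after does not depend on the whitespace positions
lasttag = soup.new_tag('li')
lasttag.string = 'NetBSD'
ul.find_all('li')[-1].insert_after(lasttag)

names = [li.text for li in ul.find_all('li')]

print(names)
```

Index 2 places the new tag directly after the first li element, because index 0 is a whitespace string.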

BeautifulSoup replace text

The replace_with method replaces the text of an element.

replace_text.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# string= is the current name of the older text= argument
tag = soup.find(string='Windows')
tag.replace_with('OpenBSD')

print(soup.ul.prettify())

The example finds a specific string with the find method and replaces it with the
replace_with method.
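
Since find returns None when the string is not present, a guard avoids an error in that case. The following is a minimal sketch on a made-up fragment, parsed with the built-in 'html.parser' (both assumptions); it also keeps the value returned by replace_with, which is the string that was replaced.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment.
html = '<ul><li>Solaris</li><li>Windows</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

old = None
target = soup.find(string='Windows')

if target is not None:
    # replace_with returns the element that was removed from the tree
    old = target.replace_with('OpenBSD')

names = [li.text for li in soup.find_all('li')]

print(old)
print(names)
```

Only the string inside the li tag is swapped; the tag itself stays in place.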

BeautifulSoup remove element

The decompose method removes a tag from the tree and destroys it.

decompose_tag.py
#!/usr/bin/python

from bs4 import BeautifulSoup

with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

ptag2 = soup.select_one('p:nth-of-type(2)')

ptag2.decompose()

print(soup.body.prettify())

The example removes the second p element.
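
When the removed element is still needed, the extract method is an alternative: it detaches the element from the tree and returns it. The sketch below contrasts the two methods on a made-up fragment parsed with the built-in 'html.parser' (both assumptions).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment.
html = '<body><p>First</p><p>Second</p><p>Third</p></body>'
soup = BeautifulSoup(html, 'html.parser')

# extract removes the tag from the tree and hands it back
removed = soup.select_one('p:nth-of-type(2)').extract()

# decompose removes the tag and destroys it
soup.p.decompose()

remaining = [p.text for p in soup.find_all('p')]

print(removed.text)
print(removed.parent)
print(remaining)
```

The extracted tag can be read or inserted elsewhere, while a decomposed tag should not be used again.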

In this tutorial, we have worked with the Python BeautifulSoup library.
