
Extract URL from HTML Link Using Python Regular Expression
URL stands for Uniform Resource Locator; it identifies the location of a resource on the internet. For example, the following URLs identify the Google and Microsoft websites:
https://www.google.com
https://www.microsoft.com
A URL consists of components such as a protocol, a domain name, a port number, and a path. A URL can be parsed and processed using regular expressions, which Python supports through its built-in re module.
Example
The following example shows a URL and the parts that can be extracted from it:
URL: https://www.tutorialspoint.com/courses
If we parse the above URL, we can find the website name and the protocol:
Hostname: tutorialspoint.com
Protocol: https
Regular Expressions
In Python, a regular expression is a search pattern used to find matching strings.
The re module provides several functions for working with regular expressions. Four commonly used ones are:
search(): scans the whole string and returns the first match.
match(): returns a match only if the pattern matches at the beginning of the string.
findall(): returns all non-overlapping matches.
sub(): replaces every substring that matches the pattern with a new string.
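As a rough illustration of how these four functions differ, the sketch below applies each of them to the same sample string. The sample text and the pattern are assumptions chosen only for demonstration; they are not a complete URL grammar.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
text = "Visit https://www.tutorialspoint.com or http://example.com today"
pattern = r"https?://[\w.\-]+"

# search() scans the whole string and returns the first match (or None).
first = re.search(pattern, text)
print(first.group())

# match() succeeds only if the pattern matches at the start of the string,
# so it returns None here because the text starts with "Visit".
at_start = re.match(pattern, text)
print(at_start)

# findall() returns every non-overlapping match as a list.
all_urls = re.findall(pattern, text)
print(all_urls)

# sub() replaces every match with the given replacement string.
masked = re.sub(pattern, "<URL>", text)
print(masked)
```

Note that search() and match() return a match object (or None), while findall() returns a list and sub() returns a new string.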
To find every occurrence of a pattern in a string, such as all the URLs in an HTML fragment, we use the re.findall() function from the re module.
Syntax
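The syntax of the re.findall() function is as follows:
re.findall(pattern, string, flags=0)
Here, pattern is the regular expression, string is the text to search, and the optional flags argument modifies how matching is done. The call returns all non-overlapping matches of the pattern in the string as a list of strings. If the pattern contains one capture group, the list holds the text of that group; if it contains several groups, the list holds one tuple per match.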
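One detail worth seeing in practice is that the shape of the list returned by re.findall() depends on whether the pattern contains capture groups. The following is a minimal sketch using a made-up input string and simplified patterns:

```python
import re

# Hypothetical input string, used only for this illustration.
text = "http://tutorialspoint.com and https://www.tutorialspoint.com/market"

# No capture group: each list item is the full matched text.
no_group = re.findall(r"https?://[\w.\-]+", text)
print(no_group)

# One capture group: each item is only the text captured by that group.
one_group = re.findall(r"(https?)://[\w.\-]+", text)
print(one_group)

# Two capture groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(https?)://([\w.\-]+)", text)
print(two_groups)
```

This behaviour explains why some of the examples in this article print a list of strings and others print a list of tuples.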
Example
To extract a URL, we can use the following code −
import re

text = '<p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>'

# Match http:// or https:// followed by characters that commonly appear in URLs.
# A raw string (r'...') keeps the backslashes in the pattern intact.
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)

print("Original string:", text)
print("Urls:", urls)
Output
Following is the output of the above program, when executed.
Original string: <p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>
Urls: ['http://tutorialspoint.com', 'https://www.tutorialspoint.com/market/index.asp']
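The pattern above matches anything that looks like a URL anywhere in the text. If only the targets of links are wanted, one possible variation is to capture the value of each href attribute instead. The sketch below assumes the attribute values are wrapped in double quotes, as in the sample text; it is not a general HTML parser, and the variable name links is just an illustrative choice.

```python
import re

# Same kind of HTML fragment as in the example above.
text = ('<p>Hello World: </p>'
        '<a href="http://tutorialspoint.com">More Courses</a>'
        '<a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>')

# Capture whatever sits between the double quotes of an href attribute
# inside an <a ...> tag.
# Assumption: values are double-quoted and contain no embedded quotes.
links = re.findall(r'<a\s+[^>]*href="([^"]*)"', text)
print("Links:", links)
```

Because the pattern has a single capture group, re.findall() returns only the captured attribute values rather than the whole tag.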
Example
The program below demonstrates how to extract the hostname and protocol from a given URL.
import re

website = 'https://www.tutorialspoint.com/'

# To find the protocol: the word characters just before "://"
protocol = re.findall(r'(\w+)://', website)
print(protocol)

# To find the host name: the text after "://www." (the dot is escaped so it matches literally)
hostname = re.findall(r'://www\.([\w\-.]+)', website)
print(hostname)
Output
Following is the output of the above program, when executed.
['https']
['tutorialspoint.com']
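The hostname pattern above only matches when the host begins with www. A possible adjustment, sketched here with hypothetical sample URLs, makes that prefix optional by wrapping it in a non-capturing group.

```python
import re

# Hypothetical sample URLs, with and without the "www." prefix.
websites = ['https://www.tutorialspoint.com/', 'http://tutorialspoint.com/courses']

hostnames = []
for website in websites:
    # (?:www\.)? optionally skips a leading "www." without capturing it.
    found = re.findall(r'://(?:www\.)?([\w\-.]+)', website)
    hostnames.append(found)
    print(website, '->', found)
```

Both URLs yield the same hostname list, since the optional group consumes the prefix when it is present and matches nothing when it is absent.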
Example
The following program uses capture groups to split a general URL into its protocol, hostname, and path elements (file name and extension).
import re

# The URL to split
url = 'http://www.tutorialspoint.com/index.html'

# Four capture groups: protocol, host name, file name, file extension.
# The dot before the extension is escaped so it matches a literal ".".
parts = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', url)
print(parts)
Output
Following is the output of the above program, when executed.
[('http', 'www.tutorialspoint.com', 'index', 'html')]
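When a pattern has several capture groups, positional tuples can be hard to read. One alternative, sketched below, is to name the groups and use re.search(), whose match object exposes them through groupdict(). The group names (protocol, host, name, ext) are illustrative choices, not part of the original example.

```python
import re

url = 'http://www.tutorialspoint.com/index.html'

# Named groups (?P<name>...) label each part of the URL.
pattern = r'(?P<protocol>\w+)://(?P<host>[\w\-.]+)/(?P<name>\w+)\.(?P<ext>\w+)'

match = re.search(pattern, url)
if match:
    # groupdict() maps each group name to the text it captured.
    parts = match.groupdict()
    print(parts)
```

Individual parts can then be read by name, for example parts['host'], instead of by position in a tuple.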