Extract text from PDF File using Python
Last Updated: 09 Aug, 2024
PDF (Portable Document Format) is one of the most widely used formats for digital documents. PDF files use the .pdf extension and are designed to present and exchange documents reliably, independent of software, hardware, or operating system.
In this article, we will extract text from PDF files using two Python libraries: pypdf and PyMuPDF.
Extracting text from a PDF file using the pypdf library
The pypdf package handles text extraction, and it can do more than that: it can also be used to generate, decrypt, and merge PDF files. Note: For more information, refer to Working with PDF files in Python.
Installation
To install this package, run the following command in the terminal:
pip install pypdf
Example: The code below assumes an input file named example.pdf in the current working directory.
Python
# importing required modules
from pypdf import PdfReader
# creating a pdf reader object
reader = PdfReader('example.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
Output: the number of pages in the file, followed by the text of the first page.
Let us walk through the above code in chunks:
reader = PdfReader('example.pdf')
- We create an object of the PdfReader class from the pypdf module.
- The PdfReader class takes one required argument: the path to the PDF file (an open file-like object also works).
print(len(reader.pages))
- The pages property gives a list-like sequence of PageObject instances, so we can use Python's built-in len() function to get the number of pages in the PDF file.
page = reader.pages[0]
- Because reader.pages behaves like a list of PageObject instances, we can get a specific page of the PDF by its index. Python indexing starts from 0, so reader.pages[0] gives us the first page of the PDF file.
text = page.extract_text()
print(text)
- The page object has an extract_text() method that returns the text of that page as a string.
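The example above reads only the first page. As a minimal sketch of extending it to several pages, the helper below loops over page indexes and joins the results. The names join_page_text and extract_pdf_text, and the start, stop, and separator parameters, are hypothetical and not part of pypdf; only PdfReader, pages, and extract_text() come from the library.

```python
def join_page_text(pages, start=0, stop=None, separator="\n"):
    """Join the text of pages[start:stop] into one string.

    `pages` is any indexable sequence whose items have an extract_text()
    method, such as reader.pages from pypdf. (Hypothetical helper.)
    """
    # Clamp the upper bound so an oversized `stop` does not raise IndexError.
    last = len(pages) if stop is None else min(stop, len(pages))
    chunks = []
    for index in range(start, last):
        # Defensive fallback: treat a missing result as an empty string.
        chunks.append(pages[index].extract_text() or "")
    return separator.join(chunks)


def extract_pdf_text(path, start=0, stop=None):
    """Open the PDF at `path` with pypdf and return the joined page text."""
    # Imported here so join_page_text can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages, start=start, stop=stop)
```

Assuming example.pdf exists, print(extract_pdf_text('example.pdf')) would print the text of every page, and extract_pdf_text('example.pdf', stop=2) would limit the result to the first two pages.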
Extracting text from a PDF file using the PyMuPDF library
PyMuPDF is a Python library that supports file formats such as XPS, PDF, CBR, and CBZ. In this article, we concentrate on PDF files.
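Installation
pip install pymupdf
Note: Do not also run pip install fitz. The fitz package on PyPI is an unrelated project, and installing it can break the import fitz statement that PyMuPDF provides.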
To extract the text from the PDF, we follow these steps:
- Importing the library
- Opening document
- Extracting text
Note: The examples below use a file named sample.pdf; substitute the path to any PDF file of your own.
1. Importing the library
Python
import fitz
2. Opening document
Python
doc = fitz.open('sample.pdf')
Here we create a document object called doc. The filename is passed to fitz.open() as a Python string.
3. Extracting text
Python
for page in doc:
    text = page.get_text()
    print(text)
Here, we iterate over the pages of the PDF and use the get_text() method to extract the text of each page.
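When the output spans many pages, it can help to keep track of which page each piece of text came from. The following is a minimal sketch under that assumption: number_pages and print_pdf_by_page are hypothetical helper names, not part of PyMuPDF; only fitz.open() and get_text() come from the library.

```python
def number_pages(doc, start=1):
    """Return a list of (page_number, text) tuples for an iterable document.

    `doc` is any iterable of objects with a get_text() method, such as a
    PyMuPDF document. (Hypothetical helper, numbering starts at `start`.)
    """
    labelled = []
    for page_number, page in enumerate(doc, start=start):
        labelled.append((page_number, page.get_text()))
    return labelled


def print_pdf_by_page(path):
    """Print the text of each page of the PDF at `path` under a page header."""
    # Imported here so number_pages can be used without PyMuPDF installed.
    import fitz

    # The with-block closes the document once the loop has finished.
    with fitz.open(path) as doc:
        for page_number, text in number_pages(doc):
            print(f"--- page {page_number} ---")
            print(text)
```

Assuming sample.pdf exists, calling print_pdf_by_page('sample.pdf') would print each page's text under its own header line.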
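Complete code to extract the text
Python
import fitz
# open the document
doc = fitz.open('sample.pdf')
# accumulate the text of every page in one string
text = ""
for page in doc:
    text += page.get_text()
# print the combined text of all pages
print(text)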
Output: the combined text of all pages in sample.pdf.

Conclusion
We have seen two Python libraries, pypdf and PyMuPDF, that can extract text from a PDF file.