How to Extract PDF Tables in Python?

Last Updated : 27 May, 2025

When handling data in PDF files, you may need to extract tables for use in Python programs. PDFs (Portable Document Format) preserve the layout of text, images and tables across platforms, making them ideal for sharing consistent document formats. For example, a PDF might contain a table like:

User_ID  Name   Occupation
1        David  Product Manager
2        Leo    IT Administrator
3        John   Lawyer


We want to read this table into a Python program. This can be done in several ways; let's discuss them one by one.

Using pdfplumber

If you want a straightforward way to peek inside your PDF and pull out tables without too much hassle, pdfplumber is a great choice. It carefully looks at each page and finds the tables by understanding the layout, then gives you the rows and columns so you can use them in your program.

Python
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    for p in pdf.pages:
        for t in p.extract_tables():
            for r in t:
                print(r)

Output

[Figure: Output using pdfplumber]

Explanation: This code uses pdfplumber.open() to safely open the PDF, iterates through pages with pdf.pages, extracts tables using extract_tables() and prints each row as a list of cell values for easy readability.
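
Printing rows is fine for a quick look, but you will usually want named fields. The sketch below is one possible way to do that, assuming the first row of each table is its header and that empty cells come back as None. The helper names table_to_records and extract_records are made up for this example and are not part of pdfplumber.

```python
def table_to_records(table):
    """Convert a table (list of rows, first row = header) into a list of dicts."""
    if not table:
        return []
    # Empty cells may be None, so replace them with "" before stripping.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_records(pdf_path):
    """Collect records from every table on every page of the PDF."""
    import pdfplumber  # imported here so table_to_records works without it

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                records.extend(table_to_records(table))
    return records
```

Calling extract_records("example.pdf") would then return one dictionary per table row, keyed by the column names.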

Using Camelot

When your PDF has cleanly drawn tables, with clear ruling lines or consistent spacing between columns, Camelot works well. It detects these tables and turns them into pandas DataFrames you can easily handle in Python. It's very handy if you want quick and clean results.

Python
import camelot

# Read tables
a = camelot.read_pdf("test.pdf")

# Print first table
print(a[0].df)

Explanation: camelot.read_pdf() extracts tables from the PDF file "test.pdf" and stores all detected tables in the variable a. By default it reads only the first page. The first table (a[0]) is then accessed and its content is printed as a DataFrame using .df.
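
Camelot has two parsing modes: "lattice" for tables drawn with ruling lines (the default) and "stream" for tables separated only by whitespace. The sketch below shows how the choice might be wired up. The helpers choose_flavor and read_tables are hypothetical names for this example, and whether a given PDF has ruling lines is something you would have to check yourself.

```python
def choose_flavor(has_ruling_lines):
    """Pick a Camelot parsing flavor from how the table is drawn."""
    return "lattice" if has_ruling_lines else "stream"


def read_tables(pdf_path, pages="1", has_ruling_lines=True):
    """Read tables from the given pages and return them as DataFrames."""
    import camelot  # imported here so choose_flavor works without Camelot

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor=choose_flavor(has_ruling_lines)
    )
    for table in tables:
        # parsing_report gives a rough quality indication per table.
        print(table.parsing_report)
    return [table.df for table in tables]
```

For example, read_tables("test.pdf", pages="all", has_ruling_lines=False) would try whitespace-based detection on every page.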

Using Tabula-py

If you don’t mind installing a bit of Java on your computer, Tabula-py is a powerful helper that uses a popular Java tool behind the scenes. It’s super good at grabbing tables from PDFs, even complex ones, and hands you the data as tidy tables inside Python.

Python
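from tabula import read_pdf
from tabulate import tabulate

# With pages="all", read_pdf returns a list of DataFrames, one per detected table
dfs = read_pdf("abc.pdf", pages="all")  # path to the PDF file
for df in dfs:
    print(tabulate(df, headers="keys"))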

Explanation: This code uses read_pdf() from Tabula-py to extract tables from all pages of "abc.pdf". With pages="all", read_pdf() returns a list of DataFrames, one per detected table, and tabulate() formats them as plain-text tables for printing.
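
If you would rather save the tables than print them, one option is to write each DataFrame to its own CSV file. This is only a sketch: the csv_name and export_tables helpers and the "<name>_table_<n>.csv" naming scheme are assumptions made for this example.

```python
from pathlib import Path


def csv_name(pdf_path, index):
    """Build an output name such as 'abc_table_1.csv' for the index-th table."""
    return f"{Path(pdf_path).stem}_table_{index}.csv"


def export_tables(pdf_path, out_dir="."):
    """Write every table found in the PDF to its own CSV file."""
    from tabula import read_pdf  # imported here so csv_name works without Java

    tables = read_pdf(pdf_path, pages="all")
    written = []
    for i, df in enumerate(tables, start=1):
        target = Path(out_dir) / csv_name(pdf_path, i)
        df.to_csv(target, index=False)
        written.append(target)
    return written
```

Calling export_tables("abc.pdf") would return the list of files it wrote.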

Using PyMuPDF

Sometimes, tables aren't perfectly formatted, or you want all the text details, not just tables. PyMuPDF lets you open PDFs and extract all the text along with its layout, giving you full control. The approach shown here does not detect tables for you, so expect some manual work (recent PyMuPDF releases also offer a page.find_tables() method for table detection). It's a flexible tool when the other libraries fall short.

Python
import fitz 

d = fitz.open("example.pdf")
for p in d:
    t = p.get_text("dict")
    print(t)  

Output

[Figure: Output using PyMuPDF]

Explanation: This code opens the PDF file "example.pdf" using PyMuPDF (fitz). It loops through each page, extracts the page’s text as a detailed dictionary (get_text("dict")), which includes text blocks, fonts and layout info, then prints this structured text data.
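
The dictionary nests blocks, lines and spans, and each span carries its text and a bounding box. A rough way to turn that into table rows is to group spans whose top edges sit at nearly the same height. The sketch below assumes that cells of one row share (almost) the same top coordinate; the function name rows_from_page_dict and the tolerance value are choices made for this example, not part of PyMuPDF.

```python
def rows_from_page_dict(page_dict, y_tolerance=3.0):
    """Group text spans into rows by their top coordinate (simple heuristic)."""
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line["spans"]:
                text = span["text"].strip()
                if text:
                    x0, y0, _, _ = span["bbox"]
                    spans.append((y0, x0, text))
    spans.sort()  # top to bottom, then left to right

    rows = []
    current, current_y = [], None
    for y0, x0, text in spans:
        # Start a new row once a span sits clearly below the current row.
        if current_y is not None and abs(y0 - current_y) > y_tolerance:
            rows.append([t for _, t in sorted(current)])
            current = []
        if not current:
            current_y = y0
        current.append((x0, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows
```

Inside the loop above you could call rows_from_page_dict(p.get_text("dict")) instead of printing the raw dictionary. Multi-line cells or uneven baselines will need a smarter grouping rule.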

