Extract Tables from PDF with Python — Complete Code Guide 2026

Q: Can Python extract tables from PDF?

Yes. Python libraries such as tabula-py, Camelot, and PDFPlumber read born-digital PDFs and return tables as pandas DataFrames or nested lists. For scanned PDFs you must run OCR first (for example Tesseract via pytesseract), then parse the text or use layout-aware tools. Production pipelines often combine extraction, validation, and export to Excel or CSV.

Q: Which Python library is best for PDF table extraction?

There is no single winner. tabula-py is fastest to try on clear digital tables. Camelot excels on ruled grids and merged cells with lattice or stream flavors. PDFPlumber offers fine-grained control over text and table boundaries. PyMuPDF is best for speed and text; pdfminer.six for low-level parsing. Choose based on whether your PDFs are digital or scanned and how complex the layout is.

Q: How do I extract tables from scanned PDFs in Python?

Convert each page to an image (pdf2image), run OCR with pytesseract or a cloud API, then rebuild table structure with regex, layout heuristics, or an AI service. Tabula, Camelot, and PDFPlumber do not read image-only PDFs reliably. Platforms like Inputo run OCR and table detection together and export to Excel without custom code.

Q: Can I export extracted tables to Excel with Python?

Yes. pandas DataFrames from tabula-py and Camelot support to_excel() with openpyxl or xlsxwriter. PDFPlumber returns lists of rows that you can wrap in a DataFrame before export. For multiple tables per document, write one sheet per table or one file per table depending on your downstream workflow.

Q: Do I need OCR for PDF table extraction?

You need OCR only when the PDF has no selectable text layer — typical of scans, photos, and faxed documents. Born-digital PDFs exported from Excel, accounting software, or browsers already contain vector text; table libraries read those directly. If copy-paste from the PDF yields gibberish or nothing, assume OCR is required.

Q: What if my PDF has complex merged cells?

Use Camelot with flavor lattice for bordered tables, or stream for whitespace-aligned tables. PDFPlumber lets you tune table settings (vertical strategy, snap tolerance). Merged cells often need manual cleanup in Excel or custom post-processing because automatic detectors split or duplicate spans. AI-based extractors infer header spans more robustly on irregular layouts.

Bank statements, invoices, scientific papers, and payroll reports often arrive as PDFs — and the numbers you need live inside tables, not in plain paragraphs. Manually retyping rows into Excel wastes hours and invites errors. Python is the language teams reach for first when they need to extract tables from PDF at scale: rich libraries, direct export to pandas and Excel, and easy integration with data pipelines, cron jobs, and APIs.

If you searched for "extract tables from pdf python", "extract table from pdf", or "extract table data from pdf", you want working code, not theory alone. This guide compares the main libraries, walks through three copy-paste methods (Tabula, Camelot, PDFPlumber), covers scanned documents with OCR, and explains when to move from a local script to a production platform like Inputo.

Prerequisites: Python 3.9+, a virtual environment, and sample PDFs that contain real tables (not only images). Copy each block into your project and adjust file paths and page ranges to match your documents.

Why Python is ideal for PDF table extraction

PDF is a presentation format, not a database. Tables are drawn with lines, text positioning, and invisible structure that varies by vendor. Python’s ecosystem treats that mess as a solvable engineering problem rather than a one-off chore.

DataFrames are native. tabula-py and Camelot return pandas tables immediately. You filter, merge, validate dtypes, and call to_excel() without leaving the language your analysts already use.

Automation is straightforward. A twenty-line script can watch a folder, process new invoices nightly, and upload results to S3 or a database. The same patterns you use for ETL apply to document ingestion.
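As a rough illustration of that folder-watching pattern, here is a minimal sketch. The function name process_new_pdfs, the inbox/done folder layout, and the extract callback are assumptions made for this example; plug in whichever extraction method from this guide suits your files, and schedule the script with cron or Task Scheduler.

```python
from pathlib import Path
from typing import Callable, List


def process_new_pdfs(inbox: Path, done: Path, extract: Callable[[Path], None]) -> List[str]:
    """Run `extract` on every PDF in `inbox`, then move each successful file to `done`."""
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf_path in sorted(inbox.glob("*.pdf")):
        try:
            extract(pdf_path)
        except Exception as exc:
            # Keep the batch going; a failed file stays in the inbox for review
            print(f"FAILED {pdf_path.name}: {exc}")
            continue
        pdf_path.rename(done / pdf_path.name)
        processed.append(pdf_path.name)
    return processed
```

Moving files only after a successful run means a crash or a bad PDF never silently drops a document: whatever is still in the inbox has not been processed.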

Debugging is transparent. Unlike closed desktop tools, you can log intermediate rows, plot Camelot’s table overlays, or tune PDFPlumber’s detection settings when a layout breaks.

Composition with AI and OCR. When rule-based extractors fail on scans, Python glue code connects Tesseract, cloud vision APIs, or services like Inputo that combine OCR with layout understanding — see our broader guide on how to extract data from PDF files using AI.

The trade-off is maintenance: Java runtimes for Tabula, Ghostscript for Camelot, and edge-case PDFs that need per-template tuning. For many finance and operations teams, that cost is still lower than manual entry — until volume and scan quality push you toward a managed extractor.

Why extract tables from PDF?

Organisations sit on terabytes of “trapped” data: values visible on screen but not in a queryable database. Tables are the densest trap — salaries by employee, line items on invoices, trial results in research PDFs, transaction grids on bank exports.

Banking and finance. Monthly statements list debits, credits, and balances in multi-page grids. Risk and accounting teams need those rows in Excel or a warehouse for reconciliation and forecasting.

Invoices and purchase orders. Vendor PDFs mix header fields with line-item tables (description, quantity, unit price, tax). AP automation starts by pulling the table reliably; field-level parsing is the next step — covered in depth in AI invoice processing.

Scientific and regulatory reports. Journals and agencies publish supplementary tables as PDF attachments. Researchers extract cohort statistics, adverse events, or emissions data for meta-analysis.

Payroll and HR. Payslips and social-security filings repeat employee blocks with earnings, deductions, and employer contributions. Gestorías and HRIS imports expect one row per employee, not a stack of PDF pages.

Operations and logistics. Packing lists, customs declarations, and inventory snapshots arrive as PDF tables from partners who will never send a CSV.

Extracting tables programmatically turns archival documents into live datasets. The business question is not whether extraction is possible, but which tool matches your PDF type (digital vs scanned) and your accuracy bar before human review.

Top Python libraries for PDF table extraction

No single package wins every document. The table below summarises six common choices. Test on your files: a library that shines on US IRS forms may struggle on European invoices with comma decimals and merged VAT columns.

Library	Best for	Pros	Cons
tabula-py	Structured tables in digital PDFs	Easy API, accurate on clear grids, returns DataFrames	Requires Java; no OCR for scanned PDFs
camelot	Complex tables with merged cells	High accuracy on ruled tables; visual debugging (`plot()`)	Slower; no OCR; Ghostscript dependency
pdfplumber	Precise text + table extraction	Detailed metadata; strong on mixed text/table pages	No native OCR; tuning needed on borderless tables
PyMuPDF (fitz)	Speed, text extraction, rendering	Very fast, lightweight, good for text-first pipelines	Table detection less accurate than dedicated tools
pandas + tabula	Data analysis pipelines	Direct to DataFrame; fits Jupyter and Airflow workflows	Depends on tabula (or another extractor); no OCR for scans
pdfminer.six	Low-level PDF parsing	Full control; pure Python option for text coordinates	Steep learning curve; you build table logic yourself

Installation snapshot: pip install tabula-py pandas openpyxl for Method 1; pip install camelot-py[cv] openpyxl for Method 2; pip install pdfplumber for Method 3. Install system Java for Tabula and verify Ghostscript for Camelot per their docs.

Method 1: Extract tables with Tabula-py

tabula-py wraps the Tabula Java engine. It works best on born-digital PDFs where tables have visible structure and selectable text. Point it at an invoice or statement, and you get a list of pandas DataFrames, one per detected table.

Install with pip install tabula-py and ensure Java is on your PATH. On Linux you may need apt install default-jre or equivalent.

The script below reads every page, prints the head of each table for inspection, and writes separate Excel files. Adjust pages to a range like "1-3" for faster tests.

import tabula
# Read PDF into DataFrame list
tables = tabula.read_pdf("invoice.pdf", pages="all", multiple_tables=True)
for i, table in enumerate(tables):
    print(f"Table {i+1}:")
    print(table.head())
    table.to_excel(f"table_{i+1}.xlsx", index=False)

When Tabula excels: US tax forms, simple invoice line items, CSV-like grids exported from accounting software. When it struggles: borderless tables, heavy rotation, or PDFs that are photographs of paper. An image-only page has no text layer, so Tabula finds nothing to parse unless you OCR first.

Tips: Use lattice=True or stream=True (Tabula’s two modes) when defaults miss tables. Pass area=[top, left, bottom, right] in PDF points to crop a region if headers confuse detection. Combine with pandas for concatenation: pd.concat(tables, ignore_index=True) when pages continue one logical table.
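The tips above can be sketched as two small helpers. The names read_region and merge_continued_tables are made up for this example, and the rule of stacking only tables whose columns match the first one is an assumption about what "one logical table" means in your documents.

```python
import pandas as pd


def read_region(pdf_path, area, pages="all"):
    """Read ruled tables from one page region; `area` is [top, left, bottom, right] in PDF points."""
    import tabula  # lazy import: requires tabula-py and a Java runtime

    return tabula.read_pdf(pdf_path, pages=pages, area=area, lattice=True, multiple_tables=True)


def merge_continued_tables(tables):
    """Stack tables whose columns match the first one; report the indexes that were left out."""
    if not tables:
        return pd.DataFrame(), []
    expected = list(tables[0].columns)
    matching = [t for t in tables if list(t.columns) == expected]
    skipped = [i for i, t in enumerate(tables) if list(t.columns) != expected]
    return pd.concat(matching, ignore_index=True), skipped
```

Check the skipped indexes before trusting the merged result: on continuation pages without a repeated header row, the first data row may have been read as column names, and those tables need their header repaired before they can be stacked.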

For a no-code path to Excel, the free PDF to Excel converter handles many of the same documents through a browser — useful when you are validating whether automation is worth building.

Method 2: Extract tables with Camelot

Camelot targets analysts who need accuracy metrics and visual QA. It reports how well a table was parsed and can export plots showing detected cell boundaries — invaluable when a finance partner sends slightly different PDF templates each quarter.

Two flavors matter: lattice for tables with ruling lines, stream for whitespace-aligned columns. European invoices often need experimentation between them when VAT rows use merged cells.

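import camelot

# Lattice mode looks for ruling lines; use flavor="stream" for whitespace-aligned tables
tables = camelot.read_pdf("report.pdf", pages="1-3", flavor="lattice")
print(f"Total tables detected: {tables.n}")
if tables.n:
    tables[0].to_excel("extracted_table.xlsx")
    print(tables[0].df.head())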

Inspect tables[0].parsing_report for the accuracy and whitespace scores. Low accuracy or a high whitespace percentage signals that you should try flavor="stream" or adjust table_areas. Camelot’s TableList behaves like a sequence: iterate all tables, export each to its own sheet, or merge in pandas after normalising column names.
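One way to act on the parsing report automatically is a lattice-then-stream fallback. This is only a sketch: the helper names and the thresholds (90 for accuracy, 40 for whitespace) are assumptions chosen for illustration, not values recommended by Camelot, so calibrate them against your own documents.

```python
def needs_retry(report, min_accuracy=90.0, max_whitespace=40.0):
    """Return True when a Camelot parsing report looks weak enough to try other settings."""
    return report["accuracy"] < min_accuracy or report["whitespace"] > max_whitespace


def read_with_fallback(pdf_path, pages="1"):
    """Try lattice first; fall back to stream if no table is found or any table looks weak."""
    import camelot  # lazy import: requires camelot-py and its system dependencies

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    if tables.n == 0 or any(needs_retry(t.parsing_report) for t in tables):
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")
    return tables
```

The report is a plain dictionary, so the threshold check can be unit-tested without a PDF, for example needs_retry({"accuracy": 99.0, "whitespace": 12.0}) returns False.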

Performance: Camelot is slower than Tabula on large PDFs because it analyses page geometry deeply. Run page ranges during development, then batch full documents overnight.

Production note: treat Camelot output as draft data. Validate totals (sum of line amounts vs invoice total) before posting to ERP systems — the same reconciliation you would do after manual entry.
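A reconciliation check along those lines might look like the sketch below. The function names, the one-cent tolerance, and the decimal_comma switch for European number formats are assumptions for this example; it also expects cells that contain only digits and separators, so strip currency symbols beforehand.

```python
from decimal import Decimal


def parse_amount(text, decimal_comma=False):
    """Convert a cell like '1,234.56' (or '1.234,56' with decimal_comma=True) to Decimal."""
    cleaned = text.strip().replace(" ", "")
    if decimal_comma:
        # European style: dots group thousands, the comma marks decimals
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


def totals_match(line_amounts, invoice_total, tolerance=Decimal("0.01"), decimal_comma=False):
    """Check that the extracted line amounts add up to the stated invoice total."""
    line_sum = sum((parse_amount(a, decimal_comma) for a in line_amounts), Decimal("0"))
    return abs(line_sum - parse_amount(invoice_total, decimal_comma)) <= tolerance
```

Decimal is used instead of float so that sums of currency values compare exactly; a document that fails the check can be routed to human review rather than posted.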

Method 3: Extract tables with PDFPlumber

PDFPlumber builds on pdfminer.six and exposes words, chars, rectangles, and tables per page. Developers who need both paragraph text and adjacent tables on mixed layouts often standardise on PDFPlumber because one library handles both without switching contexts.

extract_tables() returns nested lists of strings (rows of cells). Convert to pandas when you need analytics:

import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page_num, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

For Excel export, wrap each table:

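import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        for idx, table in enumerate(page.extract_tables(), start=1):
            if not table:
                continue
            # Treat the first row as the header and the rest as data rows
            df = pd.DataFrame(table[1:], columns=table[0])
            df.to_excel(f"page{page_num}_table{idx}.xlsx", index=False)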

Tuning: PDFPlumber accepts table_settings — vertical strategy, explicit column boundaries, snap tolerance. Borderless financial tables often need "vertical_strategy": "text" or manual explicit_vertical_lines derived from word x-coordinates.
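To make the explicit_vertical_lines idea concrete, here is a minimal sketch that derives column boundaries from word x-coordinates. The helper names and the 8-point gap threshold are assumptions for this example, and it presumes the page (or a cropped region of it) contains only the table, since a full-width title would bridge the gaps between columns.

```python
def vertical_lines_from_words(words, min_gap=8.0):
    """Project word boxes onto the x-axis and place a line in each empty gap wider than `min_gap`."""
    spans = sorted((w["x0"], w["x1"]) for w in words)
    if not spans:
        return []
    lines = [spans[0][0]]  # left edge of the table
    right = spans[0][1]    # rightmost ink seen so far
    for x0, x1 in spans[1:]:
        if x0 - right > min_gap:
            # Empty vertical band: treat its midpoint as a column separator
            lines.append((right + x0) / 2)
        right = max(right, x1)
    lines.append(right)    # right edge of the table
    return lines


def extract_borderless_tables(pdf_path, page_index=0):
    """Extract tables from one page using column lines inferred from word positions."""
    import pdfplumber  # lazy import: requires pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        lines = vertical_lines_from_words(page.extract_words())
        if len(lines) < 2:
            return []
        settings = {
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": lines,
            "horizontal_strategy": "text",
        }
        return page.extract_tables(table_settings=settings)
```

Words inside the same cell sit only a space apart, so they merge into one span, while the wider whitespace between columns produces a separator line.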

Strength: access to bounding boxes for each cell’s text, enabling custom rules (“amount must be right-aligned in the last column”). Weakness: no built-in OCR; scanned pages return empty or garbage rows until you preprocess images.

Handling scanned PDFs (OCR + Python)

Tabula, Camelot, and PDFPlumber assume a text layer — vector instructions that place characters at coordinates. Scanned PDFs are images embedded per page; copy-paste yields nothing useful, and table libraries return empty results or misaligned junk.
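Before choosing a pipeline, a script can check for a text layer itself. The sketch below uses pdfplumber for the check; the function names, the 20-character threshold, and the three-page sample are assumptions for this example. Note that a scan with a poor embedded OCR layer would still pass, so treat the result as a first filter.

```python
def has_text_layer(page_texts, min_chars=20):
    """True when at least one page yields a meaningful amount of extractable text."""
    return any(len((text or "").strip()) >= min_chars for text in page_texts)


def pdf_has_text_layer(pdf_path, max_pages=3):
    """Sample the first pages of a PDF and report whether they contain selectable text."""
    import pdfplumber  # lazy import: requires pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages[:max_pages]]
    return has_text_layer(texts)
```

Files that return False go to the OCR route; everything else can be handed to Tabula, Camelot, or PDFPlumber directly.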

The standard Python pipeline is:

1. Convert each PDF page to a high-resolution image (often 300 DPI).
2. Run OCR to recover text and optionally bounding boxes.
3. Reconstruct tables with heuristics, specialised models, or AI extraction.

Our guide "What is OCR?" explains recognition engines in plain language. For table-specific work, Tesseract via pytesseract is a common open-source starting point: not perfect on dense grids, but free and scriptable.

Example: Tesseract with pdf2image

Install Tesseract on the OS, then pip install pytesseract pdf2image pillow pandas. Poppler is required for pdf2image on Linux (poppler-utils package).

from pdf2image import convert_from_path
import pytesseract
import pandas as pd

pdf_path = "scanned_statement.pdf"
pages = convert_from_path(pdf_path, dpi=300)

all_text = []
for i, page_image in enumerate(pages):
    text = pytesseract.image_to_string(page_image, lang="eng")
    all_text.append({"page": i + 1, "text": text})
    print(f"--- Page {i + 1} ---")
    print(text[:500])

# Optional: layout-aware OCR returns word boxes for custom table logic
data = pytesseract.image_to_data(pages[0], output_type=pytesseract.Output.DICT)
df_boxes = pd.DataFrame(data)
df_boxes = df_boxes[df_boxes["text"].str.strip() != ""]

Raw OCR text rarely arrives as clean CSV. Teams cluster words by y-coordinate to infer rows, split on column gaps by x-coordinate, or send the page image to document-AI APIs. Expect iteration: scanned tables are why many projects abandon pure open-source stacks after the first production batch.
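The row-clustering step can be sketched in a few lines. The function name rows_from_boxes and the 10-pixel tolerance are assumptions for this example; the left, top, and text keys match the fields pytesseract returns from image_to_data.

```python
def rows_from_boxes(boxes, row_tolerance=10):
    """Group OCR word boxes into rows by `top`, then order each row by `left`."""
    rows = []  # each entry: [top of the row's first box, list of boxes in the row]
    for box in sorted(boxes, key=lambda b: b["top"]):
        if rows and abs(box["top"] - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(box)
        else:
            rows.append([box["top"], [box]])
    return [
        [b["text"] for b in sorted(members, key=lambda b: b["left"])]
        for _, members in rows
    ]
```

With the DataFrame from the Tesseract example, rows_from_boxes(df_boxes.to_dict("records")) would produce one list of words per visual line. Each word is still its own item, so multi-word cells need a further pass that merges neighbours separated by small x-gaps.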

Alternative: Inputo (OCR + extraction without code)

If maintaining Tesseract languages, deskewing scans, and hand-tuning column detectors is not your core product, use a platform that bundles OCR, layout analysis, and export. Inputo processes invoices, bank statements, payslips, and mixed PDFs in seven European languages, then delivers Excel or CSV — the same outcome as a Python pipeline without JVM dependencies or per-template scripts.

Upload a scanned invoice at inputo.app/app or try the public PDF to Excel converter for quick table exports. For invoice-specific field mapping (vendor, tax ID, totals), see AI invoice processing and compare with full-document AI PDF extraction.

Have a scanned PDF right now? Upload it — Inputo runs OCR and table detection automatically.


From Python to production

A Jupyter notebook proves feasibility; production asks for reliability, security, and ownership. Three deployment patterns dominate.

1. Local scripts and scheduled jobs

Finance analysts run Tabula on a shared drive nightly. Cron or Task Scheduler invokes Python; output lands in a network folder. Pros: low cost, full control. Cons: fragile when PDF formats change, Java/Ghostscript drift on servers, no central audit trail.

2. Custom APIs and microservices

Engineering wraps Camelot or PDFPlumber in FastAPI, stores files in object storage, returns JSON or Excel links. Pros: integrates with your stack. Cons: you own OCR for scans, monitoring, scaling, and PII compliance (encryption, retention, access logs).

3. Document AI platforms

Vendors — including Inputo — host ingestion, OCR, table detection, optional LLM field mapping, and export connectors. Pros: faster time-to-value, handles scans and multilingual layouts, built-in review UI. Cons: per-document pricing and vendor dependency — acceptable when manual entry cost dominates.

When Python alone is enough: high volume of homogeneous, digital PDFs; in-house Python skills; tolerance for occasional manual fixes; no strict regulatory requirement for vendor SOC reports.

When to add Inputo or similar: mixed scans and digital files; European payroll or invoice schemas; need for human review before ERP import; small team without time to maintain Java, Ghostscript, and Tesseract across environments.

Hybrid architectures are common: Python orchestrates file pickup, calls Inputo’s API or app for hard documents, and runs pandas validation on everything before loading the warehouse. That preserves code you already trust while outsourcing the long tail of ugly PDFs.

Security checklist for any path: encrypt files in transit, delete after processing, restrict who can download exports, and document retention — especially for payroll and banking PDFs subject to GDPR.

Choosing the right approach (quick reference)

Digital PDF, simple grid → start with tabula-py.
Ruled tables, need QA plots → Camelot lattice.
Borderless or text-heavy pages → PDFPlumber with tuned settings.
Need only raw text fast → PyMuPDF, then optional table pass.
Scanned or photo PDF → OCR first, or Inputo / AI extraction.
Invoice line items + header fields → table extraction plus invoice parser logic.

Frequently asked questions

Can Python extract tables from PDF?

Yes. Libraries like tabula-py, Camelot, and PDFPlumber are built for born-digital PDFs with real text layers. They return structures you can load into pandas and export to Excel. Scanned documents need an OCR step first; Python can orchestrate that with pytesseract or delegate to a cloud document service.

Which Python library is best for PDF table extraction?

It depends on layout. tabula-py is the fastest starting point for clear digital tables. Camelot offers higher control and accuracy reporting on ruled grids. PDFPlumber suits mixed pages where you need words and tables together. Benchmark all three on a sample of your actual files before standardising.

How do I extract tables from scanned PDFs in Python?

Convert pages to images, run Tesseract or another OCR engine, then build rows from word positions or use AI layout models. None of the three main table libraries read scans reliably out of the box. For production scan volume, consider Inputo or an API that combines OCR and table detection.

Can I export extracted tables to Excel with Python?

Yes. pandas DataFrame.to_excel() works with tabula and Camelot output. PDFPlumber lists convert with pd.DataFrame(table[1:], columns=table[0]) when the first row is headers. Install openpyxl or xlsxwriter as the Excel engine.
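For several tables in one workbook, a sketch like the following writes one sheet per table. The helper names are assumptions for this example; the sanitising step exists because Excel limits sheet names to 31 characters, forbids the characters : \ / ? * [ ], and treats names as case-insensitive.

```python
import re


def safe_sheet_name(name, used):
    """Return a unique sheet name within Excel's 31-character limit, without forbidden characters."""
    base = re.sub(r"[\[\]:*?/\\]", "_", name)[:31] or "Sheet"
    candidate, n = base, 1
    while candidate.lower() in used:
        suffix = f"_{n}"
        candidate = base[: 31 - len(suffix)] + suffix
        n += 1
    used.add(candidate.lower())
    return candidate


def write_tables_to_workbook(tables, path):
    """Write each (name, DataFrame) pair to its own sheet in one .xlsx file."""
    import pandas as pd  # lazy import: requires pandas and openpyxl

    used = set()
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for name, df in tables:
            df.to_excel(writer, sheet_name=safe_sheet_name(name, used), index=False)
```

A call such as write_tables_to_workbook([("page1_table1", df1), ("page1_table2", df2)], "tables.xlsx") keeps all tables from one document in a single file, which is often easier to review than one file per table.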

Do I need OCR for PDF table extraction?

Only if the PDF is image-based. Test by selecting text in a viewer: if selection works, skip OCR. If the document is a scan or photo export, OCR is mandatory before any table algorithm will succeed.

What if my PDF has complex merged cells?

Try Camelot lattice, PDFPlumber custom vertical lines, or manual cleanup after export. Merged headers and rowspan cells are the hardest case for rule-based tools. AI extractors infer structure from visual context and often outperform regex on irregular government and payroll forms.

Conclusion

Python gives you a complete toolkit to extract table data from PDF files: Tabula for quick DataFrames, Camelot for accuracy-sensitive grids, PDFPlumber for fine control, and OCR when scans enter the mix. The code in this guide is enough to automate a folder of digital invoices or statements today.

When documents diversify — languages, scans, merged cells, payroll schemas — maintenance cost rises. That is when teams pair scripts with the PDF to Excel converter for ad-hoc files or the full Inputo app for OCR, AI extraction, and payroll-ready exports without maintaining JVM and Tesseract stacks.

Start with the library that matches your best-looking PDF. Measure accuracy on real volume. Promote only what survives contact with production data.

Prefer zero code? Upload your PDF — Inputo extracts tables and exports Excel in minutes.

