reports

PDF Table Extraction

This project provides a flexible tool to extract tables from PDF files using Camelot.

Quick Start with Dev Container

1. Open this project in VS Code.
2. When prompted, click "Reopen in Container" (or use the Command Palette: "Dev Containers: Reopen in Container").
3. Wait for the container to build (dependencies install automatically via onCreateCommand).
4. If you rebuilt the container, install the package:
```
make install
```
5. Run the CLI tool:
```
extract-tables data/ csv/ --recursive
```

Manual Setup

If you prefer not to use the dev container:

# Clone with submodules
git clone --recurse-submodules <repository-url>

# Or if already cloned, initialize submodules
git submodule update --init --recursive

# Install system dependencies (Ubuntu/Debian)
bash scripts/install-system-deps.sh

# Install Python dependencies and package
bash scripts/install.sh

Usage

The package installs a CLI command extract-tables:

# Basic usage: extract to CSV
extract-tables data/ csv/

# Extract to JSON format
extract-tables data/ output/ --format json

# Use lattice flavor (for PDFs with clear table borders)
extract-tables data/ csv/ --flavor lattice

# Process subdirectories recursively
extract-tables data/ csv/ --recursive

# Validate all PDFs have been processed (for CI)
extract-tables -f csv -r --validate data csv

# See all options
extract-tables --help

Or use the Python module directly:

python -m pdf_table_extractor.extract_tables data/ csv/ --recursive
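
For scripted runs, the same module invocation can be assembled from Python. The following is a minimal sketch and not part of the package: `build_command` and `run_extraction` are hypothetical helpers, and they use only the long flags shown above (`--format`, `--flavor`, `--recursive`, `--validate`).

```python
import subprocess
import sys


def build_command(input_dir, output_dir, fmt="csv", flavor="stream",
                  recursive=False, validate=False):
    """Build the argv list for the module invocation (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "pdf_table_extractor.extract_tables",
        input_dir, output_dir,
        "--format", fmt,
        "--flavor", flavor,
    ]
    if recursive:
        cmd.append("--recursive")
    if validate:
        cmd.append("--validate")
    return cmd


def run_extraction(input_dir, output_dir, **options):
    """Run the extractor; requires the package to be installed.

    check=True raises CalledProcessError on a non-zero exit code.
    """
    return subprocess.run(build_command(input_dir, output_dir, **options),
                          check=True)
```

Calling `run_extraction("data/", "csv/", recursive=True)` would then be equivalent to the command line above, with the format and flavor spelled out explicitly.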

Supported Output Formats

csv: CSV files (one per table)
json: JSON format
excel: Excel spreadsheet (.xlsx)
html: HTML table
markdown: Markdown table
sqlite: SQLite database
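
As a rough illustration of how a format name can map to an output file, the sketch below derives a per-table file name from a PDF path. It is not the package's code: `output_path` is a hypothetical helper, and both the `-table-N` naming scheme and every extension except `.xlsx` (stated above) are assumptions.

```python
from pathlib import Path

# Format name -> file extension. Only ".xlsx" is stated in this README;
# the other extensions are assumed for illustration.
EXTENSIONS = {
    "csv": ".csv",
    "json": ".json",
    "excel": ".xlsx",
    "html": ".html",
    "markdown": ".md",
    "sqlite": ".sqlite",
}


def output_path(pdf_path, output_dir, fmt, table_index):
    """Return an output path for one table (hypothetical naming scheme)."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt!r}")
    stem = Path(pdf_path).stem  # file name without directory or ".pdf"
    return Path(output_dir) / f"{stem}-table-{table_index}{EXTENSIONS[fmt]}"
```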

Camelot Flavors

stream (default): Best for PDFs without clear table borders
lattice: Best for PDFs with visible table lines
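
The flavor is presumably handed through to Camelot's `read_pdf`. A minimal sketch of that call, assuming Camelot is installed: `read_tables` is a hypothetical wrapper, not this package's API. Camelot's own `read_pdf` defaults to `lattice`, so a wrapper that wants `stream` as its default has to pass the flavor explicitly.

```python
VALID_FLAVORS = ("stream", "lattice")


def read_tables(pdf_path, flavor="stream", pages="1"):
    """Read tables from a PDF with the given Camelot flavor (hypothetical wrapper)."""
    if flavor not in VALID_FLAVORS:
        raise ValueError(f"flavor must be one of {VALID_FLAVORS}, got {flavor!r}")
    import camelot  # imported lazily so the check above works without Camelot

    # Returns a camelot TableList; each table exposes a pandas DataFrame as `.df`.
    return camelot.read_pdf(str(pdf_path), flavor=flavor, pages=pages)
```

If `stream` yields merged or misaligned columns on a PDF whose tables have ruled borders, retrying with `flavor="lattice"` is usually the first thing to try.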

CI/CD Validation

To validate that all PDFs have been processed (without actually processing them), use the --validate flag:

extract-tables -f csv -r --validate data csv

This is useful in CI to ensure the extraction has been run before committing. It:

Checks metadata to verify all PDFs are processed
Exits with code 1 if any PDFs are unprocessed
Typically runs in under 1 second, because it does not process any PDFs
Uses the same validation logic as the main tool
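
The check described above can be sketched roughly as follows. This is an illustration, not the tool's implementation: the metadata layout (a JSON file holding a list of processed PDF paths, relative to the input directory) and the names `find_unprocessed` and `main` are assumptions.

```python
import json
import sys
from pathlib import Path


def find_unprocessed(input_dir, metadata_file, recursive=True):
    """Return PDFs under input_dir that the metadata does not list as processed."""
    input_dir = Path(input_dir)
    metadata_file = Path(metadata_file)

    # Assumed metadata format: a JSON list of relative POSIX-style paths.
    processed = set()
    if metadata_file.exists():
        processed = set(json.loads(metadata_file.read_text()))

    pattern = "**/*.pdf" if recursive else "*.pdf"
    pdfs = sorted(p.relative_to(input_dir).as_posix()
                  for p in input_dir.glob(pattern))
    return [p for p in pdfs if p not in processed]


def main(input_dir, metadata_file):
    """Return 1 if any PDF is unprocessed, otherwise 0; no PDF is opened."""
    missing = find_unprocessed(input_dir, metadata_file)
    for path in missing:
        print(f"unprocessed: {path}", file=sys.stderr)
    return 1 if missing else 0
```

Because only file names and a small metadata file are read, a check of this shape stays fast regardless of how large the PDFs are. A script would pass the result to `sys.exit` so that CI sees the non-zero exit code.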

Example Makefile target:

validate:
	extract-tables -f csv -r --validate data csv

See .github/workflows/ci.yml for a GitHub Actions example.
