flowchart-data-extraction

This repo contains the code for extracting structured data from CONSORT flow diagrams in PDF files reporting randomized trials.

Setup

Create environment

conda create -n flow python==3.11.11 -y
conda activate flow

Install

git clone https://github.com/EPPI-Centre/flowchart-data-extraction.git
cd flowchart-data-extraction
pip install -e .

Assign OpenAI API Key

In the root of this repo, create a file called .env and register your OpenAI API key in it as follows:

OPENAI_API_KEY=COPY_AND_PASTE_YOUR_API_KEY_HERE
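To check that the key is picked up before running anything that calls the API, a small standard-library sketch like the one below may help. It is only an illustration: the helper names `load_env_file` and `require_api_key` are hypothetical and not part of this repo, and the repo's own scripts may load .env differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without a KEY=VALUE shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path=".env"):
    """Return the key from the environment or the .env file, else raise (hypothetical helper)."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key or key == "COPY_AND_PASTE_YOUR_API_KEY_HERE":
        raise RuntimeError("OPENAI_API_KEY is missing or still set to the placeholder")
    return key
```

Calling `require_api_key()` from the repo root fails early with a clear message if the placeholder value was never replaced.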

Quickstart

Extract Figures From PDF

Windows (PowerShell):

$Env:OUTPUT_IMAGE_FORMAT = "PNG"
marker --output_dir OUTPUT_DIR INPUT_DIR

Mac/Linux:

export OUTPUT_IMAGE_FORMAT="PNG"
marker --output_dir OUTPUT_DIR INPUT_DIR
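After the marker run, it can be useful to confirm that PNG files were actually written. The sketch below assumes the images land somewhere under OUTPUT_DIR, possibly in per-document subfolders; that layout is an assumption, and `find_extracted_images` is a hypothetical helper, not part of this repo or of marker.

```python
from pathlib import Path


def find_extracted_images(output_dir, suffix=".png"):
    """Return a sorted list of image files found anywhere under output_dir.

    Hypothetical helper: assumes extracted images sit somewhere below
    output_dir and carry the given file extension.
    """
    root = Path(output_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )


if __name__ == "__main__":
    # "OUTPUT_DIR" is the same placeholder used in the marker command above.
    images = find_extracted_images("OUTPUT_DIR")
    print(f"Found {len(images)} PNG file(s)")
```

An empty result usually means either the output directory is wrong or OUTPUT_IMAGE_FORMAT was not set in the shell that ran marker.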

Extract CONSORT From Images Dir

python classify_images_as_flowchart.py

Parse CONSORT From Images Dir

python parse_flowchart_images.py
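The classification and parsing steps above can also be run back to back from one script. This is a minimal sketch, assuming both scripts are started from the repo root and take no command-line arguments, exactly as shown above; `run_pipeline` is a hypothetical wrapper, not something the repo provides.

```python
import subprocess
import sys

# Script names taken from the two steps above, in the order they are run.
PIPELINE_STEPS = [
    "classify_images_as_flowchart.py",
    "parse_flowchart_images.py",
]


def run_pipeline(runner=subprocess.run):
    """Run each step with the current interpreter and stop at the first failure.

    `runner` defaults to subprocess.run; with check=True a non-zero exit code
    raises CalledProcessError, so parsing never starts if classification fails.
    """
    commands = []
    for script in PIPELINE_STEPS:
        command = [sys.executable, script]
        runner(command, check=True)
        commands.append(command)
    return commands


if __name__ == "__main__":
    run_pipeline()
```

Using `sys.executable` keeps both steps inside the active conda environment rather than whichever `python` happens to be first on the PATH.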

Figure extraction tools to test

Pdffigures2 https://github.com/allenai/pdffigures2 (try last, as recall is below 99%)
PDF-Extract-Kit https://github.com/opendatalab/PDF-Extract-Kit
MinerU https://github.com/opendatalab/MinerU (preliminary results look very good)
Figure out Adobe problem.
DeepFigures https://github.com/allenai/deepfigures-open
Marker https://github.com/datalab-to/marker
OLMOCR https://github.com/allenai/olmocr
Surya https://github.com/datalab-to/surya
Docling
