This repo contains the code for extracting structured data from CONSORT flow diagrams in PDF files reporting randomized trials.
conda create -n flow python==3.11.11 -y
conda activate flow
git clone https://github.com/EPPI-Centre/flowchart-data-extraction.git
cd flowchart-data-extraction
pip install -e .

In the root of this repo, you must create a file called .env. In this file, register your OpenAI API key as follows:

OPENAI_API_KEY=COPY_AND_PASTE_YOUR_API_KEY_HERE

Windows (PowerShell):
$Env:OUTPUT_IMAGE_FORMAT = "PNG"
marker --output_dir OUTPUT_DIR INPUT_DIR

Mac/Linux:
export OUTPUT_IMAGE_FORMAT="PNG"
marker --output_dir OUTPUT_DIR INPUT_DIR

python classify_images_as_flowchart.py

python parse_flowchart_images.py

- Pdffigures2 https://github.com/allenai/pdffigures2 (try last, as recall is below 99%)
- PDF-Extract-Kit https://github.com/opendatalab/PDF-Extract-Kit
- MinerU https://github.com/opendatalab/MinerU (preliminary results look very good)
- Figure out Adobe problem.
- DeepFigures https://github.com/allenai/deepfigures-open
- Marker https://github.com/datalab-to/marker
- OLMOCR https://github.com/allenai/olmocr
- Surya https://github.com/datalab-to/surya
- Docling
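
The three pipeline commands given earlier in this README (the Marker conversion, then the two scripts) could also be chained from Python, which avoids the shell-specific syntax for setting OUTPUT_IMAGE_FORMAT. The following is only a minimal sketch, not part of this repo: the function names `build_pipeline_commands` and `run_pipeline` are hypothetical, and it assumes that `marker` is on your PATH, that you run it from the repo root, and that the two scripts take no command-line arguments, as shown above.

```python
import os
import subprocess
import sys


def build_pipeline_commands(input_dir: str, output_dir: str) -> list[list[str]]:
    """Return the three commands from this README, in the order they are run."""
    return [
        ["marker", "--output_dir", output_dir, input_dir],
        # sys.executable keeps the scripts on the same interpreter (the conda env).
        [sys.executable, "classify_images_as_flowchart.py"],
        [sys.executable, "parse_flowchart_images.py"],
    ]


def run_pipeline(input_dir: str, output_dir: str, dry_run: bool = False) -> list[list[str]]:
    """Run (or, with dry_run=True, only print) the pipeline commands."""
    # Copy the current environment and set the image format; this stands in
    # for the `export` / `$Env:` lines and works the same on every platform.
    env = dict(os.environ, OUTPUT_IMAGE_FORMAT="PNG")
    commands = build_pipeline_commands(input_dir, output_dir)
    for command in commands:
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, env=env, check=True)
    return commands


if __name__ == "__main__":
    # Dry run with the README's placeholder directories; nothing is executed.
    run_pipeline("INPUT_DIR", "OUTPUT_DIR", dry_run=True)
```

Replace the placeholder directories and drop `dry_run=True` to actually execute the steps. Stopping at the first failure means the classification and parsing scripts never run on incomplete Marker output.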