Feat: extract folder structure for datasets (Issue #125) #126

zoidy · 2025-09-05T19:38:42Z

Description

Use Playwright to scrape each dataset's page and extract its folder structure, then recreate that folder structure when downloading files via the API.

See #125

Documentation Update

I have updated README.md and other relevant documentation
No documentation update is needed

Implementation Notes

This method uses playwright.

Create a new folder called browsers in the mamba environment.
Add an environment variable to the environment so that browsers are installed into that directory: `conda env config vars set PLAYWRIGHT_BROWSERS_PATH=<path to envs folder>/envs/rebach/browsers`
Install Playwright into the environment following the docs: `mamba install pytest-playwright`
Install Chromium and its dependencies following the docs: `playwright install --with-deps --no-shell chromium`
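Before running the browser install step, it can help to confirm that the environment variable is actually visible inside the activated environment. The check below is a minimal sketch using only the standard library; the helper name `check_browsers_path` is hypothetical and not part of this PR.

```python
import os
from pathlib import Path


def check_browsers_path(env=None):
    """Return (ok, message) describing whether PLAYWRIGHT_BROWSERS_PATH is usable.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    value = env.get("PLAYWRIGHT_BROWSERS_PATH")
    if not value:
        return False, "PLAYWRIGHT_BROWSERS_PATH is not set"
    if not Path(value).is_dir():
        return False, f"{value} does not exist or is not a directory"
    return True, f"browsers will be installed into {value}"


if __name__ == "__main__":
    ok, message = check_browsers_path()
    print(("OK: " if ok else "Problem: ") + message)
```

Note that variables set with `conda env config vars set` only take effect after the environment is reactivated.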

The folder structure is contained in a <script> tag. The tag's contents are stripped of any JavaScript, leaving a JSON string that is converted to a Python dict via json.loads(). The actual directory information is in a Python dict of the form

`{'<fileid1>': 'path/to/file1', '<fileid2>': 'path/to/file2'}`

where <fileid> is the Figshare file id. If the file is in the root, the value corresponding to the given file id will be an empty string.
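The parsing step described above might look roughly like the sketch below. It is illustrative only: the sample script text, the `window.__DATA__` assignment, and the `"folders"` key are invented stand-ins, since the real page's script layout is not shown here, and `parse_script_payload` is a hypothetical helper name.

```python
import json


def parse_script_payload(script_text):
    """Strip the JavaScript around the outermost {...} literal and parse it as JSON."""
    start = script_text.find("{")
    end = script_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in script text")
    return json.loads(script_text[start:end + 1])


# Hypothetical script contents; the real page's variable and key names may differ.
sample_script = 'window.__DATA__ = {"folders": {"111": "raw/run1", "222": ""}};'

payload = parse_script_payload(sample_script)
folder_map = payload["folders"]  # maps file id -> folder path ("" means root)

for file_id, folder in folder_map.items():
    print(file_id, "->", folder or "<root>")
```

Slicing between the first `{` and the last `}` is a deliberately crude way to drop the surrounding JavaScript; it assumes the script holds a single top-level object literal that is already valid JSON.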

Folder structure is retrieved during the initial fetch for public, non-embargoed items. Embargoed files will download with no folder structure.
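Recreating the folders at download time could be sketched as follows. This assumes the mapping values are folder paths relative to the dataset root, separated by `/`; the function name `build_target_path` and its parameters are hypothetical and do not reflect the actual code in process_articles. A file id missing from the mapping (for example, an embargoed item) falls back to the download root.

```python
import os
import tempfile


def build_target_path(download_root, folder_map, file_id, file_name):
    """Create the file's folder under download_root and return the full file path."""
    # A missing id is treated like a root-level file (empty folder string).
    folder = folder_map.get(str(file_id), "")
    # Drop empty, "." and ".." segments so the path cannot escape download_root.
    parts = [p for p in folder.split("/") if p not in ("", ".", "..")]
    target_dir = os.path.join(download_root, *parts)
    os.makedirs(target_dir, exist_ok=True)
    return os.path.join(target_dir, file_name)


if __name__ == "__main__":
    folder_map = {"111": "raw/run1", "222": ""}
    with tempfile.TemporaryDirectory() as root:
        print(build_target_path(root, folder_map, 111, "data.csv"))
        print(build_target_path(root, folder_map, 222, "README.txt"))
        print(build_target_path(root, folder_map, 333, "embargoed.bin"))
```

Building the path with `os.path.join` rather than string concatenation keeps the separators correct on both POSIX and Windows.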

Fetch folder structure from page proof of concept

86ec769

zoidy linked an issue Sep 5, 2025 that may be closed by this pull request

Ability to capture folder structure of deposits #125

Open

1 task

zoidy added 2 commits September 7, 2025 14:11

Use os.path functions instead of string concat in process_articles

e78c145

Add playwright to requirements.txt and update docs

f6373e8

zoidy changed the title ~~Feat: describe enhancement or feature (Issue #125)~~ Feat: extract folder structure for datasets (Issue #125) Sep 8, 2025

zoidy added 3 commits September 9, 2025 13:53

Move playwright test to tests folder

aab8460

create folder structure for downloaded files

d8a8c84

Folder structure is retrieved during initial fetch for public, non-embargoed items. Embargoed files will download with no folder structure

lint

6d093fe

zoidy marked this pull request as draft October 30, 2025 14:41

2 participants
