Collaborator

@zoidy zoidy commented Sep 5, 2025

Description

Use Playwright to scrape each dataset's page and extract its folder structure, then recreate that folder structure when downloading files via the API.

See #125

Documentation Update

  • I have updated README.md and other relevant documentation
  • No documentation update is needed

Implementation Notes

This method uses Playwright.

  • Create a new folder called browsers in the mamba environment.
  • Set an environment variable in that environment so that browsers are installed into that directory: conda env config vars set PLAYWRIGHT_BROWSERS_PATH=<path to envs folder>/envs/rebach/browsers
  • Install Playwright into the environment, following the docs: mamba install pytest-playwright
  • Install Chromium and its dependencies, following the docs: playwright install --with-deps --no-shell chromium
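
Once the browser is installed, fetching a rendered dataset page could look roughly like the sketch below. It is an illustration using Playwright's sync API, not the code in this PR: the function name fetch_page_html and its timeout default are assumptions, and the channel="chromium" argument reflects my understanding that an install done with --no-shell needs the full Chromium build rather than the headless shell.

```python
def fetch_page_html(url: str, timeout_ms: int = 30_000) -> str:
    """Return the rendered HTML of `url` using headless Chromium.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Imported lazily so this module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Assumption: with a --no-shell install, the "chromium" channel
        # selects the full browser in headless mode.
        browser = p.chromium.launch(channel="chromium", headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            # page.content() returns the full serialized HTML of the page.
            return page.content()
        finally:
            browser.close()
```

Playwright picks up PLAYWRIGHT_BROWSERS_PATH from the environment, so no path needs to be passed in code if the variable was set as described above.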

The data structure is contained in a <script> tag. The tag's contents are stripped of any JavaScript to leave a JSON string, which is converted to a Python dict via json.loads(). The actual directory information is in a Python dict of the form

{'<fileid1>': 'path/to/file1', '<fileid2>': 'path/to/file2'}

where <fileid> is the Figshare file id. If the file is in the root, the value corresponding to the given file id will be an empty string.
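
As a rough illustration of the two steps described above, the sketch below strips JavaScript from around a JSON object and then maps a file id's path value to a local download location. It is not the PR's implementation. The names parse_script_json and local_path are hypothetical, the script is assumed to wrap exactly one JSON object, and each path value is assumed to be the folder that contains the file.

```python
import json
from pathlib import Path, PurePosixPath


def parse_script_json(script_text: str) -> dict:
    """Strip any JavaScript around a single JSON object and parse it.

    Assumption: everything between the first '{' and the last '}' is JSON.
    """
    start = script_text.find("{")
    end = script_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in script text")
    return json.loads(script_text[start : end + 1])


def local_path(download_root: Path, folder: str, filename: str) -> Path:
    """Place `filename` under `folder`; an empty folder means the root."""
    # Drop root and parent markers so a path cannot escape download_root.
    parts = [
        part
        for part in PurePosixPath(folder).parts
        if part not in ("", ".", "..", "/")
    ]
    return download_root.joinpath(*parts, filename)


if __name__ == "__main__":
    # Made-up file ids and paths, for illustration only.
    structure = parse_script_json('window.x = {"123": "a/b", "456": ""};')
    for file_id, folder in structure.items():
        print(file_id, local_path(Path("out"), folder, f"{file_id}.dat"))
```

Filtering out ".." and the root marker is a defensive choice, so that a scraped path cannot place a file outside the download directory.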

@zoidy zoidy linked an issue Sep 5, 2025 that may be closed by this pull request
1 task
@zoidy zoidy changed the title from "Feat: describe enhancement or feature (Issue #125)" to "Feat: extract folder structure for datasets (Issue #125)" Sep 8, 2025
Folder structure is retrieved during the initial fetch for public, non-embargoed items. Embargoed files will download with no folder structure.
@zoidy zoidy marked this pull request as draft October 30, 2025 14:41

Development

Successfully merging this pull request may close these issues.

Ability to capture folder structure of deposits

2 participants