BIDSIFY is a package designed to convert various epilepsy data sources into BIDS-compliant datasets without an internet connection or third-party data hosting. As the push for standardized datasets grows, harmonizing how we collect and store data has become increasingly important.
- Requirements
- Installation
- Supported Data Types
- Usage
- Converting Multiple files
- Data Pipelines
- Sample Commands
- Upcoming Features
BIDSIFY requires Python 3.0+. It was developed for Mac and Linux systems. If used on a Windows machine, path handling may become an issue. This is a known issue that will be fixed in a future release.
BIDSIFY is a purely pythonic means of creating BIDS datasets. It endeavors to do this without the need for an internet connection or data sharing. This allows for more secure BIDS generation that can take place behind clinical firewalls.
Below are two separate methods for installing BIDSIFY to your local workstation. If you are working behind a clinical firewall, and cannot download python packages, we recommend downloading the packages listed within requirements.txt and installing them locally.
- Install Python
- Clone the repo
```
git clone https://github.com/penn-cnt/BIDSIFY
```
- Create the conda environment
By default, the provided conda yaml will set the environment name to bidsify. You can change this in the bidsify.yml file, or provide an additional flag to this command along the lines of:
```
conda env create --file BIDSIFY.yml -n <new-environment-name>
```
- Activate conda environment (Optional)
If you set a new environment name, make sure to change bidsify to the correct environment name.
```
conda activate bidsify
```
- Attach default postprocessors (Optional)
```
conda develop <path-to-cnt-codehub>
```
- Test installation
```
python BIDSIFY.py --example_input
```
- Install Python
- Clone the repo
```
git clone https://github.com/penn-cnt/BIDSIFY
```
- Create the python environment
Your environment will be saved to a folder on the filesystem, so you will need to provide a path to the folder you want to use for storing these python packages.
```
python -m venv <path-to-environment-folder>
```
- Activate environment
If you plan to use this package often, you may want to consider setting an alias to quickly enter this environment. For more on setting up aliases, please refer here.
```
source <path-to-environment-folder>/bin/activate
```
- Install packages
```
pip install -r requirements.txt
```
- Attach default postprocessors (Optional)
If you're working on a Windows machine, or can't otherwise modify your path, you can also add a .pth file containing the path to the codehub to your environment folder's site-packages.
```
export PYTHONPATH="${PYTHONPATH}:<path-to-cnt-codehub>"
```
- Test installation
```
python BIDSIFY.py --example_input
```
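As an alternative to editing PYTHONPATH by hand, the .pth mechanism mentioned above can be scripted. This is a minimal sketch under stated assumptions: the environment's site-packages folder is writable, and the file name `cnt_codehub.pth` and the helper function are illustrative, not part of BIDSIFY.

```python
import sysconfig
from pathlib import Path

def add_pth(codehub_path, site_dir=None):
    """Write a .pth file pointing at the codehub checkout.

    Python appends every line of a .pth file found in site-packages to
    sys.path at interpreter startup, which mimics `conda develop`
    without modifying PYTHONPATH.
    """
    # Default to the active environment's site-packages directory.
    target = Path(site_dir or sysconfig.get_paths()["purelib"])
    pth_file = target / "cnt_codehub.pth"  # illustrative file name
    pth_file.write_text(str(Path(codehub_path).resolve()) + "\n")
    return pth_file
```

The new path takes effect the next time the interpreter starts, with no shell configuration required.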
BIDSIFY supports both time series and imaging data conversion to BIDS format. This is accomplished by creating different back-ends for data ingestion and BIDS path generation that can be called at run-time. We aim to provide the most general coverage with this alpha release, with design choices meant to ensure that future data types can be added as seamlessly as possible.
We currently support the following timeseries data sources:

- EDF (using the `--eeg` flag)
- iEEG.org (using the `--ieeg` flag)
- Pennsieve (using the `--pennsieve` flag)
  - Note: This option is not yet fully implemented while the Pennsieve team works on a Python API.

We currently support the following imaging data sources:

- NIfTI data (using the `--imaging` flag)
The recommended method for adding a new data source is to add a new handler for the data source in the components/public folder. This public facing handler is meant to manage the general flow of data processing. Code responsible for actually reading in timeseries or imaging data, as well as running any postprocessing, is available within the components/internal folder, and can be called by attaching their associated observer method. For more information, we recommend visiting here.
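To make the handler/observer layout described above more concrete, here is a toy sketch. Everything in it except the method name `attach_objects` (mentioned later in this document) is invented for illustration and does not reflect BIDSIFY's actual internals: a public-facing handler manages the flow, and attached observers perform the reading and path-generation steps.

```python
class EDFHandler:
    """Hypothetical public-facing handler for one data source."""

    def __init__(self):
        self._observers = []  # callables run in attachment order

    def attach_objects(self, *observers):
        # Internal reading/postprocessing steps are attached as observers.
        self._observers.extend(observers)

    def workflow(self, path):
        record = {"path": path}
        for observer in self._observers:
            observer(record)  # each step may enrich the shared record
        return record

def read_data(record):
    record["data"] = f"loaded {record['path']}"  # stand-in for a real reader

def make_bids_path(record):
    record["bids_path"] = "sub-001/ses-001/eeg/"  # stand-in for path generation

handler = EDFHandler()
handler.attach_objects(read_data, make_bids_path)
result = handler.workflow("patient.edf")
```

The appeal of this design is that a new data source only needs a new public handler; the internal readers and postprocessors are reused by attaching them as observers.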
There are a number of options inherent to this package, as well as means to streamline BIDS creation using sidecar files (typically .csv tabular data). We explain a few of these concepts and present some examples below.
For a comprehensive list of commands, you should run
python BIDSIFY.py --help
for a detailed help document. For printed examples in terminal, you can also run:
python BIDSIFY.py --print_example
You can download/convert multiple files at once using the `--input_csv` flag. A breakdown of the allowed headers for the input csv file is as follows:
These fields are shared across both timeseries and imaging input_csv files.
- `orig_filename` - Required. The original filepath on your local machine, or the dataset id for iEEG.org/Pennsieve.
- `uid` - Optional. A mapping number used when data is generated by the data team. It is a secret map to a PHI id that is persistent across different datasets.
- `subject_number` - Optional. Subject number to assign to the data. Defaults to 1. Can be entered as a string (i.e. `HUP001`).
- `session_number` - Optional. Session number to assign to the data. Defaults to 1. Can be entered as a string (i.e. `implant01`).
- `run_number` - Optional. Run number to assign to the data. Defaults to 1.
- `target` - Optional. Additional information to keep associated with the dataset in a `*_targets.pickle` file. This could be epilepsy diagnosis, sleep stage, etc.
- `start` - Optional. The start time of the dataset.
  - Note: Required if downloading from iEEG.org without using the annotation clip times.
- `duration` - Optional. The duration of the clip.
  - Note: Required if downloading from iEEG.org without using the annotation clip times.
- `task` - Optional. Task to assign to the data. (i.e. `rest`)
These fields are specific to imaging input_csv files.

- `imaging_data_type` - Data type of the image (i.e. anat/ct/etc.)
- `imaging_scan_type` - Scan type of the image (i.e. MRI/fMRI/etc.)
- `imaging_modality` - Modality of the image (i.e. T1/flair/etc.)
- `imaging_task` - Task of the image (rest/etc.)
- `imaging_acq` - Acquisition type of the image (i.e. axial/sagittal/etc.)
- `imaging_ce` - Contrast enhancement type of the image (i.e. ce-gad/etc.)
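As a quick illustration of the shared fields, the snippet below writes a minimal input csv with Python's standard `csv` module. The file paths, subject labels, and output file name are made up for this sketch; only the header names come from the list above.

```python
import csv

# Hypothetical rows using a subset of the shared input_csv fields.
rows = [
    {"orig_filename": "/data/edf/patient_a.edf", "subject_number": "HUP001",
     "session_number": "implant01", "run_number": 1, "task": "rest"},
    {"orig_filename": "/data/edf/patient_b.edf", "subject_number": "HUP002",
     "session_number": "implant01", "run_number": 1, "task": "rest"},
]

fieldnames = ["orig_filename", "subject_number", "session_number",
              "run_number", "task"]

with open("sample_inputs.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file could then be passed to the converter via `--input_csv sample_inputs.csv`.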
You can find specific examples of various input files here. The provided examples are:

- `sample_edf_inputs.csv` - This sample is for converting individual edf files on your computer into a BIDS-compliant format.
- `sample_edf_inputs_w_targets.csv` - This sample is for converting individual edf files on your computer into a BIDS-compliant format with target data (i.e. epilepsy diagnosis, demographic info, etc.) associated.
- `download_by_annotations.csv` - This sample is used for downloading all of the data within an iEEG.org file according to the annotation layer times.
- `download_by_times.csv` - This sample is used for downloading specific time segments from iEEG.org.
- `sample_nifti_inputs.csv` - This sample is for converting individual NIfTI files on your computer into a BIDS-compliant format.
BIDSIFY can be used to create data pipelines for uniform pre and post processing of your datasets. This is of great importance in providing reproducible datasets, and making sure all data conforms to the relevant standards. It also allows everyone to leverage common tasks performed by other researchers and create the most viable patient cohort.
At present we provide the following preprocessing options:
- anonymize: (Timeseries and Imaging). Check for any PHI information in timeseries datasets or imaging headers. If found, skip this file when creating the final dataset.
- deface: (Imaging). Deface imaging datasets. (Not yet implemented. This is included to showcase how preprocessors can get attached.)
At present we provide the following postprocessing options:
- sleep staging: (Timeseries). Create a sidecar csv file that contains the predicted sleep stage within each 30 second window of the timeseries. Currently uses YASA, and is limited to scalp timeseries that contain CZ, C03, and C04 channels. (Requires the epipy feature package to be within your python path.)
- tokenization: (Timeseries). Tokenize annotations and metadata for each file to create a lookup table that can be queried to make patient cohorts from all data within a dataset. For example, tokens such as sleep, N2, etc. could be queried to get all files in a dataset that match the criteria requested.
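As an illustration of how a tokenized lookup table could be queried to build a cohort, here is a toy sketch. The table contents, column names, and `query` helper are assumptions made for this example, not BIDSIFY's real schema.

```python
# Toy lookup table: each row maps a file to the tokens derived from its
# annotations and metadata.
lookup = [
    {"file": "sub-001_run-01.edf", "tokens": {"sleep", "N2"}},
    {"file": "sub-001_run-02.edf", "tokens": {"wake"}},
    {"file": "sub-002_run-01.edf", "tokens": {"sleep", "N3"}},
]

def query(table, required):
    """Return files whose token set contains every requested token."""
    required = set(required)
    return [row["file"] for row in table if required <= row["tokens"]]

# All files annotated as sleep, regardless of stage.
cohort = query(lookup, ["sleep"])
```

A real lookup table would be persisted on disk, but the query pattern (set containment over tokens) is the same idea.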
New pre and post processors can be added by attaching new observer objects within a data type handler. The typical name for this method is attach_objects. For more information, please refer to the relevant extension point methods found here.
We provide a few sample commands here. Note that all examples use a username and filepaths that you will need to update to reflect your own system and credentials.
```
python BIDSIFY.py --eeg --bids_root --dataset --subject HUP001 --uid_number 1 --session 1 --run 1 --overwrite --target
python BIDSIFY.py --eeg --bids_root --input_csv samples/inputs/sample_edf_inputs_w_target.csv --anonymize
python BIDSIFY.py --imaging --bids_root --dataset --subject_number HUP001 --uid_number 0 --session 001 --run 01 --imaging_data_type anat --imaging_scan_type MR --imaging_modality flair --imaging_task None --imaging_acq ax --imaging_ce None
python BIDSIFY.py --imaging --bids_root --dataset --subject_number HUP001 --uid_number 0 --session 001 --run 01
python BIDSIFY.py --imaging --datalake --bids_root --input_csv samples/inputs/sample_nifti_inputs.csv
python BIDSIFY.py --ieeg --username BJPrager --bids_root --input_csv samples/inputs/download_by_times.csv
python BIDSIFY.py --ieeg --username BJPrager --bids_root --annotations --input_csv samples/inputs/download_by_annotations.csv
```
A few important notes:
- Overwriting data
  - If data with the same BIDS path already exists, BIDSIFY will skip creating new data by default. If you wish to overwrite existing data, use the `--overwrite` flag.
- Data Error Handling
- When working with data that contains errors, the code currently fails to save. Options should be added to either save the data before and after a bad segment or to mask the bad data. This should not be the default behavior, but for projects where large datasets are expected, the ability to excise bad segments should be allowed.
- Imaging BIDS Dataset Information
- At present, the BIDS dataset for imaging data is minimal. The required data is present, but the dataset meta information needs to be expanded.