GU-DataLab/healthinf_synthetic

Synthetic Dataset Generator

This repository contains code to generate synthetic datasets that are large enough, and have enough configurable parameters, to build useful ML test sets.

Repository Structure

       configs/: Sample dataset config files
       trends and distributions/: submodules for math functions
       sdg.py: main Python runner and CLI
       sdg_multi.py: main Python runner and CLI for multivariate generation

Installation

  1. Clone the repository and make it your working directory.

  2. Create a virtual environment for this project to manage dependencies:

    • python3 -m venv .venv
      • .venv is a common choice for directory name according to docs.python.org/3/tutorial/venv.html.

      • Dropping the leading '.' makes the directory visible. You can also pick a more descriptive name, for example if your shell displays the venv name or you keep several venv-based projects.

  3. Activate the virtual environment:

    • source .venv/bin/activate
  4. Finally, install the dependencies:

    • (Optional) Verify the python version: python --version
    • (Optional) Verify the pip version: pip --version
      • The expected versions of both are listed in the requirements file
    • pip install -r requirements.txt

Configuration Examples


Configuration Walkthrough

The generator depends on a YAML file which describes the subsamples and variables to include in the synthetic dataset.

1. Define the dataset-level parameters, including the list of variable names and, optionally, any demographics.

Dataset Parameters

key                        type             notes                                      optional
results_file               string           Name to give the saved dataset file
n_adrd                     int              ADRD subsample size
n_not_adrd                 int              Not-ADRD subsample size
quarter_count              int              Number of quarters to generate data for
variables                  list of strings  Names of variables to add to the dataset
demographics               list of strings  Demographics to distribute across          X
insert_missing_values_pct  float            Percent of data to set missing at random

Adding a variable for the first time:
  • Update 'VAR_TO_NAME_MAP' in 'maps.py': use the variable name from the config as the key and the string to be printed for that variable as the value
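
The dataset-level keys in the table above can be checked before generation starts. The following is a minimal sketch, assuming the YAML file has already been loaded into a Python dict. The helper `check_dataset_params`, the sample values, and the variable name `example_var` are hypothetical and not part of the repository.

```python
# Hypothetical helper, not part of the repository.
# Expected Python type for each required dataset-level key (taken from the table).
REQUIRED_KEYS = {
    "results_file": str,
    "n_adrd": int,
    "n_not_adrd": int,
    "quarter_count": int,
    "variables": list,
    "insert_missing_values_pct": (int, float),  # YAML may parse "5" as an int
}
# 'demographics' is the only key marked optional in the table.
OPTIONAL_KEYS = {"demographics": list}


def check_dataset_params(config):
    """Return a list of problems; an empty list means the keys look fine."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"wrong type for key: {key}")
    for key, expected in OPTIONAL_KEYS.items():
        if key in config and not isinstance(config[key], expected):
            problems.append(f"wrong type for key: {key}")
    return problems


# Illustrative values only; 'example_var' is a made-up variable name.
example = {
    "results_file": "synthetic.csv",
    "n_adrd": 100,
    "n_not_adrd": 400,
    "quarter_count": 8,
    "variables": ["example_var"],
    "insert_missing_values_pct": 5.0,
}

print(check_dataset_params(example))
```

Returning a list of problems rather than raising on the first one lets a user fix every issue in the config in a single pass.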

2. For EACH variable:

  • Add a key using its name and a dictionary with the keys 'adrd_sample' and 'not_adrd_sample'
  • For EACH sample, define the following:
key                      values                        notes
distribution_function    ['gamma', 'weibull']          function describing the variable's initial distribution
distribution_parameters  {"a" : float, "scale" : float}
trend_function           ['linear', 'poly', 'exp']     function describing the variable's trend in the sample
trend_parameters         varies by trend
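
To make the per-sample keys concrete, here is a rough sketch of how one synthetic series could be produced from them using only the Python standard library. It is not the repository's implementation. It assumes that "a" is the shape parameter, that the trend is added to the initial draw, and that the trend parameters are named `slope`, `coefficients`, `scale`, and `rate`; all of these are hypothetical, since the table only says the trend parameters vary by trend.

```python
import math
import random


def draw_initial(distribution_function, a, scale, rng):
    """Draw one initial value; 'a' is assumed to be the shape parameter."""
    if distribution_function == "gamma":
        return rng.gammavariate(a, scale)    # gammavariate(shape, scale)
    if distribution_function == "weibull":
        return rng.weibullvariate(scale, a)  # weibullvariate(scale, shape)
    raise ValueError(f"unknown distribution_function: {distribution_function}")


def trend_offset(trend_function, params, quarter):
    """Offset added at a given quarter; parameter names are hypothetical."""
    if trend_function == "linear":
        return params["slope"] * quarter
    if trend_function == "poly":
        # coefficients[i] multiplies quarter ** (i + 1), so the offset is 0 at quarter 0
        return sum(c * quarter ** (i + 1) for i, c in enumerate(params["coefficients"]))
    if trend_function == "exp":
        return params["scale"] * (math.exp(params["rate"] * quarter) - 1.0)
    raise ValueError(f"unknown trend_function: {trend_function}")


def simulate_series(spec, quarter_count, rng):
    """One value per quarter: an initial draw plus the trend offset."""
    start = draw_initial(
        spec["distribution_function"], rng=rng, **spec["distribution_parameters"]
    )
    return [
        start + trend_offset(spec["trend_function"], spec["trend_parameters"], q)
        for q in range(quarter_count)
    ]


# Illustrative sample definition mirroring the keys in the table.
adrd_sample = {
    "distribution_function": "gamma",
    "distribution_parameters": {"a": 2.0, "scale": 1.5},
    "trend_function": "linear",
    "trend_parameters": {"slope": 0.5},
}
print(simulate_series(adrd_sample, quarter_count=4, rng=random.Random(0)))
```

Note that the standard library orders the two Weibull arguments as scale then shape, the reverse of its gamma function, which is why the calls above differ.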

3. For EACH demographic:

  • Add a key using its name and a dictionary
  • Add the key 'subcategories' with a list of strings naming the subcategories of the demographic
  • For EACH subcategory, define the sample's characteristics according to the above table.

Parameter Validation

The "variables" and "demographics" key/value pairs are used to automatically extend the above schemas (To-Do) according to the Cartesian product of the demographic variables.
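
As an illustration of the Cartesian product mentioned above, the sketch below enumerates every combination of demographic subcategories from a config dict. It is one possible reading of the description, not the repository's validation code. The function name `demographic_combinations` and the demographic names and subcategories are hypothetical.

```python
from itertools import product


def demographic_combinations(config):
    """List every combination of subcategories across the configured demographics."""
    names = config.get("demographics", [])  # 'demographics' is optional
    subcategory_lists = [config[name]["subcategories"] for name in names]
    # product() of no lists yields a single empty combination
    return [dict(zip(names, combo)) for combo in product(*subcategory_lists)]


# Illustrative config fragment; names and subcategories are made up.
config = {
    "demographics": ["sex", "age_group"],
    "sex": {"subcategories": ["female", "male"]},
    "age_group": {"subcategories": ["65-74", "75+"]},
}

for combo in demographic_combinations(config):
    print(combo)
```

With two demographics of two subcategories each, this produces four combinations, each of which would need its own sample definition in an extended schema.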


Usage

  1. Copy the config file and update it to desired generation settings.

  2. Run python3 generator.py -c CONFIG

    • e.g. python3 generator.py -c configs/quant_variables.yaml

    • Add debug flag for helpful intermediate prints: python3 generator.py -c config.yaml --debug

    • (Optional, temporary for Lisa) Add --demo_model to run basic Random Forest Classifier on combined generated dataset

      • python3 generator.py -c generation_configs/sample_generation.yaml --demo_model
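
For readers who want to see how the flags above fit together, here is a small argparse sketch of a command line accepting `-c`, `--debug`, and `--demo_model`. It only mirrors the flags listed in this section; the repository's actual parser may be organized differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Illustrative parser for the flags described in the Usage section."""
    parser = argparse.ArgumentParser(description="Synthetic dataset generator (sketch)")
    parser.add_argument("-c", dest="config", metavar="CONFIG", required=True,
                        help="path to the YAML config file")
    parser.add_argument("--debug", action="store_true",
                        help="print helpful intermediate values")
    parser.add_argument("--demo_model", action="store_true",
                        help="run a basic Random Forest Classifier on the generated dataset")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-c", "configs/quant_variables.yaml", "--debug"])
print(args.config, args.debug, args.demo_model)
```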
