Synthetic Dataset Generator

This repository contains code for generating synthetic datasets, with configurable sample sizes and per-variable parameters, for building ML test sets.

Repository Structure

- configs/: sample dataset config files
- trends_and_distributions/: submodules for math functions
- sdg.py: main Python runner and CLI
- sdg_multi.py: main Python runner and CLI for multivariate generation

Installation

Clone the repository and make it your working directory.
Create a virtual environment for this project to manage dependencies:
- python3 -m venv .venv
  - .venv is a common directory name, per docs.python.org/3/tutorial/venv.html.
  - Dropping the leading '.' makes the directory visible. Any name works, so pick a more descriptive one if, for example, your shell displays the venv name or you keep venvs for multiple projects.
Activate the virtual environment:
- source .venv/bin/activate
Finally, install the dependencies:
- (Optional) Verify the Python version: python --version
- (Optional) Verify the pip version: pip --version
  - The expected versions of both are listed in the requirements file
- pip install -r requirements.txt

Configuration Examples

See quant_with_sex_and_race.yaml for a dataset with one quantitative variable distributed by ADRD, sex, and race.
See quants_with_sex_and_race.yaml for a dataset with multiple variables distributed by ADRD, sex, and race.

Configuration Walkthrough

The generator depends on a YAML file which describes the subsamples and variables to include in the synthetic dataset.

1. Define the dataset-level parameters, including the list of variable names and, optionally, any demographics.

Dataset Parameters

key	type	notes	Optional
results_file	String	Name to give saved dataset file
n_adrd	int	ADRD subsample size
n_not_adrd	int	Not-ADRD subsample size
quarter_count	int	Number of quarters to generate data for
variables	list of strings	Names of variables to add to dataset
demographics	list of strings	Demographics to distribute across	X
insert_missing_values_pct	float	percent of data to set missing at random
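
As a rough illustration, the dataset-level parameters in the table above can be pictured as the Python dictionary below, which mirrors what the YAML file might contain once loaded. Every value (file name, sample sizes, variable and demographic names) is a made-up placeholder, whether the percentage is written as 5.0 or 0.05 is not stated in this README, and `check_dataset_parameters` is a hypothetical helper rather than the repository's validation code.

```python
# Hypothetical dataset-level parameters, mirroring the keys in the table above.
# Every value here is an illustrative placeholder.
dataset_parameters = {
    "results_file": "synthetic_dataset.csv",
    "n_adrd": 500,
    "n_not_adrd": 1500,
    "quarter_count": 8,
    "variables": ["var_a", "var_b"],
    "demographics": ["sex", "race"],  # the only key marked optional
    "insert_missing_values_pct": 5.0,  # assumed scale; the README does not say
}

# Expected Python type for each key after the YAML is loaded.
EXPECTED_TYPES = {
    "results_file": str,
    "n_adrd": int,
    "n_not_adrd": int,
    "quarter_count": int,
    "variables": list,
    "demographics": list,
    "insert_missing_values_pct": (int, float),
}
OPTIONAL_KEYS = {"demographics"}


def check_dataset_parameters(params):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in params:
            if key not in OPTIONAL_KEYS:
                problems.append(f"missing required key: {key}")
            continue
        if not isinstance(params[key], expected):
            problems.append(f"wrong type for {key}: {type(params[key]).__name__}")
    return problems


print(check_dataset_parameters(dataset_parameters))
```

An empty list means every required key is present with the expected type; the schema in validation_utils.py remains the authoritative definition.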

Adding a variable for the first time:

Update 'VAR_TO_NAME_MAP' in 'maps.py': add a key matching the variable name used in the config, with the string to print for that variable as its value.
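
A minimal sketch of what such an entry might look like. The variable keys and display strings are invented for illustration, and the real contents of 'maps.py' may be organized differently.

```python
# Hypothetical excerpt of maps.py: config variable name -> string to print.
VAR_TO_NAME_MAP = {
    "var_a": "Variable A",      # placeholder for an existing entry
    "new_var": "New Variable",  # entry added for a variable used for the first time
}

# The config refers to the key; printed output uses the mapped string.
print(VAR_TO_NAME_MAP["new_var"])
```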

2. For EACH variable:

Add a key named after the variable, whose value is a dictionary with the keys 'adrd_sample' and 'not_adrd_sample'
For EACH sample, define the following:

key	values	notes
distribution_function	['gamma', 'weibull']	function describing variable's initial distribution
distribution_parameters	{"a" : float, "scale" : float}
trend_function	['linear', 'poly', 'exp']	function describing variable's trend in the sample
trend_parameters	varies by trend
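
To make the table concrete, the sketch below draws initial values from the configured distribution and applies a linear trend across quarters, using only the Python standard library. It is not the repository's implementation: the `slope` key under `trend_parameters`, the additive form of the trend, the treatment of "a" as the Weibull shape, and both helper names are assumptions, since this README only says that trend parameters vary by trend.

```python
import random

# Hypothetical settings for one sample of one variable, using the table's keys.
sample_config = {
    "distribution_function": "gamma",
    "distribution_parameters": {"a": 2.0, "scale": 1.5},
    "trend_function": "linear",
    "trend_parameters": {"slope": 0.1},  # assumed key name
}


def draw_initial_values(config, n, rng):
    """Draw n starting values from the configured distribution."""
    params = config["distribution_parameters"]
    name = config["distribution_function"]
    if name == "gamma":
        # random.gammavariate takes (shape, scale).
        return [rng.gammavariate(params["a"], params["scale"]) for _ in range(n)]
    if name == "weibull":
        # random.weibullvariate takes (scale, shape); "a" is assumed to be the shape.
        return [rng.weibullvariate(params["scale"], params["a"]) for _ in range(n)]
    raise ValueError(f"unsupported distribution_function: {name}")


def apply_linear_trend(initial_values, quarter_count, slope):
    """Return one row per subject and one column per quarter."""
    return [
        [value + slope * quarter for quarter in range(quarter_count)]
        for value in initial_values
    ]


rng = random.Random(0)  # seeded so repeated runs give the same numbers
initial = draw_initial_values(sample_config, n=3, rng=rng)
series = apply_linear_trend(
    initial, quarter_count=4, slope=sample_config["trend_parameters"]["slope"]
)
for row in series:
    print([round(v, 3) for v in row])
```

Each printed row is one synthetic subject, and each column is one quarter; consecutive quarters differ by the assumed slope.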

3. For EACH demographic:

Add a key named after the demographic, whose value is a dictionary
In that dictionary, add the key 'subcategories' with a list of strings naming the demographic's subcategories
For EACH subcategory, define the sample's characteristics according to the table above.
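
The nesting described in this step might look like the following once the YAML is loaded into Python. The subcategory labels, the numeric values, and the choice to nest per-subcategory settings directly under the demographic key are all assumptions made for illustration, and both helpers are hypothetical; the sample configs in configs/ are the authoritative reference.

```python
def sample_settings(shape, scale):
    """Build sample characteristics using the keys from the per-variable table."""
    return {
        "distribution_function": "gamma",
        "distribution_parameters": {"a": shape, "scale": scale},
        "trend_function": "linear",
        "trend_parameters": {},  # left empty here: contents vary by trend
    }


# Hypothetical layout of one demographic entry (placeholder labels and values).
demographic_config = {
    "sex": {
        "subcategories": ["female", "male"],
        "female": sample_settings(2.0, 1.0),
        "male": sample_settings(2.5, 1.2),
    }
}


def missing_subcategories(config):
    """List (demographic, subcategory) pairs that have no settings defined."""
    return [
        (name, sub)
        for name, entry in config.items()
        for sub in entry["subcategories"]
        if sub not in entry
    ]


print(missing_subcategories(demographic_config))
```

A check like this catches a subcategory that is listed under 'subcategories' but never given its own settings.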

Parameter Validation

The "variables" and "demographics" entries are used to automatically extend the schemas above (To-Do) according to the Cartesian product of the demographic variables.
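
The Cartesian product mentioned here can be sketched with `itertools.product`. The demographic names and subcategory labels are placeholders, and how the generator actually maps each combination onto the schema is not shown in this README.

```python
from itertools import product

# Placeholder demographics and subcategories.
demographics = {
    "sex": ["female", "male"],
    "race": ["group_1", "group_2", "group_3"],
}

# One tuple per combination of subcategories, e.g. ("female", "group_1").
combinations = list(product(*demographics.values()))
print(len(combinations))  # 2 * 3 = 6

for combo in combinations:
    # Pair each value with the demographic it belongs to.
    print(dict(zip(demographics.keys(), combo)))
```

The number of combinations is the product of the subcategory counts, so it grows quickly as demographics are added.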

See validation_utils.py for the template schema validated against.
See processing_utils.py for the parsing/ingestion code.
See ../utilities/validation_utils.py for the most up to date version of the TEMPLATE_SCHEMA

Usage

Copy the config file and update it to desired generation settings.
Run python3 generator.py -c CONFIG
- e.g. python3 generator.py -c configs/quant_variables.yaml
- Add the --debug flag to print helpful intermediate output: python3 generator.py -c config.yaml --debug
- (Optional, temporary for Lisa) Add --demo_model to run basic Random Forest Classifier on combined generated dataset
  - python3 generator.py -c generation_configs/sample_generation.yaml --demo_model
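
For readers unfamiliar with how flags like these are typically wired up, here is a minimal `argparse` sketch that accepts the same three options. It is only an illustration: the long option name `--config`, the help strings, and `build_parser` are assumptions, and the repository's actual parser may differ.

```python
import argparse


def build_parser():
    """Build a parser accepting the flags shown in the usage lines above."""
    parser = argparse.ArgumentParser(
        description="Generate a synthetic dataset from a YAML config."
    )
    # "--config" is an assumed long form; the README only shows "-c".
    parser.add_argument("-c", "--config", required=True, help="path to the YAML config")
    parser.add_argument("--debug", action="store_true", help="print intermediate output")
    parser.add_argument("--demo_model", action="store_true", help="run the demo classifier")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-c", "configs/quant_variables.yaml", "--debug"])
print(args.config, args.debug, args.demo_model)
```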
