This repository contains code to generate synthetic datasets at a reasonable volume and with sufficient variable parameters for useful ML test set creation.
configs/: Sample dataset config files
trends and distributions/: submodules for math functions
sdg.py: main Python runner and CLI
sdg_multi.py: main Python runner and CLI for multivariate generation
- Clone the repository and make it your working directory.
- Create a virtual environment for this project to manage dependencies: `python3 -m venv .venv`
  - .venv is a common choice for the directory name, according to docs.python.org/3/tutorial/venv.html.
  - Removing the '.' makes the directory visible, and the name can be changed if you prefer a more descriptive one (for example, if your shell displays the venv name or you use venv across multiple projects).
- Activate the virtual environment: `source .venv/bin/activate`
- Finally, install the dependencies:
  - (Optional) Verify the Python version: `python --version`
  - (Optional) Verify the pip version: `pip --version`
  - Both are listed in the requirements file.

  `pip install -r requirements.txt`
- See quant_with_sex_and_race.yaml for a dataset with one quantitative variable distributed by ADRD, sex, and race.
- See quants_with_sex_and_race.yaml for a dataset with multiple variables distributed by ADRD, sex, and race.
The generator depends on a YAML file which describes the subsamples and variables to include in the synthetic dataset.
1. Define the dataset-level parameters, including the list of variable names and, optionally, any demographics.
| key | type | notes | Optional |
|---|---|---|---|
| results_file | String | Name to give saved dataset file | |
| n_adrd | int | ADRD subsample size | |
| n_not_adrd | int | Not-ADRD subsample size | |
| quarter_count | int | Number of quarters to generate data for | |
| variables | list of strings | Names of variables to add to dataset | |
| demographics | list of strings | Demographics to distribute across | X |
| insert_missing_values_pct | float | Percent of data to set missing at random | |
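
As an illustration of the table above, the dataset-level keys can be written out as a Python dictionary and checked against the listed types. This is only a sketch: the example values, the `check_dataset_config` helper, and the choice to treat `insert_missing_values_pct` as required are assumptions, and the repository's own validation lives in its schema code rather than here.

```python
# Hypothetical example values; only the key names and types come from the table.
dataset_config = {
    "results_file": "synthetic_dataset.csv",
    "n_adrd": 500,
    "n_not_adrd": 1500,
    "quarter_count": 8,
    "variables": ["quant_var"],
    "demographics": ["sex", "race"],      # marked optional in the table
    "insert_missing_values_pct": 5.0,     # scale (0-100 vs 0-1) is an assumption
}

# Expected Python type for each dataset-level key.
EXPECTED_TYPES = {
    "results_file": str,
    "n_adrd": int,
    "n_not_adrd": int,
    "quarter_count": int,
    "variables": list,
    "demographics": list,
    "insert_missing_values_pct": float,
}

# Only "demographics" is marked optional in the table; the rest are
# assumed to be required for the purposes of this sketch.
OPTIONAL_KEYS = {"demographics"}


def check_dataset_config(config):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            if key not in OPTIONAL_KEYS:
                problems.append(f"missing required key: {key}")
            continue
        if not isinstance(config[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems


problems = check_dataset_config(dataset_config)
```

An empty `problems` list means every required key is present with the expected type; leaving out `demographics` is accepted, while leaving out `n_adrd` is reported.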
- Update 'VAR_TO_NAME_MAP' in 'maps.py': add an entry whose key is the variable name used in the config and whose value is the string to be printed for that variable
- Add a key named after the variable, whose value is a dictionary with the keys 'adrd_sample' and 'not_adrd_sample'
- For EACH sample, define the following:
| key | values | notes |
|---|---|---|
| distribution_function | ['gamma', 'weibull'] | function describing variable's initial distribution |
| distribution_parameters | {"a" : float, "scale" : float} | |
| trend_function | ['linear', 'poly', 'exp'] | function describing variable's trend in the sample |
| trend_parameters | varies by trend | |
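
To make the per-sample keys concrete, the sketch below draws one initial value from the configured distribution and then applies a trend over several quarters, using only the Python standard library. It is not the repository's implementation. The reading of "a" as the shape parameter, the trend parameter names (`slope`, `coefficients`, `rate`), and the way each trend combines with the initial value are all assumptions made for illustration.

```python
import math
import random


def draw_initial_value(distribution_function, distribution_parameters, rng):
    """Draw one initial value; "a" is assumed to be the shape parameter."""
    a = distribution_parameters["a"]
    scale = distribution_parameters["scale"]
    if distribution_function == "gamma":
        # random.gammavariate(alpha, beta): alpha is shape, beta is scale.
        return rng.gammavariate(a, scale)
    if distribution_function == "weibull":
        # random.weibullvariate(alpha, beta): alpha is scale, beta is shape.
        return rng.weibullvariate(scale, a)
    raise ValueError(f"unknown distribution_function: {distribution_function}")


def apply_trend(initial_value, quarter, trend_function, trend_parameters):
    """Apply a trend at a given quarter (parameter names are hypothetical)."""
    if trend_function == "linear":
        return initial_value + trend_parameters["slope"] * quarter
    if trend_function == "poly":
        # coefficients[i] multiplies quarter ** (i + 1)
        coefficients = trend_parameters["coefficients"]
        return initial_value + sum(
            c * quarter ** (i + 1) for i, c in enumerate(coefficients)
        )
    if trend_function == "exp":
        return initial_value * math.exp(trend_parameters["rate"] * quarter)
    raise ValueError(f"unknown trend_function: {trend_function}")


# Hypothetical sample definition using the keys from the table.
sample = {
    "distribution_function": "gamma",
    "distribution_parameters": {"a": 2.0, "scale": 1.5},
    "trend_function": "linear",
    "trend_parameters": {"slope": 0.25},
}

rng = random.Random(0)  # seeded so repeated runs give the same draw
initial = draw_initial_value(
    sample["distribution_function"], sample["distribution_parameters"], rng
)
series = [
    apply_trend(initial, q, sample["trend_function"], sample["trend_parameters"])
    for q in range(4)
]
```

Here `series` holds one subject's value for quarters 0 through 3, starting at the drawn initial value and rising by the assumed slope each quarter.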
- Add a key named after the demographic, whose value is a dictionary
- Add the key 'subcategories', set to a list of strings naming the subcategories of the demographic
- For EACH subgroup, define the sample's characteristics according to the above table.
The key/value pairs "variables" and "demographics" are used to automatically extend the above schemas (To-Do) according to the Cartesian product of the demographic variables.
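
The Cartesian product mentioned here can be sketched with `itertools.product`. The subcategory values and the `expand_subgroups` helper below are hypothetical and only show how two demographics would expand into one subgroup per combination; the repository's actual schema extension may differ.

```python
from itertools import product

# Hypothetical demographics and subcategories, for illustration only.
demographics = {
    "sex": ["female", "male"],
    "race": ["group_a", "group_b", "group_c"],
}


def expand_subgroups(demographics):
    """Return one dict per combination of demographic subcategories."""
    names = list(demographics)
    combos = product(*(demographics[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


subgroups = expand_subgroups(demographics)
# Two sex subcategories times three race subcategories gives six subgroups,
# each of which would need its own sample definition.
```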
- See validation_utils.py for the template schema that configs are validated against.
- See processing_utils.py for the parsing/ingestion code.
- See ../utilities/validation_utils.py for the most up-to-date version of the TEMPLATE_SCHEMA.
- Copy the config file and update it to the desired generation settings.
- Run `python3 generator.py -c CONFIG`
  - e.g. `python3 generator.py -c configs/quant_variables.yaml`
- Add the debug flag for helpful intermediate prints: `python3 generator.py -c config.yaml --debug`
- (Optional, temporary for Lisa) Add --demo_model to run a basic Random Forest Classifier on the combined generated dataset: `python3 generator.py -c generation_configs/sample_generation.yaml --demo_model`
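
For readers who want to see how the flags used above fit together, here is a minimal `argparse` sketch. It is not the repository's actual parser: the flag names `-c`, `--debug`, and `--demo_model` come from the commands above, while the long form `--config`, the help strings, and the `build_parser` function are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the usage commands."""
    parser = argparse.ArgumentParser(
        description="Generate a synthetic dataset from a YAML config."
    )
    # "--config" as the long form of "-c" is an assumption.
    parser.add_argument("-c", "--config", required=True,
                        help="Path to the YAML config file")
    parser.add_argument("--debug", action="store_true",
                        help="Print intermediate values while generating")
    parser.add_argument("--demo_model", action="store_true",
                        help="Run a basic classifier on the generated dataset")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["-c", "configs/quant_variables.yaml", "--debug"]
)
```

Both boolean flags default to off, so a plain `-c CONFIG` run produces the dataset without debug prints or the demo model.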