Are you conducting academic research with a computational tool that produces a different outcome each time it runs? Then you are probably familiar with the idea of setting a seed to initialize the pseudo-random number generator that causes this variance. Conventional wisdom in the academic literature and in community forums generally advocates setting a single seed so that the same result is generated regardless of who runs a script or where the code is run (assuming identical system dependencies).
In this work we argue the opposite. Computational reproducibility is vital, but it should not come at the expense of assuming that a single scalar estimand is invariant to the seed. Through a large number of simulations, teaching examples, and high-profile replications, we describe and showcase how to analyze and visualize seed variability for the scientific record.
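The idea can be sketched in a few lines: rather than reporting one estimate under one seed, re-run the analysis across many seeds and report the spread. The `run_analysis` function below is a hypothetical stand-in (the mean of 100 standard-normal draws), not one of the repository's analyses.

```python
import random
import statistics

def run_analysis(seed: int) -> float:
    """Hypothetical seed-sensitive analysis: mean of 100 draws from N(0, 1)."""
    rng = random.Random(seed)
    return statistics.fmean(rng.gauss(0, 1) for _ in range(100))

# Re-run the same analysis across many seeds and summarize the variability,
# rather than reporting a single seed's point estimate.
estimates = [run_analysis(seed) for seed in range(50)]
print(f"mean = {statistics.fmean(estimates):.3f}, "
      f"sd across seeds = {statistics.stdev(estimates):.3f}")
```

The same pattern scales to any seeded procedure: hold the data and code fixed, vary only the seed, and treat the resulting distribution of estimates as part of the scientific record.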
- Python 3.11; install requirements with `pip install -r requirements.txt`.
- Most scripts run from `./src`. Example: `python src/build_llms.py` (requires an OpenAI API key at `./keys/openai_api_key.txt`).
- Seeds live in `./assets/seed_list.txt` and are generated via `int.from_bytes(secrets.token_bytes(4), 'big')`, capped at 2147483647.
- Figures and tables are written to `./figures/` and `./tables/`; many scripts emit CSVs alongside.
- `src/`: main analyses (e.g., `build_llms.py`, `build_compound_grf.py`, `build_rw_seeds.py`, `build_schelling.py`, `build_mnist_seeds.py`, `build_needle_seeds.py`).
- `assets/`: seed lists and supporting art (`seed_list.txt`, `ffc_seeds.png`).
- `figures/`, `tables/`: outputs generated by scripts.
- `data/`: small bundled data plus placeholders for external downloads (see below).
- `paper_offline/`: manuscript materials.
`./src/` contains the computational (re-)analyses: Buffon's Needle, Bitcoin random walks, cross-validation sensitivity, the Fragile Families Challenge, `mvprobit` in Stata, and more. A summary notebook for visualizations lives at `src/visualization_notebook.ipynb`. Requirements are listed in `requirements.txt`.
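As a flavor of what seed variability looks like in one of these analyses, here is a minimal Buffon's Needle estimator of pi run across a handful of seeds. This is an independent sketch, not the code in `build_needle_seeds.py`; it assumes needle length equal to the line spacing.

```python
import math
import random

def buffon_pi_estimate(seed: int, n_throws: int = 10_000) -> float:
    """Estimate pi via Buffon's Needle with needle length == line spacing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_throws):
        y = rng.uniform(0, 0.5)              # centre-to-nearest-line distance
        theta = rng.uniform(0, math.pi / 2)  # needle angle
        if y <= 0.5 * math.sin(theta):       # needle crosses a line
            hits += 1
    # P(cross) = 2/pi, so pi is estimated as 2 * n / hits
    return (2 * n_throws) / hits if hits else float('inf')

estimates = [buffon_pi_estimate(seed) for seed in range(20)]
print(f"range across 20 seeds: [{min(estimates):.4f}, {max(estimates):.4f}]")
```

Even with 10,000 throws per run, the estimate typically moves in the second decimal place from one seed to the next, which is exactly the variability the repository's analyses visualize at scale.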
Most of the code is self-contained: it either generates simulated data or pulls data from internet archives. In two specific cases, it is necessary to download the data manually:
- Data to replicate the Fragile Families Challenge results come from the "Replication materials for Measuring the predictability of life outcomes using a scientific mass collaboration" deposit on the Harvard Dataverse, available here.
- Data from the Millennium Cohort Study needed to replicate the paper come from here, but require some pre-processing, as described in Dr. Orben's GitHub repo for that work.
See LICENSE for terms. The datasets above have their own licensing conditions; please respect those accordingly.
We are grateful for the extensive comments made by various people in the course of thinking about this work, not least the members of the Leverhulme Centre for Demographic Science.
If anything fails to run, or you have comments about where seeds apply in your own workstreams, please don't hesitate to raise an issue or otherwise get in contact!
- Exponential Random Graph Models
- Bayesian Factor Analysis (and other Bayesian frameworks more generally)
- A simple k-nearest-neighbours (KNN) example.
