This repo extracts and transforms data used for the NYC Climate Dashboard.
Data pipelines run automatically on a schedule to fetch new data and save the most up-to-date summary data to the Summary Data folder.
All data included here are from public, open sources.
[table tk]
`run_extractors.py` is the main entrypoint. Calling this script or running its `run_all()` function runs all extract-transform pipelines, saves the summary data, and returns all the summary data.
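For example, a programmatic call might look like the minimal sketch below; the exact signature and return type of `run_all()` are assumptions, since the notes above only say it returns all the summary data.

```python
# Minimal sketch of calling the entrypoint from Python instead of the CLI.
# Assumes run_all() takes no arguments and returns the summary data
# (its exact signature and return type are assumptions).
from run_extractors import run_all

summaries = run_all()  # runs every pipeline and saves the summary files
```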
The `pipelines` directory contains the individual processes, each of which extracts data from a single source. In general, each pipeline runs a SQL query against an NYC OpenData dataset, normalizes or reshapes the result as needed, saves the summary to a file, and returns the summary data as a DataFrame. For data sources that feed multiple summaries (such as the greenhouse gas inventory), the pipeline runs multiple queries and produces multiple outputs.
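As an illustration only, a single pipeline might be shaped roughly like the sketch below. The dataset id, environment variable name, output path, and direct use of `requests` are all assumptions, not this repo's actual code (the real pipelines use shared helpers from `climate_dash_tools`).

```python
# Hypothetical sketch of the general pipeline shape described above.
# The dataset id, APP_TOKEN env var name, and output path are placeholders.
import os

import pandas as pd
import requests

DATASET_ID = "xxxx-xxxx"  # placeholder Socrata dataset id

def run() -> pd.DataFrame:
    # Query the NYC OpenData (Socrata) API with a SoQL query
    resp = requests.get(
        f"https://data.cityofnewyork.us/resource/{DATASET_ID}.json",
        params={"$query": "SELECT * LIMIT 50000"},
        headers={"X-App-Token": os.environ.get("APP_TOKEN", "")},
    )
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())
    # ... normalize / reshape the result as needed ...
    df.to_csv("Summary Data/example_summary.csv", index=False)
    return df

if __name__ == "__main__":
    run()
```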
The `climate_dash_tools` module includes functions used for setup and querying.
The data extract pipelines run automatically on GitHub Actions once a month; `.github/workflows/action.yaml` sets up the automation.
NYC OpenData requires an app token to process large queries. For GitHub Actions automation, this repo stores an app token in its repo secrets. If you create a copy of this repo, add your own token following step 3 below.
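For local runs, the token presumably reaches the process environment via `.env` (see steps 4 and 5 below). Here is a sketch of one common way to do that; whether this repo actually uses the python-dotenv package, and the `APP_TOKEN` variable name, are both assumptions.

```python
# Sketch of reading the app token from .env; the use of python-dotenv
# and the APP_TOKEN variable name are assumptions, not confirmed here.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ
app_token = os.environ["APP_TOKEN"]
```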
The (few) dependencies are specified in `pyproject.toml`. This project uses `uv` to manage them:

1. Install `uv`. When you use `uv run` to run modules and scripts (see steps 6 and 7 below), `uv` will automatically build a virtual environment with the necessary dependencies and run the program in this environment. (Alternatively, use another package manager to install the requirements in `pyproject.toml`.)
2. Create an account on NYC OpenData.
3. Create an app token: navigate your user name ➝ Developer Settings ➝ Create New App Token.
4. Copy `.env.template` as `.env`.
5. Paste your app token into `.env`.
6. Run all extractors with `uv run python -m run_extractors`.
7. Run a single pipeline with `uv run python -m pipelines.ghg_emissions` (or swap in any other pipeline name from `pipelines` for `ghg_emissions` here); see the sketch below for a programmatic equivalent.
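A single pipeline can presumably also be imported and run from Python. A hedged sketch, assuming each pipeline module exposes a `run()`-style function (that function name is an assumption about the modules' interface):

```python
# Hypothetical: import and run one pipeline directly. The run() name is
# an assumption; the DataFrame return value follows the description above.
from pipelines import ghg_emissions

df = ghg_emissions.run()
print(df.head())
```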
These extract-transform steps have been made public so that anyone can help maintain or extend them as source data tables, APIs, etc. change. If the data summaries stop running successfully (or if you want to extend or add to the summaries here), please contribute with a pull request! (Create a local clone, identify and fix the problem or add the new feature, then open a pull request with the fix.)
Use this repo as a model or template for other automated data summarization tasks!
A private copy of this repo includes additional steps to load the data so they can be displayed on the NYC Climate Dashboard.