Conversation

@nick-youngblut
Contributor

Summary

This PR adds automatic detection and handling of backed AnnData objects, enabling parallel_differential_expression to process datasets that don't fit in memory. When a backed AnnData is detected, the function automatically switches to a memory-efficient chunked processing strategy.

Motivation

Users with large datasets (millions of cells) often load AnnData objects in backed mode (backed="r") to avoid memory issues. Previously, this would fail with an unhelpful AssertionError. This PR:

  1. Provides a clear error message explaining the issue and solutions
  2. Automatically uses a low-memory chunked approach for backed data
  3. Allows users to opt in to low-memory mode for large in-memory datasets

Changes

New Features

  • Automatic backend selection: Detects backed AnnData and automatically uses chunked processing
  • New parameters:
    • low_memory: bool | None - Force low-memory mode (default: auto-detect)
    • gene_chunk_size: int - Number of genes per chunk (default: 1000)
  • Chunked processing: Processes genes in batches, reducing peak memory from O(n_cells × n_genes) to O(n_cells × chunk_size)
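The chunked strategy described above can be sketched roughly as follows. This is an illustrative sketch, not the actual pdex internals; the function and parameter names (`iter_gene_chunks`, `chunked_group_means`) are hypothetical:

```python
import numpy as np

def iter_gene_chunks(n_genes, gene_chunk_size=1000):
    """Yield slices covering all genes in fixed-size chunks."""
    for start in range(0, n_genes, gene_chunk_size):
        yield slice(start, min(start + gene_chunk_size, n_genes))

def chunked_group_means(X, group_mask, gene_chunk_size=1000):
    """Compute per-gene means for one cell group without materializing
    the full (n_cells x n_genes) matrix at once."""
    n_genes = X.shape[1]
    means = np.empty(n_genes, dtype=np.float64)
    for sl in iter_gene_chunks(n_genes, gene_chunk_size):
        # Only n_cells x chunk_size values are resident at a time;
        # for a backed AnnData, X[:, sl] triggers an on-disk read.
        chunk = np.asarray(X[:, sl])
        means[sl] = chunk[group_mask].mean(axis=0)
    return means
```

This is the source of the O(n_cells × chunk_size) peak-memory bound: each iteration holds one column slab rather than the whole matrix.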

Bug Fixes

  • Fixed `is_log1p` override bug: Changed `if not is_log1p:` to `if is_log1p is None:` to prevent auto-detection from overriding explicit `is_log1p=False`
  • Fixed typo: "Creating shared memory memory matrix" → "Creating shared memory matrix"
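The `is_log1p` fix matters because `False` is falsy in Python, so `if not is_log1p:` cannot distinguish "user said no" from "user said nothing". A minimal sketch of the corrected logic (the helper name `resolve_is_log1p` is illustrative):

```python
def resolve_is_log1p(is_log1p, auto_detected):
    """Return the effective is_log1p setting.

    Buggy version used `if not is_log1p:`, which treats an explicit
    False the same as None and lets auto-detection override the user.
    Fixed version only falls back to auto-detection when the user
    passed None.
    """
    if is_log1p is None:
        return auto_detected
    return is_log1p
```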

Improvements

  • Added input validation for groupby_key and reference parameters with helpful error messages
  • Improved type hints (added typing.Any, fixed return type annotations)
  • Converted docstrings to Google style for consistency
  • Refactored into separate internal functions for better maintainability:
    • _parallel_differential_expression_standard() - original shared-memory implementation
    • _parallel_differential_expression_chunked() - new low-memory implementation

Usage

Automatic backed mode handling

import scanpy as sc
from pdex import parallel_differential_expression

# This now works automatically!
adata = sc.read_h5ad("large_dataset.h5ad", backed="r")
results = parallel_differential_expression(
    adata,
    reference="control",
    groupby_key="perturbation",
)
# Logs: "Detected backed AnnData, using low-memory chunked processing"

Force low-memory mode

# For large in-memory datasets that cause memory pressure
results = parallel_differential_expression(
    adata,
    reference="control",
    groupby_key="perturbation",
    low_memory=True,
    gene_chunk_size=500,  # Lower = less memory, slower
)

Standard usage (unchanged)

# Works exactly as before for in-memory data
results = parallel_differential_expression(
    adata,
    reference="control",
    groupby_key="perturbation",
    num_workers=4,
)

Memory Comparison

Dataset                   Standard Mode   Low-Memory Mode
1M cells × 30K genes      ~240 GB         ~4 GB (chunk_size=1000)
100K cells × 30K genes    ~24 GB          ~400 MB
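These figures are roughly consistent with a dense float64 matrix in standard mode versus a single float32 gene chunk in low-memory mode; a back-of-envelope check (assuming dense storage, which the table appears to):

```python
def dense_gb(n_cells, n_genes, itemsize=8):
    """Approximate size of a dense matrix in GB (itemsize in bytes)."""
    return n_cells * n_genes * itemsize / 1e9

# Standard mode holds the full float64 matrix:
print(dense_gb(1_000_000, 30_000))              # 240.0 GB
# Low-memory mode holds one float32 chunk of 1000 genes:
print(dense_gb(1_000_000, 1_000, itemsize=4))   # 4.0 GB
```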

Breaking Changes

None. The API is fully backward compatible:

  • Existing code continues to work without modification
  • New parameters have sensible defaults
  • Automatic backend selection is transparent to users

…ifferential expression calculations

- Added `_is_backed_array` function to check for backed/HDF5 arrays.
- Introduced `_load_chunk` to handle loading of matrix chunks.
- Updated `parallel_differential_expression` to automatically select processing strategy based on data type.
- Enhanced error handling for backed AnnData in differential expression functions.
- Improved documentation for clarity on parameters and return values.
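The backed-AnnData detection in the bullets above can be approximated with the public anndata API plus a duck-typed fallback. This is a hedged sketch, not the PR's `_is_backed_array`; `adata.isbacked` is a real anndata property, while the fallback heuristic is illustrative:

```python
import numpy as np
from types import SimpleNamespace

def looks_backed(adata):
    """Heuristic check for a backed AnnData (sketch only; the PR's
    _is_backed_array inspects the X array itself)."""
    # Public API: True when the AnnData is backed by an on-disk
    # file (HDF5 or Zarr).
    if getattr(adata, "isbacked", False):
        return True
    # Duck-typed fallback: backed arrays (h5py/zarr datasets) are
    # array-like but not plain numpy ndarrays.
    X = adata.X
    return X is not None and not isinstance(X, np.ndarray) and hasattr(X, "shape")

# Illustrative stand-ins for backed vs. in-memory AnnData objects:
backed_like = SimpleNamespace(isbacked=True, X=None)
inmem_like = SimpleNamespace(isbacked=False, X=np.zeros((3, 3)))
```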
@gemini-code-assist

Summary of Changes

Hello @nick-youngblut, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the parallel_differential_expression function by introducing an automatic low-memory processing mode. This new capability allows the function to handle very large AnnData datasets, especially those loaded in backed mode, by processing genes in memory-efficient chunks. The change ensures that users can analyze large-scale single-cell data without encountering out-of-memory errors, making the tool more robust and scalable.

Highlights

  • Automatic Low-Memory Mode: The parallel_differential_expression function now automatically detects backed AnnData objects and switches to a memory-efficient chunked processing strategy.
  • Large Dataset Support: Enables processing of datasets that previously failed due to memory constraints, particularly for large AnnData objects loaded in backed mode.
  • New Parameters: Introduced low_memory (to force chunked mode) and gene_chunk_size (to control batch size in low-memory mode).
  • Improved Error Handling: Provides clear error messages and solutions when attempting to use backed AnnData with the standard, memory-intensive processing.
  • Code Refactoring: The core logic has been refactored into _parallel_differential_expression_standard (original) and _parallel_differential_expression_chunked (new low-memory implementation), with a new main dispatcher function.
  • Bug Fixes & Improvements: Addressed an is_log1p auto-detection bug, corrected a typo in a logging message, and enhanced input validation and type hinting.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant enhancement by adding an automatic low-memory mode for handling backed AnnData objects. The code is well-structured, with the logic split into a standard shared-memory implementation and a new chunked implementation for low-memory scenarios. A dispatcher function cleanly selects the appropriate backend. The changes also include valuable bug fixes, improved input validation, and more consistent docstrings. My review includes one suggestion to refactor a small piece of duplicated code into a helper function to improve maintainability.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable feature by adding a low-memory, chunked processing mode for backed AnnData objects. The refactoring of the existing implementation into standard and chunked functions improves maintainability. The automatic detection of backed objects is a great usability improvement.

My review focuses on ensuring consistency between the new chunked implementation and the existing standard one, particularly regarding calculation logic and parameter handling. I've identified a few areas for improvement:

  • Correcting an inconsistency in fold-change calculation when clipping is disabled.
  • Fixing a potential TypeError in the mannwhitneyu call within the chunked path.
  • Enhancing the API by exposing the show_progress parameter to the user.

These changes will improve the robustness and consistency of the new low-memory mode. Overall, this is a well-executed and important addition to the library.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable feature: automatic low-memory processing for backed AnnData objects. The refactoring into standard and chunked implementations is well-structured and improves maintainability. The addition of the low_memory and gene_chunk_size parameters provides useful flexibility for users. The bug fixes for is_log1p detection and the typo are also appreciated. My review focuses on improving the robustness of the backed-data detection and ensuring data type consistency across different processing modes. Specifically, I've suggested using the official anndata API for checking backed status, improving the fallback detection logic, and preserving data precision during chunk loading. I've also pointed out a minor inconsistency with a hardcoded constant.

raise ValueError(f"Unknown metric: {metric} :: Expecting: {KNOWN_METRICS}")

# Determine whether to use low-memory mode
is_backed = _is_backed_array(adata.X)


Severity: high

Instead of using the custom _is_backed_array function, it would be more robust and maintainable to use the public anndata API for this check: is_backed = adata.isbacked. The isbacked property is the standard way to check if an AnnData object is backed by an on-disk file and it correctly handles different backing stores like HDF5 and Zarr. The current implementation of _is_backed_array is fragile as it relies on internal type and module names and does not cover all cases (e.g., Zarr).

Suggested change
is_backed = _is_backed_array(adata.X)
is_backed = adata.isbacked

Comment on lines +44 to +52
    # Common backed array types from h5py and anndata
    backed_indicators = [
        "h5py" in module_name,
        "Dataset" in type_name,
        "AnnDataFileManager" in type_name,
        "SparseDataset" in type_name,
        # h5py Dataset has 'id' and 'file' attributes
        hasattr(data, "id") and hasattr(data, "file"),
    ]


Severity: medium

This function for detecting backed arrays is a good idea for safeguarding, but it's not fully robust. It currently only checks for HDF5-backed arrays. AnnData also supports Zarr for backing. To make this check more comprehensive, you should also check for Zarr arrays, for example by checking if 'zarr' in module_name. Also, the docstring and comment should be updated to reflect this.

Suggested change
    # Common backed array types from h5py and anndata
    backed_indicators = [
        "h5py" in module_name,
        "Dataset" in type_name,
        "AnnDataFileManager" in type_name,
        "SparseDataset" in type_name,
        # h5py Dataset has 'id' and 'file' attributes
        hasattr(data, "id") and hasattr(data, "file"),
    ]
    # Common backed array types from h5py, zarr, and anndata
    backed_indicators = [
        "h5py" in module_name,
        "zarr" in module_name,
        "Dataset" in type_name,
        "AnnDataFileManager" in type_name,
        "SparseDataset" in type_name,
        # h5py Dataset has 'id' and 'file' attributes
        hasattr(data, "id") and hasattr(data, "file"),
    ]

chunk = chunk.compute()

# Ensure numpy array with consistent dtype
return np.asarray(chunk, dtype=np.float32)


Severity: medium

This function forces the data type of the chunk to np.float32, which can cause a loss of precision if the original data is float64. The standard (shared-memory) implementation preserves the original dtype. To avoid unexpected precision loss and maintain consistency, it would be better to preserve the original data type. The subsequent calculations will handle integer types correctly by upcasting when necessary (e.g., when calculating the mean).

Suggested change
return np.asarray(chunk, dtype=np.float32)
return np.asarray(chunk)

# Sample a small chunk to check
sample_chunk = _load_chunk(adata.X, slice(0, min(100, adata.n_vars)))
frac = np.modf(sample_chunk.ravel()[:10000])[0]
is_log1p = bool(np.any(np.abs(frac) > 1e-3))


Severity: medium

The value 1e-3 is hardcoded here. This value is defined as EPSILON in _utils.py. For consistency and maintainability, you should import EPSILON from _utils (e.g., from ._utils import EPSILON) and use it here.

Suggested change
is_log1p = bool(np.any(np.abs(frac) > 1e-3))
is_log1p = bool(np.any(np.abs(frac) > EPSILON))

…rential_expression_vec_wrapper

- Introduced low_memory option for optimized processing.
- Added gene_chunk_size parameter to control chunk size during computations.
- Added show_progress parameter to parallel_differential_expression and parallel_differential_expression_vec_wrapper for better user feedback during processing.
- Updated handling of fc calculations to return NaN for zero means in specific cases, improving robustness of differential expression calculations.
- Introduced a new module `_parallel.py` containing utilities for parallel processing in differential expression calculations.
- Implemented functions for default parallelization settings, Numba thread management, and processing targets in chunks.
- Added a vectorized Wilcoxon ranksum test implementation for improved performance.
- Created a new test suite in `test_parallel.py` to validate the functionality of the parallelization helpers.
- Moved ranksum buffer preparation and kernel functions from `_single_cell.py` to `_parallel.py` for better organization and performance.
- Introduced a new `prepare_ranksum_buffers` function to allocate per-thread buffers.
- Added a vectorized implementation of the ranksum test using Numba for parallel processing.
- Updated `_single_cell.py` to utilize the new ranksum test functions, enhancing modularity and code clarity.
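The vectorized Wilcoxon rank-sum test mentioned above can be sketched with NumPy/SciPy. This is a normal-approximation sketch without tie correction, not the PR's Numba kernel; the function name `vectorized_ranksum` is illustrative:

```python
import numpy as np
from scipy.stats import norm, rankdata

def vectorized_ranksum(a, b):
    """Two-sided Wilcoxon rank-sum z statistic and p-value per gene.

    a, b: (n_cells, n_genes) arrays for the two groups. Ranks each
    gene's pooled values independently along axis 0, then applies
    the normal approximation (no tie correction).
    """
    n1, n2 = a.shape[0], b.shape[0]
    ranks = rankdata(np.vstack([a, b]), axis=0)
    r1 = ranks[:n1].sum(axis=0)                    # rank sum of group a
    mu = n1 * (n1 + n2 + 1) / 2.0                  # expected rank sum
    sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (r1 - mu) / sigma
    return z, 2.0 * norm.sf(np.abs(z))
```

For tie-free continuous data this matches `scipy.stats.ranksums` gene by gene, while ranking all genes in one `rankdata` call instead of looping in Python.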
…ameters and refactor processing logic

- Added `num_workers` and `num_threads` parameters to `_parallel_differential_expression_chunked` and `parallel_differential_expression` for improved control over parallel processing.
- Refactored target processing logic to utilize `process_target_in_chunk` and `process_targets_parallel` for better modularity and performance.
- Updated documentation to clarify the usage of new parameters and their impact on processing behavior.
- Updated progress descriptions in `process_targets_parallel` and `_parallel_differential_expression_chunked` to include the number of workers and Numba thread status for better user feedback.
- Enhanced logging to provide details on the number of threads configured for Numba, improving transparency during execution.
- Added detailed explanations for handling in-memory and backed AnnData objects in the README, clarifying execution strategies and memory management.
- Updated `parallel_differential_expression` docstring to specify the roles of `num_workers` and `num_threads` in low-memory mode, improving user understanding of parallelization options.
- Enhanced documentation for parallel processing utilities in `_parallel.py`, emphasizing their modularity and reusability.
- Introduced `is_integer_data` and `should_use_numba` functions to determine Numba applicability based on data type, improving performance for integer-like data.
- Updated `_parallel_differential_expression_chunked` to log warnings when Numba is disabled due to non-integer values, ensuring users are informed of fallback to SciPy.
- Adjusted default `num_threads` parameter in several functions to improve usability and consistency in parallel processing settings.
- Added tests for new functions to ensure correct behavior in various scenarios, enhancing overall test coverage.
…unctions

- Changed `num_workers` parameter in `parallel_differential_expression_vec_wrapper` to accept `None`, allowing for more flexible worker configuration.
- Updated index creation in `build_small_anndata` and `var` DataFrame to use `pd.Index`, enhancing compatibility with pandas operations.
- Modified `_sort_results` function to accept both `pd.DataFrame` and `pl.DataFrame`, improving versatility in handling different DataFrame types.
…ions

- Reformatted the `_compute_means` call in `process_target_in_chunk` for better readability.
- Simplified the iterable creation in `process_targets_parallel` by removing unnecessary line breaks, enhancing clarity in the code structure.
- Updated the DataFrame creation in `build_small_anndata` to streamline the dictionary definition, improving overall code conciseness.
@nick-youngblut
Contributor Author

See #64
