feat: Add automatic low-memory mode for backed AnnData objects #63
…ifferential expression calculations

- Added `_is_backed_array` function to check for backed/HDF5 arrays.
- Introduced `_load_chunk` to handle loading of matrix chunks.
- Updated `parallel_differential_expression` to automatically select a processing strategy based on data type.
- Enhanced error handling for backed AnnData in differential expression functions.
- Improved documentation for clarity on parameters and return values.
Code Review
This pull request introduces a significant enhancement by adding an automatic low-memory mode for handling backed AnnData objects. The code is well-structured, with the logic split into a standard shared-memory implementation and a new chunked implementation for low-memory scenarios. A dispatcher function cleanly selects the appropriate backend. The changes also include valuable bug fixes, improved input validation, and more consistent docstrings. My review includes one suggestion to refactor a small piece of duplicated code into a helper function to improve maintainability.
Code Review
This pull request introduces a significant and valuable feature by adding a low-memory, chunked processing mode for backed AnnData objects. The refactoring of the existing implementation into standard and chunked functions improves maintainability. The automatic detection of backed objects is a great usability improvement.
My review focuses on ensuring consistency between the new chunked implementation and the existing standard one, particularly regarding calculation logic and parameter handling. I've identified a few areas for improvement:

- Correcting an inconsistency in fold-change calculation when clipping is disabled.
- Fixing a potential `TypeError` in the `mannwhitneyu` call within the chunked path.
- Enhancing the API by exposing the `show_progress` parameter to the user.

These changes will improve the robustness and consistency of the new low-memory mode. Overall, this is a well-executed and important addition to the library.
Code Review
This pull request introduces a significant and valuable feature: automatic low-memory processing for backed AnnData objects. The refactoring into standard and chunked implementations is well-structured and improves maintainability. The addition of the `low_memory` and `gene_chunk_size` parameters provides useful flexibility for users. The bug fixes for `is_log1p` detection and the typo are also appreciated. My review focuses on improving the robustness of the backed-data detection and ensuring data type consistency across different processing modes. Specifically, I've suggested using the official anndata API for checking backed status, improving the fallback detection logic, and preserving data precision during chunk loading. I've also pointed out a minor inconsistency with a hardcoded constant.
```python
raise ValueError(f"Unknown metric: {metric} :: Expecting: {KNOWN_METRICS}")

# Determine whether to use low-memory mode
is_backed = _is_backed_array(adata.X)
```
Instead of using the custom `_is_backed_array` function, it would be more robust and maintainable to use the public anndata API for this check: `is_backed = adata.isbacked`. The `isbacked` property is the standard way to check whether an AnnData object is backed by an on-disk file, and it correctly handles different backing stores such as HDF5 and Zarr. The current implementation of `_is_backed_array` is fragile, as it relies on internal type and module names and does not cover all cases (e.g., Zarr).
Suggested change:

```diff
-is_backed = _is_backed_array(adata.X)
+is_backed = adata.isbacked
```
```python
# Common backed array types from h5py and anndata
backed_indicators = [
    "h5py" in module_name,
    "Dataset" in type_name,
    "AnnDataFileManager" in type_name,
    "SparseDataset" in type_name,
    # h5py Dataset has 'id' and 'file' attributes
    hasattr(data, "id") and hasattr(data, "file"),
]
```
This function for detecting backed arrays is a good idea for safeguarding, but it's not fully robust: it currently only checks for HDF5-backed arrays, while AnnData also supports Zarr for backing. To make this check more comprehensive, you should also check for Zarr arrays, for example with `"zarr" in module_name`. The docstring and comment should also be updated to reflect this.
Suggested change:

```diff
-# Common backed array types from h5py and anndata
+# Common backed array types from h5py, zarr, and anndata
 backed_indicators = [
     "h5py" in module_name,
+    "zarr" in module_name,
     "Dataset" in type_name,
     "AnnDataFileManager" in type_name,
     "SparseDataset" in type_name,
     # h5py Dataset has 'id' and 'file' attributes
     hasattr(data, "id") and hasattr(data, "file"),
 ]
```
```python
chunk = chunk.compute()

# Ensure numpy array with consistent dtype
return np.asarray(chunk, dtype=np.float32)
```
This function forces the data type of the chunk to `np.float32`, which can cause a loss of precision if the original data is `float64`. The standard (shared-memory) implementation preserves the original dtype. To avoid unexpected precision loss and maintain consistency, it would be better to preserve the original data type. The subsequent calculations will handle integer types correctly by upcasting when necessary (e.g., when calculating the mean).
Suggested change:

```diff
-return np.asarray(chunk, dtype=np.float32)
+return np.asarray(chunk)
```
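The precision concern is easy to demonstrate with a minimal numpy illustration (not the PR's code):

```python
import numpy as np

# A float64 value whose precision exceeds what float32 can represent.
chunk = np.array([1.0000000001], dtype=np.float64)

forced = np.asarray(chunk, dtype=np.float32)  # the cast flagged in the review
preserved = np.asarray(chunk)                 # dtype-preserving alternative

# float32 rounds the value to exactly 1.0; float64 keeps the small difference.
```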
```python
# Sample a small chunk to check
sample_chunk = _load_chunk(adata.X, slice(0, min(100, adata.n_vars)))
frac = np.modf(sample_chunk.ravel()[:10000])[0]
is_log1p = bool(np.any(np.abs(frac) > 1e-3))
```
The value `1e-3` is hardcoded here. This value is defined as `EPSILON` in `_utils.py`. For consistency and maintainability, you should import `EPSILON` from `_utils` (e.g., `from ._utils import EPSILON`) and use it here.
Suggested change:

```diff
-is_log1p = bool(np.any(np.abs(frac) > 1e-3))
+is_log1p = bool(np.any(np.abs(frac) > EPSILON))
```
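The heuristic itself can be illustrated with numpy: log1p-transformed counts have non-trivial fractional parts, while raw integer counts do not. The threshold value `1e-3` matches the snippet above; the helper name below is illustrative:

```python
import numpy as np

EPSILON = 1e-3  # threshold for deciding whether fractional parts are present

def looks_log1p(sample):
    """Heuristic: any fractional part above EPSILON suggests log1p-transformed data."""
    frac = np.modf(np.asarray(sample, dtype=np.float64).ravel())[0]
    return bool(np.any(np.abs(frac) > EPSILON))

raw_counts = np.array([0, 1, 2, 5, 10])  # integer counts: fractional parts are all 0
logged = np.log1p(raw_counts)            # e.g. log1p(2) ~ 1.0986
```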
…rential_expression_vec_wrapper

- Introduced `low_memory` option for optimized processing.
- Added `gene_chunk_size` parameter to control chunk size during computations.

- Added `show_progress` parameter to `parallel_differential_expression` and `parallel_differential_expression_vec_wrapper` for better user feedback during processing.
- Updated handling of fold-change calculations to return NaN for zero means in specific cases, improving robustness of differential expression calculations.

- Introduced a new module `_parallel.py` containing utilities for parallel processing in differential expression calculations.
- Implemented functions for default parallelization settings, Numba thread management, and processing targets in chunks.
- Added a vectorized Wilcoxon rank-sum test implementation for improved performance.
- Created a new test suite in `test_parallel.py` to validate the functionality of the parallelization helpers.

- Moved rank-sum buffer preparation and kernel functions from `_single_cell.py` to `_parallel.py` for better organization and performance.
- Introduced a new `prepare_ranksum_buffers` function to allocate per-thread buffers.
- Added a vectorized implementation of the rank-sum test using Numba for parallel processing.
- Updated `_single_cell.py` to utilize the new rank-sum test functions, enhancing modularity and code clarity.

…ameters and refactor processing logic

- Added `num_workers` and `num_threads` parameters to `_parallel_differential_expression_chunked` and `parallel_differential_expression` for improved control over parallel processing.
- Refactored target processing logic to utilize `process_target_in_chunk` and `process_targets_parallel` for better modularity and performance.
- Updated documentation to clarify the usage of new parameters and their impact on processing behavior.

- Updated progress descriptions in `process_targets_parallel` and `_parallel_differential_expression_chunked` to include the number of workers and Numba thread status for better user feedback.
- Enhanced logging to provide details on the number of threads configured for Numba, improving transparency during execution.

- Added detailed explanations for handling in-memory and backed AnnData objects in the README, clarifying execution strategies and memory management.
- Updated the `parallel_differential_expression` docstring to specify the roles of `num_workers` and `num_threads` in low-memory mode, improving user understanding of parallelization options.
- Enhanced documentation for parallel processing utilities in `_parallel.py`, emphasizing their modularity and reusability.

- Introduced `is_integer_data` and `should_use_numba` functions to determine Numba applicability based on data type, improving performance for integer-like data.
- Updated `_parallel_differential_expression_chunked` to log warnings when Numba is disabled due to non-integer values, ensuring users are informed of the fallback to SciPy.
- Adjusted the default `num_threads` parameter in several functions to improve usability and consistency in parallel processing settings.
- Added tests for new functions to ensure correct behavior in various scenarios, enhancing overall test coverage.

…unctions

- Changed the `num_workers` parameter in `parallel_differential_expression_vec_wrapper` to accept `None`, allowing for more flexible worker configuration.
- Updated index creation in `build_small_anndata` and the `var` DataFrame to use `pd.Index`, enhancing compatibility with pandas operations.
- Modified the `_sort_results` function to accept both `pd.DataFrame` and `pl.DataFrame`, improving versatility in handling different DataFrame types.

…ions

- Reformatted the `_compute_means` call in `process_target_in_chunk` for better readability.
- Simplified the iterable creation in `process_targets_parallel` by removing unnecessary line breaks, enhancing clarity in the code structure.
- Updated the DataFrame creation in `build_small_anndata` to streamline the dictionary definition, improving overall code conciseness.
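The chunked processing idea behind these commits can be sketched in pure numpy. The function names below are illustrative simplifications, not the PR's `process_target_in_chunk`; the sketch shows only the per-chunk mean computation, one gene chunk resident at a time:

```python
import numpy as np

def iter_gene_chunks(n_genes, gene_chunk_size):
    """Yield slices covering [0, n_genes) in chunks of gene_chunk_size."""
    for start in range(0, n_genes, gene_chunk_size):
        yield slice(start, min(start + gene_chunk_size, n_genes))

def chunked_group_means(X, group_mask, gene_chunk_size=1000):
    """Compute per-gene means for one cell group, one gene chunk at a time.

    Only O(n_cells x gene_chunk_size) values are resident per iteration,
    instead of the full O(n_cells x n_genes) matrix.
    """
    n_genes = X.shape[1]
    means = np.empty(n_genes, dtype=np.float64)
    for sl in iter_gene_chunks(n_genes, gene_chunk_size):
        chunk = np.asarray(X[:, sl])  # load just this gene slice into memory
        means[sl] = chunk[group_mask].mean(axis=0)
    return means

# Small worked example: 3 cells x 4 genes, target group = cells 0 and 2.
X = np.arange(12, dtype=np.float64).reshape(3, 4)
mask = np.array([True, False, True])
means = chunked_group_means(X, mask, gene_chunk_size=2)
```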
See #64 |
Summary
This PR adds automatic detection and handling of backed AnnData objects, enabling `parallel_differential_expression` to process datasets that don't fit in memory. When a backed AnnData is detected, the function automatically switches to a memory-efficient chunked processing strategy.

Motivation

Users with large datasets (millions of cells) often load AnnData objects in backed mode (`backed="r"`) to avoid memory issues. Previously, this would fail with an unhelpful `AssertionError`; this PR fixes that.

Changes
New Features
- `low_memory: bool | None` - Force low-memory mode (default: auto-detect)
- `gene_chunk_size: int` - Number of genes per chunk (default: 1000)
- Peak memory reduced from `O(n_cells × n_genes)` to `O(n_cells × chunk_size)`

Bug Fixes
- Fixed `is_log1p` override bug: changed `if not is_log1p:` to `if is_log1p is None:` to prevent auto-detection from overriding an explicit `is_log1p=False`

Improvements
- Validates `groupby_key` and `reference` parameters with helpful error messages
- Type hints (`typing.Any`, fixed return type annotations)
- `_parallel_differential_expression_standard()` - original shared-memory implementation
- `_parallel_differential_expression_chunked()` - new low-memory implementation

Usage
Automatic backed mode handling
Force low-memory mode
Standard usage (unchanged)
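The code samples for the three usage patterns above were not preserved; the sketch below reconstructs them under assumptions. `parallel_differential_expression`, `groupby_key`, `reference`, `low_memory`, and `gene_chunk_size` come from this PR; the file path, the column values (`"cell_type"`, `"control"`), and the import of `anndata` as the loader are hypothetical:

```python
import anndata as ad

# Automatic backed mode handling: a backed AnnData triggers chunked processing.
adata = ad.read_h5ad("large_dataset.h5ad", backed="r")  # hypothetical path
results = parallel_differential_expression(
    adata, groupby_key="cell_type", reference="control"
)

# Force low-memory mode, even for in-memory data.
results = parallel_differential_expression(
    adata, groupby_key="cell_type", reference="control",
    low_memory=True, gene_chunk_size=500,
)

# Standard usage (unchanged): an in-memory AnnData uses the shared-memory path.
adata = ad.read_h5ad("large_dataset.h5ad")
results = parallel_differential_expression(
    adata, groupby_key="cell_type", reference="control"
)
```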
Memory Comparison
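The comparison can be made concrete with a rough back-of-envelope calculation (illustrative numbers for a dense float32 matrix, not the PR's measured figures):

```python
# Footprint of a dense float32 expression matrix, full vs. chunked.
n_cells, n_genes, gene_chunk_size = 1_000_000, 20_000, 1_000
bytes_per_value = 4  # float32

full_gb = n_cells * n_genes * bytes_per_value / 1e9           # whole matrix resident
chunk_gb = n_cells * gene_chunk_size * bytes_per_value / 1e9  # one gene chunk resident

print(f"full: {full_gb:.0f} GB, chunked: {chunk_gb:.0f} GB")  # 80 GB vs 4 GB
```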
Breaking Changes
None. The API is fully backward compatible.