@sehoffmann
@d-v-b

This PR adds experimental support for subarray dtypes (https://numpy.org/doc/stable/glossary.html#term-subarray-data-type, https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation) and closes #3582 and #3583.

It also fixes support for nested (and subarray-containing) structured dtypes in Zarr v2, which worked in 2.18.* but no longer works in 3.1.*. In particular, the buggy implementation overlooked that a nested structured dtype is itself a list of lists and not just a single flat list.
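To illustrate the "list of lists" point, here is a minimal sketch using plain NumPy. The helper `describe` is hypothetical and is not the code in this PR; it only assumes the Zarr v2 convention of describing a structured dtype as a list of `[name, dtype]` or `[name, dtype, shape]` entries, where the dtype of a nested field is again such a list.

```python
import numpy as np


def describe(dtype: np.dtype):
    """Hypothetical helper: describe a dtype as a string or a nested list."""
    if dtype.fields is None:
        # Plain (non-structured) dtype: use its string form, e.g. "<f4".
        return dtype.str
    entries = []
    for name in dtype.names:
        field_dtype = dtype.fields[name][0]
        if field_dtype.subdtype is not None:
            # Subarray field: record the base dtype and the subarray shape.
            base, shape = field_dtype.subdtype
            entries.append([name, describe(base), list(shape)])
        else:
            # Recurse, so a nested structured field becomes a nested list.
            entries.append([name, describe(field_dtype)])
    return entries


inner = np.dtype([("x", "<f4"), ("y", "<f4")])
outer = np.dtype([("id", "<i4"), ("pos", inner), ("m", "<i2", (2, 2))])
print(describe(outer))
```

A flat traversal would lose the structure of the `pos` field, which is the kind of regression described above.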

Note 1:
Subarray dtypes are in a very weird spot. They are a proper np.dtype, in particular an np.dtypes.VoidDType whose fields attribute is unset but whose subdtype attribute is set. Hence, it makes sense to map them one-to-one to a ZDType. This also makes sense from an implementation standpoint with respect to serialization.

On the other hand, they do not have a proper scalar value, i.e. one cannot create an np.void scalar for a subarray dtype (it throws). Conceptually, a scalar value of a subarray dtype would be an np.ndarray. This, however, is not a subtype of np.generic despite sharing a lot of the interface. When one creates an np.ndarray with a subarray dtype directly, the result is a "flat" np.ndarray with shape array_shape + subarray_shape.
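The behaviour described in this note can be checked with plain NumPy. This is only an illustrative sketch; the variable names are made up for the example.

```python
import numpy as np

# A standalone subarray dtype: int32 elements with subarray shape (2, 3).
sub = np.dtype((np.int32, (2, 3)))

print(sub.fields)    # None: not a structured dtype
print(sub.subdtype)  # base dtype and subarray shape
print(sub.shape)     # the subarray shape

# Creating an array directly with the subarray dtype "flattens" it:
# the subarray shape is appended to the array shape and the dtype
# becomes the base dtype.
arr = np.zeros(4, dtype=sub)
print(arr.shape, arr.dtype)
```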

I've decided to still implement them as a separate Subarray ZDType rather than conflating them with the Structured class. While this works flawlessly when a subarray is used within a structured dtype (the intended use case), using subarray dtypes directly is not fully supported. Specifically, there is no specification for standalone subarray dtypes in Zarr V2, which makes a lot of test cases fail. Apart from that, some tests in test_array.py do not expect an array as a scalar and hence fail. I want to stress, though, that I was able to successfully create and read a Subarray zarr array with V3.
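For comparison, a short NumPy-only sketch of the intended use case, a subarray field inside a structured dtype. It does not touch Zarr; the field names are chosen just for this example.

```python
import numpy as np

# Structured dtype with a scalar field and a (2, 2) subarray field.
rec = np.dtype([("id", "<i4"), ("m", "<f4", (2, 2))])
data = np.zeros(3, dtype=rec)

# The array keeps its own shape; the subarray shape only shows up
# when the field is accessed.
print(data.shape)       # shape of the record array
print(data["m"].shape)  # array shape + subarray shape

# A single element is a proper np.void scalar, while the subarray
# field of that scalar is an np.ndarray.
element = data[0]
print(type(element), type(element["m"]))
```

Inside a structured dtype the scalar is well defined (an np.void), which is why this case avoids the scalar problem described above.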

Solving this conundrum adequately is beyond what I can do and might require significant conceptual changes in Zarr. I did not add the dtype directly to test_dtype/conftest.py but instead added a new test case for Structured that uses a Subarray inside, which passes.

Note 2: I've also added a test case for an invalid float value string which fails due to #3584. Since that test case highlights an existing bug, I've decided to leave it there.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Nov 20, 2025
@codecov
codecov bot commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 63.15789% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.93%. Comparing base (b873691) to head (c168f9f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/core/dtype/npy/subarray.py | 58.55% | 46 Missing ⚠️ |
| src/zarr/core/dtype/npy/structured.py | 80.00% | 5 Missing ⚠️ |
| src/zarr/core/dtype/common.py | 71.42% | 4 Missing ⚠️ |
| src/zarr/core/dtype/__init__.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3587      +/-   ##
==========================================
+ Coverage   60.90%   60.93%   +0.03%     
==========================================
  Files          86       87       +1     
  Lines       10174    10315     +141     
==========================================
+ Hits         6196     6285      +89     
- Misses       3978     4030      +52     
| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/core/dtype/npy/bytes.py | 53.00% <100.00%> (ø) |
| src/zarr/core/dtype/__init__.py | 29.50% <0.00%> (-0.50%) ⬇️ |
| src/zarr/core/dtype/common.py | 33.33% <71.42%> (+5.62%) ⬆️ |
| src/zarr/core/dtype/npy/structured.py | 60.34% <80.00%> (+3.96%) ⬆️ |
| src/zarr/core/dtype/npy/subarray.py | 58.55% <58.55%> (ø) |

@sehoffmann
Author

@d-v-b Don't want to be pushy here, but did you manage to have a look at this PR yet? Do you have any feedback or is there anything that needs to be changed or addressed?

@d-v-b
Contributor

d-v-b commented Dec 19, 2025

> @d-v-b Don't want to be pushy here, but did you manage to have a look at this PR yet? Do you have any feedback or is there anything that needs to be changed or addressed?

Hi @sehoffmann, sorry for the long silence. I think there are two distinct elements in this PR: the first is improving how we handle numpy structured dtypes, and the second is adding sub-array data types.

The first element looks great, but I have some concerns about the second element. So far we have tried to keep the set of supported data types as close as possible to the union of the data types zarr python v2 supported, plus the data types supported by other zarr v3 implementations (namely, zarrs and tensorstore).

This means that when we add a new data type, there are two questions to answer: is this dtype something people used in zarr python 2 (and if so, does adding it resolve a feature regression)? Or is this dtype something the other zarr v3 implementations support? If the answer to both of those is "no", then it seems like the maintenance burden for zarr-python might not be worth it, compared to the alternative of users registering this data type themselves via the registry. And I think sub-arrays are not something people used heavily in zarr python 2.x, nor are they supported by other zarr v3 implementations (please correct me if I'm wrong on either of these points).

How important is it for your application that this data type is bundled with Zarr python? And if that outcome is very important, would you be willing to work on a data type spec in the zarr-extensions repo? I think I'd support adding the new subarray data type unreservedly if there was buy-in from other zarr implementers. Without that buy-in, I'm pretty skeptical about the addition, and I would encourage using the data type registry to register the data type instead of relying on it being shipped with zarr-python.

These are just my thoughts though; it would be good to hear from the other devs @zarr-developers/python-core-devs

Development

Successfully merging this pull request may close these issues.

Subarray dtypes get lost on serialization / casted to void type

2 participants