-
-
Notifications
You must be signed in to change notification settings - Fork 371
Experimental Support for Subarray DTypes #3587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… for nested Structured dtypes in V2 (zarr-developers#3582, zarr-developers#3583)
221cc75 to
c21fcc6
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3587 +/- ##
==========================================
+ Coverage 60.90% 60.93% +0.03%
==========================================
Files 86 87 +1
Lines 10174 10315 +141
==========================================
+ Hits 6196 6285 +89
- Misses 3978 4030 +52
🚀 New features to boost your workflow:
|
|
@d-v-b Don't want to be pushy here, but did you manage to have a look at this PR yet? Do you have any feedback or is there anything that needs to be changed or addressed? |
Hi @sehoffmann sorry for the long silence. I think there are 2 distinct elements in this PR: first is improving how we handle numpy structured dtypes, and the second is including sub-array data types. The first element looks great, but I have some concerns about the second element. So far we have tried to keep the set of supported data types as close as possible to the union of the data types zarr python v2 supported, plus the data types supported by other zarr v3 implementations (namely, zarrs and tensorstore). This means when we add a new data type, there are two questions to answer: is this dtype something people used in zarr python 2, (and if does adding it resolve a feature regression)? or, is this dtype something the other zarr v3 implementations are supporting? If the answer to both of those is "no", then it seems like the maintenance burden for zarr-python might not be worth it, compared to the alternative of users registering this data type themselves via the registry. And I think sub-arrays are not something people used heavily in zarr python 2.x, nor are they supported by other zarr v3 implementations (please correct me if I'm wrong on either of these points). How important is it for your application that this data type is bundled with Zarr python? And if that outcome is very important, would you be willing to work on a data type spec in the zarr-extensions repo? I think I'd support adding the new subarray data type unreservedly if there was buy-in from other zarr implementers. Without that buy-in, I'm pretty skeptical about the addition, and I would encourage using the data type registry to register the data type instead of relying on it being shipped wit zarr-python. these are just my thoughts though, it would be good to hear from the other devs @zarr-developers/python-core-devs |
@d-v-b
This PR adds experimental support for subarray dtypes (https://numpy.org/doc/stable/glossary.html#term-subarray-data-type, https://numpy.org/doc/stable/user/basics.rec.html#structured-datatype-creation) and closes #3582 and #3583.
It also fixes support for nested (and subarray-containing)
Structureddtypes for Zarr v2 which worked before in 2.18.* but not anymore 3.1.*. In particular, the buggy implementation forgot that a nested structured dtype is again a list of lists and not just a single flat list.Note 1:
Subarray dtypes are in a very weird spot. They are a proper
np.dtype, particular anp.VoidDTypewith unsetfieldsattribute but setsubdtypefield. Hence, it makes sense to map them one-to-one to aZDType. This also makes sense from an implementation standpoint wrt. serialization.On the other hand, they do not have a proper scalar value. I.e. one can not create a
np.voidscalar for a subarray dtype (throws). Conceptually, a scalar value of a subarray dtype would be anp.ndarray. This, however, is not a subtype ofnp.genericdespite sharing a lot of the interface. When one creates a np.ndarray with a subarray dtype directly, the result is "flat"np.ndarraywith shapearray_shape + subarray_shape.I've decided to still implement them as separate
Subarray-ZDType and not conflate them within theStructuredclass. While this works flawlessly when used within a structured dtype, the intended use case, using them directly is not fully supported. Specifically, there is no specification for standalone subarray dtypes in Zarr V2, making a lot of test cases fail. Apart from that, some tests in test_array.py do not expect an array as scalar and hence fail. I want to stress though, that I was able to successfully create and read a Subarray zarr array with V3.Solving this conundrum adequately is beyond my possibilities and might require significant conceptual changes in Zarr. I did not add the dtype directly to
test_dtype/contest.pybut instead added a new test case forStructuredthat uses a Subarray inside which passes.Note 2: I've also added a test case for an invalid float value string which fails due to #3584. Since that test case highlights an existing bug, I've decided to leave it there.
TODO:
docs/user-guide/*.mdchanges/