Open
Labels: help wanted
Description
The "current" pathways serialization format (what I've been calling "abomination") is a custom format that requires serializer/deserializer modification to extend.
Issue #25 may require extending the format. Rather than extending it, this issue proposes replacing the custom format with a standard serialization format.
Some options to consider (also a paper):
| Serialization format | Human readable | Appendable | Multi-language impls | Standardised | Extensible | Compact |
|---|---|---|---|---|---|---|
| Abomination | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| YAML | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| JSON | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| SQLite | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ |
| JSONlines | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| CBOR | ❌ | ❓ | ✅ | ✅ | ❓ | ✅ |
| Protobuffers | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Flatbuffers | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Avro | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Thrift | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Cap'n'proto | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Twine | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Preserves | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| UBJSON | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Postcard | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
- Human readable
  - Plain text encoding.
- Appendable
  - I don't need to know about more than the single row I'm appending in order to append to the file (e.g. JSON is not appendable because of the enclosing array delimiters).
- Multi-language impls
  - There are off-the-shelf serializers/deserializers for the format in Python and at least one other language.
- Standardised
  - The format is documented in an internet standard from IEEE, W3C, etc.
- Extensible
  - When a field is added, old software can still work with data serialized with the new field.
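As a rough illustration of the "Appendable" and "Extensible" criteria, here is a minimal sketch using JSON Lines and only the Python standard library. The file name and the field names (`id`, `name`, `citation`) are hypothetical and are not taken from the actual pathways data.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line; the existing file is never read."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Yield one dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical file and fields, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "pathways.jsonl")
append_record(path, {"id": "P1", "name": "glycolysis"})
# A later writer adds an optional "citation" key to new records.
append_record(path, {"id": "P2", "name": "TCA cycle", "citation": "hypothetical-ref"})

# An "old" reader that only knows id and name is unaffected by the new key.
names = [r["name"] for r in read_records(path)]
# A "new" reader treats the key as optional.
citations = [r.get("citation") for r in read_records(path)]
```

Appending only ever touches the end of the file, and records without the new key stay valid, which is what the two definitions above ask for.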
Why?
I think there are a few good reasons to consider this change.
- The current format requires specialised knowledge to understand the data format itself (not just domain knowledge of metabolomics). Using a more common format means that someone receiving the data can use an off-the-shelf parser and be confident that it works.
- eval()-ing Python is slow and dangerous. literal_eval() is safer but still much slower than parsing, say, YAML. A common format also opens up the possibility of using the data in non-Python languages. I'm already doing this a little in the command-line client app I'll give you, which is written in Rust. Parsing literal Python values outside of Python is painful but possible; a more common format makes it much easier.
- Extending the data you want to store becomes much easier. You don't have to make fundamental adjustments to the format to add a citation field. In YAML you would just add an optional key to each object in the list that has a citation. In SQLite you'd add an extra nullable column. Both options store less data than the 3-byte empty list that every entry would need in your custom format (3 bytes × 12,000 entries ≈ 35 KiB for a single 12,000-entry file).
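The SQLite option from the last point can be sketched as follows, using Python's built-in `sqlite3` module. The table and column names (`pathways`, `id`, `name`, `citation`) are assumptions made for the example, not the real schema.

```python
import sqlite3

# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathways (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO pathways VALUES ('P1', 'glycolysis')")

# A query written before the schema change.
old_query = "SELECT id, name FROM pathways ORDER BY id"
before = conn.execute(old_query).fetchall()

# Extend the data model: one nullable column, no rewrite of existing rows.
conn.execute("ALTER TABLE pathways ADD COLUMN citation TEXT")
conn.execute(
    "INSERT INTO pathways (id, name, citation) VALUES (?, ?, ?)",
    ("P2", "TCA cycle", "hypothetical-ref"),
)

# The old query still works; existing rows read back NULL (None) for citation.
after = conn.execute(old_query).fetchall()
rows = conn.execute("SELECT id, citation FROM pathways ORDER BY id").fetchall()
conn.close()
```

Software that names the columns it reads keeps working after the column is added, and entries without a citation carry a NULL and not an empty placeholder value.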