Rework the pathways serialization format #26

@MaybeJustJames

Description

The "current" pathways serialization format (what I've been calling "abomination") is a custom format that requires serializer/deserializer modification to extend.

Issue #25 may require extending the format. Rather than extending it, this issue proposes replacing the custom format with a standard serialization format.

Some options to consider (also a paper):

Serialization format | Human readable | Appendable | Multi-language impls | Standardised | Extensible | Compact
Abomination
YAML
JSON
SQLite
JSON Lines
CBOR
Protocol Buffers
FlatBuffers
Avro
Thrift
Cap'n Proto
Twine
Preserves
UBJSON
Postcard
Human readable: Plain text encoding.
Appendable: I don't need to know about more than the single row I'm appending to the file in order to append (e.g. JSON is not appendable because of array delimiters).
Multi-language impls: There are off-the-shelf serializers/deserializers for the format in Python and at least 1 other language.
Standardised: The format is documented in an internet standard from IEEE, W3C, etc.
Extensible: When a field is added, old software can still work with data serialized with the new field.
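
To make the "appendable" and "extensible" criteria concrete, here is a minimal sketch using JSON Lines (one JSON object per line). The field names `name` and `citation`, the file name, and the helper functions are hypothetical, chosen only for illustration; they are not the actual pathways schema.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line, without reading the file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_names(path):
    """An 'old' reader that only knows about the (hypothetical) `name` field."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line)["name"] for line in fh if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "pathways.jsonl")

# Appendable: each write only needs the row being added.
append_record(path, {"name": "glycolysis"})

# Extensible: a newer writer adds a `citation` key to some rows.
append_record(path, {"name": "TCA cycle", "citation": "hypothetical-ref"})

# The old reader ignores the key it does not know about.
names = read_names(path)
print(names)
```

A plain JSON array would not allow the first step, because the closing `]` has to be rewritten on every append.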

Why?

I think there are a few good reasons to consider this change.

  1. The current format requires specialised knowledge to understand the data format itself (not just domain knowledge of metabolomics). Using a more common format means that someone receiving the data can use an off-the-shelf parser and be confident that it works.
  2. eval()-ing Python is slow and dangerous. literal_eval() is safer but still much slower than parsing, say, YAML. A common format also opens up the possibility of using the data in non-Python languages. I'm already doing this a little in the command-line client app I'll give you, which is written in Rust. Parsing literal Python values outside of Python is painful but possible; using a more common format makes this much easier.
  3. Extending the data you want to store becomes much easier. You don't have to make fundamental adjustments to the format to add a citation field. In YAML you would just add an optional key to each object in the list that has a citation. In SQLite you'd add an extra nullable column. Both of these options take less space than a 3-byte empty list for each entry in your custom format (~35k for a single 12000-entry file).
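
The points above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's code: the record layout, the `pathways` table, and its `id`, `name`, and `citation` columns are all hypothetical, since the actual pathway fields are not shown in this issue.

```python
import ast
import json
import sqlite3

# Hypothetical record; the real pathway fields are assumptions here.
record = {"id": "P1", "name": "glycolysis", "compounds": ["glucose", "pyruvate"]}

# Point 2: a Python-literal dump needs ast.literal_eval (or eval) to read back,
# while the JSON form can be parsed by off-the-shelf libraries in any language.
from_literal = ast.literal_eval(repr(record))
from_json = json.loads(json.dumps(record))

# Point 3: adding an optional citation in SQLite is one nullable column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathways (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO pathways VALUES (?, ?)", ("P1", "glycolysis"))
conn.execute("ALTER TABLE pathways ADD COLUMN citation TEXT")  # nullable by default
conn.execute(
    "INSERT INTO pathways VALUES (?, ?, ?)",
    ("P2", "TCA cycle", "hypothetical-ref"),
)

# Existing rows get NULL (None) for the new column.
rows = conn.execute("SELECT id, name, citation FROM pathways ORDER BY id").fetchall()

# A query written before the column existed still works unchanged.
old_rows = conn.execute("SELECT id, name FROM pathways ORDER BY id").fetchall()

# Size estimate from the text: a 3-byte empty list on each of 12000 entries.
overhead_bytes = 12_000 * 3
overhead_kib = overhead_bytes / 1024
print(rows, old_rows, overhead_bytes, overhead_kib)
```

The last two lines reproduce the "~35k" figure: 36000 bytes is roughly 35 KiB.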

Metadata

Labels

help wanted
