Rework the pathways serialization format #26

The "current" pathways serialization format (what I've been calling "abomination") is a custom format that requires serializer/deserializer modification to extend.

Issue #25 perhaps requires extension of the format. Rather than extend the format, this issue proposes replacing the custom format with a standard serialization format.

Some options to consider (also a [paper](https://arxiv.org/abs/2201.02089)): 
| Serialization format | Human readable | Appendable | Multi-language impls | Standardised | Extensible | Compact |
|------------------------------|--------------------------|--------------------|--------------------------------|---------------------|----------------|--------------|
| Abomination             | ✅  | ✅  | ❌  | ❌ | ❌  | ❌ |
| YAML                          | ✅  | ✅  | ✅  | ❌  | ✅  | ❌ |
| JSON                           | ✅  | ❌  | ✅  | ✅  | ✅  | ❌  |
| SQLite                        | ❌  | ✅  | ✅  | ❌ | ✅   | ✅  |
| JSONlines                  | ✅  | ✅  | ✅  | ❌ | ✅  | ❌  |
| CBOR                         | ❌  | ❓ | ✅  | ✅  | ❓ | ✅  |
| Protobuffers            | ❌ | ❓  | ✅  | ❌  | ❓  | ✅  |
| Flatbuffers                | ❌ | ❓ | ✅  | ❌ | ❓ | ✅  |
| Avro                           | ❌ | ❓ | ✅ | ❌ | ❓ | ✅  |
| Thrift                          | ❌ | ❓ | ✅ | ❌ | ❓ | ✅ |
| Cap'n'proto               | ❌ | ❓ | ✅ | ❌ | ❓ | ✅  |
| [Twine](https://twine-data.dev/) | :question: | :question: | :question: | :question: | :question: | :question: |
| [Preserves](https://preserves.dev/) | :question: | :question: | :question: | :question: | :question: | :question: |
| [UBJSON](https://ubjson.org/) | :question: | :question: | :question: | :question: | :question: | :question: |
|[Postcard](https://postcard.jamesmunns.com/intro) | :question: | :question: | :question: | :question: | :question: | :question: |


<dl>
<dt>Human readable</dt>
<dd>Plain text encoding.</dd>

<dt>Appendable</dt>
<dd>I don't need to know about more than the single row I'm appending to the file in order to append (e.g. JSON is not appendable because of array delimiters).</dd>

<dt>Multi-language impls</dt>
<dd>There are off-the-shelf serializers/deserializers for the format in Python and at least one other language.</dd>

<dt>Standardised</dt>
<dd>The format is documented in an internet standard from IEEE, W3C, etc.</dd>

<dt>Extensible</dt>
<dd>When a field is added, old software can still work with data serialized with the new field.</dd>
</dl>
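
To make "appendable" and "extensible" concrete, here is a minimal sketch using JSON Lines with Python's standard `json` module. The file name and the `name`, `compounds` and `citation` fields are hypothetical, chosen only for illustration; they are not the real pathways schema.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line.

    Nothing already in the file is read or rewritten, which is the
    'appendable' property from the table.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_names(path):
    """An 'old' reader that only knows about the `name` field.

    Unknown keys (such as a later `citation`) are simply ignored,
    which is the 'extensible' property.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["name"] for line in f if line.strip()]


# Hypothetical records; the second one carries a field the old reader has never seen.
path = os.path.join(tempfile.mkdtemp(), "pathways.jsonl")
append_record(path, {"name": "glycolysis", "compounds": ["glucose", "pyruvate"]})
append_record(path, {"name": "TCA cycle", "compounds": ["citrate"], "citation": "hypothetical-ref"})

names = read_names(path)
```

A plain JSON array would need the closing `]` to be moved on every append, which is why JSON is marked as not appendable above.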

## Why?
I think there are a few good reasons to consider this change.
1. The current format requires specialised knowledge to understand the data format itself (not just domain knowledge of metabolomics). Using a more common format means that someone receiving the data can use an off-the-shelf parser and be confident that it works.
2. eval()-ing Python is slow and dangerous. literal_eval() is safer but still much slower than parsing, say, YAML. A common format also opens up the possibility of using the data in non-Python languages. I'm already doing this a little in the command-line client app I'll give you, which is written in Rust. Parsing literal Python values outside of Python is painful but possible; a more common format makes this much easier.
3. Extending the data you want to store becomes much easier. You don't have to make fundamental adjustments to the format to add a citation field. In YAML you would just add an optional key to each object in the list that has a citation. In SQLite you'd add an extra nullable column. Both of these options take less data than a 3-byte empty list for each entry in your custom format (~35 kB for a single 12,000-entry file).
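
The SQLite variant of point 3 might look roughly like this, using Python's built-in `sqlite3` module. The `pathways` table and its `name` and `citation` columns are assumptions made up for this sketch, not an agreed schema.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathways (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO pathways (name) VALUES (?)",
    [("glycolysis",), ("TCA cycle",)],
)


def old_reader(connection):
    """Stand-in for existing software that only knows the `name` column."""
    return [row[0] for row in connection.execute("SELECT name FROM pathways ORDER BY id")]


before = old_reader(conn)

# Extend the schema: a nullable column, so existing rows need no rewrite.
conn.execute("ALTER TABLE pathways ADD COLUMN citation TEXT")
conn.execute(
    "INSERT INTO pathways (name, citation) VALUES (?, ?)",
    ("urea cycle", "hypothetical-ref"),
)

# The old reader still works unchanged after the schema change.
after = old_reader(conn)

# Rows written before the change read back as NULL (None in Python).
citations = [row[0] for row in conn.execute("SELECT citation FROM pathways ORDER BY id")]
```

The old query names its columns explicitly, so the new column does not affect it; code that relies on `SELECT *` and positional unpacking would be more fragile.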
