Question

Why are annotations not encoded in a standard data format?

3

8.5 years ago

jsgounot ▴ 170

Hi,

I don't know whether this kind of question can be asked on BioStar. From a computing point of view, I don't quite understand why current annotation formats such as GFF, EMBL, and so on are used when standard formats such as JSON could work. I feel these formats were designed to make the data more "human readable", but since everyone uses them in their own way, it is sometimes really hard to parse two annotation files the same way. Moreover, formats such as JSON have efficient parsers in most programming languages. Am I missing something?

annotations open-question encoding • 2.5k views

updated 8.5 years ago by lh3 33k • written 8.5 years ago by jsgounot ▴ 170

2

GFF predates JSON by several years...

8.5 years ago by User 59 13k

1

GenBank format may predate JSON by a decade or so, dating from a time when most Unix tools crashed on any line longer than 1024 bytes.

8.5 years ago by lh3 33k

score 4 · Answer 1 · 2016-11-28

A couple of thoughts...

I feel like these formats were used to make the data more "human readable"

Actually, I would say the main advantage of GFF, BED, etc. is that they can be easily parsed with stream tools like sort, sed, grep, awk and (yes) Excel. Also, I'm not sure JSON could be indexed the way tabular files can be with tools like Tabix.
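To illustrate the point about line-oriented parsing, here is a minimal Python sketch of what a one-liner such as `awk '$3 == "gene"'` does on a GFF3 file. The three-record fragment and the helper name `iter_features` are made up for illustration; this is not a full GFF3 parser (it ignores URL-escaping in attributes, for instance).

```python
import io

# Hypothetical GFF3 fragment, invented for this example.
GFF_TEXT = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=abc\n"
    "chr1\texample\tmRNA\t1000\t2000\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "chr1\texample\tgene\t5000\t6000\t.\t-\t.\tID=gene2;Name=def\n"
)


def iter_features(handle, feature_type):
    """Yield (seqid, start, end, attributes) for records of one feature type."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a nine-column GFF3 record
        if fields[2] != feature_type:
            continue  # column 3 holds the feature type
        # Column 9 is a ';'-separated list of key=value pairs.
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        yield fields[0], int(fields[3]), int(fields[4]), attributes


genes = list(iter_features(io.StringIO(GFF_TEXT), "gene"))
for seqid, start, end, attributes in genes:
    print(seqid, start, end, attributes.get("Name"))
```

Because each record is one line, the filter never needs to hold more than one record in memory, which is the same property that makes grep and awk convenient here.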

Another reason could be that these formats have been around long enough that switching to something completely different is impractical (a similar point was made for BAM vs CRAM or other alignment formats).

score 2 · Answer 2 · 2016-11-28

I quite don't understand why, with an informatic point of view, current annotations formats as gff, embl, or whatever, are used while standard informatic formats such as json could work

I would say it's because most people use line-oriented/linux tools to process the data.

While I agree with you about the structured formats (JSON/XML/ASN.1/...), they would make it much harder to quickly select or filter the data in cases where a simple grep/awk would do.

score 2 · Answer 3 · 2016-11-28

Part of the deal with JSON is buy-in. You need people to be able to parse, chop, and filter it easily. Tools such as jq make this doable on the command line, but the interface has a learning curve.

Further, these tools are not part of a standard Unix setup, whereas utilities like cut, join, awk etc. are readily available, stable, with interfaces that haven't changed in years.

Also, you need a JSON structure that is consistent. Line-based formats have fixed fields arranged in columns, with delimiters inside a field when it holds multiple values, so some of these formats are more consistent.

JSON is more open-ended in terms of what you can put into it, and that means more ambiguity when parsing. Some approaches to resolving this include enforcing schemas, e.g., the use of JSON Schema in other scientific fields (http://json-schema.org/) to enforce structure and type validation.
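As a rough illustration of what schema enforcement buys you, the sketch below checks JSON records against a tiny hand-written schema using only the Python standard library. The field names and the `validate_record` helper are assumptions for this example; it is not JSON Schema itself, which is far more expressive.

```python
import json

# Hypothetical schema for this example: field name -> required Python type.
GENE_SCHEMA = {"seqid": str, "start": int, "end": int, "strand": str}


def validate_record(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif type(record[name]) is not expected:
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


good = json.loads('{"seqid": "chr1", "start": 1000, "end": 2000, "strand": "+"}')
# Same kind of record, but "start" is a string and "end" is absent.
bad = json.loads('{"seqid": "chr1", "start": "1000", "strand": "+"}')

good_problems = validate_record(good, GENE_SCHEMA)
bad_problems = validate_record(bad, GENE_SCHEMA)
print(good_problems)
print(bad_problems)
```

Both inputs are perfectly valid JSON; only the schema check distinguishes the usable record from the broken one, which is the ambiguity described above.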

BSON (binary JSON) is an option for indexed lookups. One could imagine a future binary sequencing format that uses something like this, perhaps, but it isn't textual, so you'd need special tools to do queries and processing, much as non-standard tools are required for tabix.

The WashU regulatory browser uses a format for its annotations that mixes JSON and BED. The first three columns represent the position and size of a genomic interval, while the remainder is a JSON-like string (possibly strictly JSON, haven't looked in a while) that sets up key-value pairs for annotation attributes, like gene name, strand, etc. This hybrid approach gives the user the advantages of fast binary searches on sorted BED input, line parsing with common Unix tools, and the open-ended extensibility of a JSON object to describe attributes.
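A minimal Python sketch of the hybrid idea, under several assumptions: the exact WashU layout is not reproduced here, the three example lines are invented, the fourth column is taken to be strict JSON, and the intervals are assumed to be sorted, non-overlapping, and on a single chromosome.

```python
import bisect
import json

# Hypothetical hybrid lines: three BED columns, then a JSON object.
HYBRID_TEXT = (
    'chr1\t1000\t2000\t{"name": "geneA", "strand": "+"}\n'
    'chr1\t5000\t6000\t{"name": "geneB", "strand": "-"}\n'
    'chr1\t9000\t9500\t{"name": "geneC", "strand": "+"}\n'
)


def parse_hybrid(text):
    """Parse each line into (chrom, start, end, attributes)."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first three tabs only; the rest is the JSON payload.
        chrom, start, end, payload = line.split("\t", 3)
        records.append((chrom, int(start), int(end), json.loads(payload)))
    return records


def find_at(records, starts, position):
    """Binary-search for the interval containing a 0-based position.

    Assumes sorted, non-overlapping, half-open [start, end) intervals.
    """
    index = bisect.bisect_right(starts, position) - 1
    if index >= 0 and position < records[index][2]:
        return records[index]
    return None


records = parse_hybrid(HYBRID_TEXT)
starts = [record[1] for record in records]

hit = find_at(records, starts, 5500)
miss = find_at(records, starts, 3000)
print(hit)
print(miss)
```

The positional columns stay usable by cut, sort, and awk, while the attribute column can grow new keys without changing the column layout.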

score 2 · Answer 4 · 2016-11-28

Annotations are graph-based data. 1bp of DNA can have many annotations, and 1 annotation can refer to many bases of DNA (non-consecutive). Unfortunately, none of the formats you just described are graphs, so yes they are all a poor choice (bioinformatics has a long and proud history of using the wrong data-structure for the job).

However, JSON would be even worse. While JSON is used by many people in industry to store highly relational data, the consensus over the last 4 years is that this was a really bad idea. Highly relational data doesn't fit into SQL nicely, so the solution was to throw structure out of the window and use MongoDB, Redis, and all these other document stores backed by JSON. This was a huge and costly screwup for many in industry, mainly taking out startups you'll never hear about as a result. Diaspora (distributed Facebook) is the classic example. JSON is for storing unstructured data, not highly relational data. Don't let people conflate the two. Use JSON when no assumptions about the data can be made; it has no schema, after all. For anything else, use a proper binary serialization format such as MessagePack (msgpack), or something even more structured such as protobuf. The only exception is when you're using jsonb in PostgreSQL and, soon, all the other SQL engines.

Also, being human-readable is a curse, not a benefit. Users often conflate human-readable with human-writable, and for unstructured data like JSON that can be really bad. Bioinformatics file formats should be all-binary all the time. The only time people defend non-binary file formats is when they want to use awk/grep/sed/etc. instead of a proper parser, and so give up all the safety benefits real parsers bring.

score 1 · Answer 5 · 2016-11-28

In addition to what others said, JSON is primarily a serialization format. We do not have an official standard to formally define JSON schemas. Without a schema specifying field names and types, a program won't be able to retrieve data reliably. This problem seems small but bites you hard when you get down to the implementation details. It has been a big headache for GA4GH, which finally settled on protobuf, but relying on protobuf means a non-trivial dependency and will push biologists away.