Question

GenBank Flat File (.gb) Proper Usage

11.4 years ago

mobiusklein ▴ 180

I'm attempting to convert my collection of scattered annotations into a unified GenBank Flat File. I've been looking at how different programs interact with the format: some accept only a subset of the feature types, others arbitrarily shoehorn the data into a feature type, and still others simply use the feature type as a sort of XML analog for loading their annotations in and out.

Is there a "Right" way to use the GB format?

I've seen the GFF format, but the GFF3 specification separates annotation data from sequence data. This is really good for big, abstracted data models, but isn't what I'm looking for at the moment because I still need to be able to convey that linkage to myself when I read my own data. Am I wrong to avoid GFF for this reason? What are other file formats I should look at?

Thank you

genbank • 7.2k views

updated 22 months ago by Ram 44k • written 11.4 years ago by mobiusklein ▴ 180

score 1 · Answer 1 · 2013-08-23

The GenBank and EMBL formats work well for sequence annotation, and either would be a good choice if you're thinking about submitting your annotated genome to NCBI/EMBL/DDBJ. Just follow the standard rather than deviating too far from it. For example, don't make up your own feature types! See the feature table definition at http://www.insdc.org/files/feature_table.html
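
As a rough illustration of "don't make up your own feature types", here is a minimal sketch that lists the feature keys in a flat file and flags any that are not in a known set. It assumes the usual flat-file layout (feature keys starting in column 6, qualifiers indented further). `KNOWN_KEYS` is a deliberately partial list and `EXAMPLE` is a hand-written, abbreviated record, both made up for this example; the INSDC feature table definition is the authoritative list.

```python
# Partial list of INSDC feature keys, for illustration only (an assumption of
# this sketch); consult the feature table definition for the full set.
KNOWN_KEYS = {"source", "gene", "CDS", "mRNA", "tRNA", "rRNA",
              "exon", "intron", "repeat_region", "misc_feature"}


def feature_keys(flat_file_text):
    """Yield the feature key of each entry in the FEATURES table."""
    in_features = False
    for line in flat_file_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if in_features:
            if line and not line[0].isspace():
                break  # next top-level section, e.g. ORIGIN
            # Feature keys start in column 6; qualifier lines are indented further.
            if len(line) > 5 and line[:5] == "     " and not line[5].isspace():
                yield line[5:].split()[0]


def unknown_keys(flat_file_text):
    """Return the sorted feature keys that are not in KNOWN_KEYS."""
    return sorted({k for k in feature_keys(flat_file_text) if k not in KNOWN_KEYS})


# Hand-written, abbreviated record (hypothetical data) with one invented key.
EXAMPLE = """\
LOCUS       EXAMPLE1                  24 bp    DNA     linear   UNK 01-JAN-1980
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
     gene            1..12
                     /gene="abc"
     my_widget       13..24
                     /note="made-up key"
ORIGIN
        1 atgaaacgca ttagcaccac catt
//
"""

print(unknown_keys(EXAMPLE))
```

A check like this only catches invented keys; it says nothing about whether the qualifiers or locations are valid, so a real parser is still the better tool for anything beyond a quick sanity pass.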

GFF3 does allow you to include the sequences too, in a FASTA section at the end of the file; however, the data is commonly kept as two files (GFF3 and FASTA). This makes sense if your annotation goes through several revisions while the sequence doesn't change. See http://www.sequenceontology.org/gff3.shtml
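
To show what the single-file layout looks like, here is a minimal sketch that splits GFF3 text into its feature rows and the sequences that follow the `##FASTA` directive. The function name `split_gff3` and the sample data are made up for this example, and the sketch skips things a real parser would handle (attribute decoding, escaping, other directives).

```python
def split_gff3(text):
    """Split GFF3 text into (feature rows, {seqid: sequence})."""
    features, sequences = [], {}
    in_fasta, current = False, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if in_fasta:
            if line.startswith(">"):
                # The sequence ID is the first word of the FASTA header.
                current = line[1:].split()[0]
                sequences[current] = ""
            elif current is not None:
                sequences[current] += line.strip()
        elif line.startswith("##FASTA"):
            in_fasta = True  # everything after this directive is FASTA
        elif not line.startswith("#"):
            # Nine tab-separated columns: seqid, source, type, start, end,
            # score, strand, phase, attributes.
            features.append(line.split("\t"))
    return features, sequences


# Hypothetical one-gene example with the sequence embedded in the same file.
GFF3_EXAMPLE = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1\t12\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">chr1\n"
    "atgaaacgcatt\n"
    "agcaccaccatt\n"
)

features, sequences = split_gff3(GFF3_EXAMPLE)
for seqid, source, ftype, start, end, *rest in features:
    # GFF3 coordinates are 1-based and inclusive.
    print(ftype, sequences[seqid][int(start) - 1:int(end)])
```

Keeping both parts in one file preserves the annotation-to-sequence linkage the question asks about, at the cost of rewriting the sequence every time the annotation is revised.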

(A related question is what tools are recommended for working with these file formats, e.g. graphical editors, parsers and writers, etc.)