What Is The Correct Specification For GFF3?
12.1 years ago

I am retrieving some GFF3 files for Arabidopsis from this FTP site.

The issue is that a conversion script I use to turn these into another format gets stuck because some lines end with a trailing semi-colon and others do not. For example, here are two lines that show the contrast:

Chr1    TAIR9    five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1
Chr1    TAIR9    CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;

I can do the following to strip the semi-colon, no big deal:

$ awk '{gsub(/;$/,"");print}' TAIR9_GFF3_genes.gff | ./gff2foo
...

But I can also fix this long-term by editing the conversion script, and I want to do that if the specification says a trailing semi-colon is legal. I also don't want to introduce hacky fixes if this file is bogus.

What is the correct format for GFF3? Are trailing semi-colons allowed or are these broken GFF3 files?

12.1 years ago

Here is the detailed GFF3 specification: http://www.sequenceontology.org/gff3.shtml

It basically says that the tag=value pairs in the 9th column are separated by semi-colons, which implies there should be no trailing semi-colon.

But as you may have found, not everyone follows it strictly.
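If you do decide to make a conversion script tolerant of such files, one option is to skip empty fields when splitting the 9th column. The following is only a minimal Python sketch, not the code of any existing converter: the function name `parse_attributes` and its `strict` flag are hypothetical, and it does not decode URL-escaped characters in values.

```python
def parse_attributes(column9, strict=False):
    """Split a GFF3 attributes column into a dict of tag -> list of values.

    Hypothetical helper for illustration; `strict` is an assumed option
    that rejects empty fields such as those left by a trailing semi-colon.
    """
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            # An empty field comes from a trailing or doubled semi-colon.
            if strict:
                raise ValueError("empty attribute field in %r" % column9)
            continue
        tag, sep, value = field.partition("=")
        if not sep:
            raise ValueError("attribute without '=': %r" % field)
        # Multiple values for one tag are comma-separated.
        attributes[tag] = value.split(",")
    return attributes


# The CDS line from the question, which ends with a trailing semi-colon.
line = "Chr1\tTAIR9\tCDS\t3760\t3913\t.\t+\t0\tParent=AT1G01010.1,AT1G01010.1-Protein;"
column9 = line.rstrip("\n").split("\t")[8]
print(parse_attributes(column9))
```

With `strict=False` both example lines from the question parse to the same kind of result; with `strict=True` the second one raises a `ValueError`, which roughly mirrors what a strict reading of the spec would do.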

Here is a GFF3 validator: http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online


Thanks. The spec is ambiguous but suggests that a key-value pair is needed if there is a semi-colon. Your validator was useful. It seems to think the input file is illegal for the same reason (among others), so I think the conversion script is in line with the specification.
