I am retrieving some GFF3 files for Arabidopsis from this FTP site.
The issue is that a conversion script I use to turn these into another format gets stuck because some lines have a trailing semi-colon and others do not. For example, here are two lines that show the contrast:
Chr1 TAIR9 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
I can do the following to strip the semi-colon, no big deal:
$ awk '{gsub(/;$/,"");print}' TAIR9_GFF3_genes.gff | ./gff2foo
...
But I could also "fix" this long-term by editing the conversion script, and I would like to do that if the specification says a trailing semi-colon is "legal". I don't want to introduce hacky fixes if the file itself is bogus.
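To make the long-term option concrete, here is a minimal sketch of what a tolerant parser for the ninth (attributes) column might look like. It is written in Python purely for illustration; `parse_gff3_attributes` is a hypothetical name and is not taken from the actual conversion script. It assumes attributes are `key=value` pairs separated by semi-colons, with multiple values separated by commas, and it does not attempt percent-decoding. Whether skipping empty fields is the right behaviour depends on the answer to the question below.

```python
def parse_gff3_attributes(column9):
    """Parse a GFF3 attributes column into a dict of key -> list of values.

    Hypothetical helper: empty fields (such as the one left behind by a
    trailing semi-colon) are skipped instead of being treated as errors.
    """
    attributes = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            # Empty field, e.g. after a trailing ';' -- ignore it.
            continue
        key, sep, value = field.partition("=")
        if not sep:
            # A non-empty field without '=' is still rejected.
            raise ValueError(f"attribute without '=': {field!r}")
        # Multiple values for one key are comma-separated.
        attributes[key] = value.split(",")
    return attributes


if __name__ == "__main__":
    print(parse_gff3_attributes("Parent=AT1G01010.1"))
    print(parse_gff3_attributes("Parent=AT1G01010.1,AT1G01010.1-Protein;"))
```

With this approach both example lines above parse the same way, and a field that is non-empty but lacks an `=` still raises an error instead of being silently accepted.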
What is the correct format for GFF3? Are trailing semi-colons allowed or are these broken GFF3 files?
Thanks. The spec is ambiguous but suggests that a key-value pair is needed if there is a semi-colon. Your validator was useful. It seems to think the input file is illegal for the same reason (among others), so I think the conversion script is in line with the specification.