I found a gff file from NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) that appears to have non-compliant formatting: on some lines the start position is greater than the end position. Here is an example line:
NC_007982.1 RefSeq mRNA 691776 267232 . ? . ID=rna-ZeamMp017;Parent=gene-ZeamMp017;Dbxref=GeneID:37545003;gbkey=mRNA;gene=nad1;locus_tag=ZeamMp017
Note that the 691776 in column 4 is greater than the 267232 in column 5. According to the gff3 spec at https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md this is not allowed: the start must always be less than or equal to the end. Consequently, this file cannot be loaded into my genome browser (JBrowse 2), which seems strict about the formatting.
The gff file came in the dataset for GCF_902167145.1 (Zea mays version 5).
My questions are:
- Am I right that this is a mis-formatted gff file?
- Has anyone seen this in other gff files from RefSeq / NCBI Datasets? Is this a RefSeq-wide issue or just an issue with this particular maize dataset?
Yes, I know I can find and remove or fix the defective lines with a small script. But I would think malformed gff files should be fixed at the source, in NCBI Datasets.
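For anyone who wants to check their own files, here is a minimal sketch of the kind of check I mean. It assumes a plain-text, tab-separated gff3 file with nine columns; the function name `find_inverted_features` is just something I made up for illustration, not part of any library. It only reports lines where column 4 is greater than column 5 and does not try to repair them.

```python
from typing import Iterable, Iterator, Tuple


def find_inverted_features(lines: Iterable[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (line_number, start, end) for feature lines where start > end.

    Assumes tab-separated gff3 with nine columns (hypothetical helper).
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # An embedded FASTA section, if present, ends the annotation part.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line; ignored in this sketch
        try:
            start, end = int(columns[3]), int(columns[4])
        except ValueError:
            continue  # non-numeric coordinates are out of scope here
        if start > end:
            yield line_number, start, end
```

It can be run over an open file handle, for example `list(find_inverted_features(open("genomic.gff")))`, where `genomic.gff` stands in for whatever the downloaded file is called.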
yes