gff3validator: gff3 validation error
8.6 years ago
firestar ★ 1.6k

I downloaded the GFF3 annotation file for zebrafish from NCBI and ran gff3validator from GenomeTools as follows:

gt gff3validator GCF_000002035.5_GRCz10_genomic.gff

and I get the following error:

gt gff3validator: error: CDS feature on line 621884 in file "GCF_000002035.5_GRCz10_genomic.gff" has the wrong phase 1 (should be 0)

I have not modified the file in any way. What does this error mean? Should I be concerned that it could lead to potential issues? How can this be fixed? Thanks.

RNA-Seq annotation gff • 4.0k views

Why do you need to "validate" this data from NCBI? Is there a tool or analysis that is not working with these annotations?


FWIW I would hope that every GFF file in RefSeq validates.


I would think so, and this is one of the primary model species used in biomedical research so I doubt there are any major issues. It seems rather academic to validate a file by one definition just for the sake of it.

8.6 years ago

I became interested in tracking this down and repeated what the OP did. I am trying to determine whether this is an error in the GFF file or an error in the validator.

I do get the same error as the OP:

gt gff3validator: error: CDS feature on line 621884 in file "GCF_000002035.5_GRCz10_genomic.gff" has the wrong phase 1 (should be 0)

To obtain the line:

 awk 'NR==621884' GCF_000002035.5_GRCz10_genomic.gff

This produces the offending line:

NC_007121.6 BestRefSeq  CDS 21750127    21750129    .   +   1   ID=cds20891;Parent=rna31380;Dbxref=GeneID:553997,Genbank:NP_001019280.1,ZFIN:ZDB-GENE-050609-5;Name=NP_001019280.1;Note=The RefSeq protein has 9 substitutions compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=pcdh1g9;product=protocadherin 1 gamma 9;protein_id=NP_001019280.1
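
To read the phase off such a line programmatically, here is a minimal sketch. It assumes the nine tab-separated columns that the GFF3 specification describes; `parse_gff3_line` is a hypothetical helper name, and the attribute column of the example line is abbreviated for readability.

```python
# Column names follow the GFF3 specification; the helper itself is illustrative.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are semicolon-separated key=value pairs.
    record["attributes"] = dict(
        item.split("=", 1) for item in record["attributes"].split(";") if item
    )
    return record


# The offending line, with the attribute column shortened (assumption: tabs restored).
line = ("NC_007121.6\tBestRefSeq\tCDS\t21750127\t21750129\t.\t+\t1\t"
        "ID=cds20891;Parent=rna31380;gene=pcdh1g9")
record = parse_gff3_line(line)
print(record["type"], record["phase"], record["end"] - record["start"] + 1)
```

The phase is kept as a string here because non-CDS features carry "." in that column.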

The GenBank record at https://www.ncbi.nlm.nih.gov/nuccore/NC_007121.6 shows the CDS as:

join(21744267..21746679,21746681,21750127..21750129,
                 21750131..21750164,21855170..21855228,21858956..21859074,
                 21860119..21860155,21860973..21860979)
                 /gene="pcdh1g9"
                 /gene_synonym="DrPcdh1g8"
                 /inference="similar to AA sequence (same
                 species):RefSeq:NP_001019280.1"
                 /exception="annotated by transcript or proteomic data"
                 /note="The RefSeq protein has 9 substitutions compared to
                 this genomic sequence; Derived by automated computational
                 analysis using gene prediction method: BestRefSeq."
                 /codon_start=1
                 /product="protocadherin 1 gamma 9"
                 /protein_id="NP_001019280.1"
                 /db_xref="GI:66773380"
                 /db_xref="GeneID:553997"
                 /db_xref="ZFIN:ZDB-GENE-050609-5"

This shows that it is the third CDS segment that raises the error. Add up the lengths of the preceding CDS segments and see how far the total is from a multiple of 3. That determines the phase.

>>> size = (21746679 - 21744267 + 1) + 1  # first segment plus the single-base second segment
>>> divmod(size, 3)
(804, 2)

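The remainder is 2, so the preceding segments end two bases into a codon. The third segment has to supply one base to complete that codon, which means its first full codon starts one base in. So the phase should be 1.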

In other words, the GFF is correct and the validator is incorrect.
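
The same arithmetic can be repeated for every segment of the join. This is a rough sketch, assuming a plus-strand CDS listed in transcript order and the /codon_start=1 shown in the GenBank record; `expected_phases` is a name made up for illustration and is not part of GenomeTools.

```python
def expected_phases(segments, codon_start=1):
    """Return the GFF3 phase of each plus-strand CDS segment, in transcript order.

    segments: list of (start, end) pairs, 1-based and inclusive.
    codon_start: GenBank /codon_start qualifier (1, 2 or 3).
    """
    phases = []
    # codon_start=1 means the first base of the first segment opens a codon.
    phase = (codon_start - 1) % 3
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last full codon must be completed
        # by the next segment, which sets that segment's phase.
        phase = (phase - length) % 3
    return phases


# Coordinates copied from the join(...) in the GenBank record above.
segments = [
    (21744267, 21746679), (21746681, 21746681), (21750127, 21750129),
    (21750131, 21750164), (21855170, 21855228), (21858956, 21859074),
    (21860119, 21860155), (21860973, 21860979),
]
phases = expected_phases(segments)
print(phases[2])  # phase of the third segment, the one on line 621884
```

For the third segment this gives 1, which agrees with the value in the GFF file.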

8.6 years ago

For what the phase means, see the GFF3 specification:

http://www.sequenceontology.org/gff3.shtml

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region.
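
As a small illustration of that definition, the sketch below drops `phase` bases from the start of a segment and splits the rest into codons. The sequence is a made-up toy example, not zebrafish data, and `codons_in_segment` is a hypothetical helper.

```python
def codons_in_segment(seq, phase):
    """Remove `phase` bases from the start of seq, then return the complete codons."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    trimmed = seq[phase:]
    # Ignore a trailing partial codon; it is completed by the next segment.
    usable = len(trimmed) - len(trimmed) % 3
    return [trimmed[i:i + 3] for i in range(0, usable, 3)]


toy = "GATGGCCTAA"  # toy sequence, assumed for illustration only
for phase in (0, 1, 2):
    print(phase, codons_in_segment(toy, phase))
```

With phase 1 the toy segment reads as ATG, GCC, TAA, while phase 0 or 2 yields a different set of codons from the same bases, which is why a wrong phase matters to tools that translate the CDS.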

Your validator seems to say that the phase entered in the GFF is invalid. For example, either this is the first CDS segment, or the length of the coding sequence so far is such that the codon should start right away, yet the file claims it starts 1 base in.

Some tools use the CDS phase information, many do not. It is still concerning that the data seems incorrect.

Of course, it is also possible that the validator made a mistake, but that is usually less likely.
