Ensembl and gene prediction tools output CDSs whose lengths are not divisible by 3
I have been looking at a number of different ORF tools, such as Prodigal, and at GFF files from databases such as Ensembl, and both report genes/CDSs whose lengths are not divisible by 3. Examples below:
Chromosome Prodigal_v2.6.2 CDS 686 1828 131.5 + 0 ID=1_1;partial=00
Chromosome ena CDS 686 1828 . + 0 ID=CDS:AAC71217
Are we supposed to count one end of the CDS differently from another?
1828 - 686 is 1142, and 1142 modulo 3 is 2.
Is there something I am not understanding?
Many thanks.
Many thanks for the answers. While I guessed it was something like this, I could not find any information from the GFF database providers or ORF prediction tools stating which coordinate system is being used for any particular file/data. I see it is noted in the link that GFF files are 1-based. Is this true for all GFF files, and therefore 'SHOULD' all ORF predictors conform to this?
Thanks again.
Note: The website gave an error when I tried to submit a general comment so I responded to the first answer.
GFF is by definition 1-based, but you have no guarantee that every submitter follows this. If you search long enough, you will surely find 0-based GFF files; bioinformatics is a mess after all :)
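If you would rather sanity-check a file than trust it, a rough sketch like the one below can flag suspicious records. The helper name `suspicious_cds` and the second test record are made up for illustration, and the parsing assumes plain tab- or whitespace-delimited columns. A start of 0 cannot occur in 1-based coordinates. A length that is not a multiple of 3 is a weaker signal, since it can also come from a partial gene or from a CDS split across several exons, so treat hits as hints rather than proof.

```python
def suspicious_cds(gff_lines):
    """Yield (line, reason) for CDS records that look inconsistent
    with 1-based, inclusive coordinates."""
    for line in gff_lines:
        # Skip blank lines and GFF comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Real GFF is tab-delimited; fall back to whitespace for pasted text.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 5 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        if start < 1:
            yield line, "start < 1 (impossible if 1-based)"
        elif (end - start + 1) % 3 != 0:
            yield line, "length not a multiple of 3 (1-based, inclusive)"


records = [
    # Record taken from the question.
    "Chromosome ena CDS 686 1828 . + 0 ID=CDS:AAC71217",
    # Made-up record, only here to show what a 0-based start looks like.
    "Chromosome made_up CDS 0 299 . + 0 ID=hypothetical_0_based",
]
flagged = list(suspicious_cds(records))
for line, reason in flagged:
    print(reason, "|", line)
```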
I agree, but I would say this is quite unlikely. As I show here, the 1-based system is one of the rare things that has been well defined since the beginning of the format in 1997, i.e.:
Integers. <start> must be less than or equal to <end>. Sequence numbering starts at 1, so these numbers should be between 1 and the length of the relevant sequence, inclusive.
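Because both coordinates are inclusive under that definition, the length of a feature is end - start + 1, not end - start. A minimal sketch in Python using the Prodigal line from the question (the function name `gff_feature_length` is just an illustrative choice, and the parsing assumes plain tab- or whitespace-delimited columns):

```python
def gff_feature_length(gff_line: str) -> int:
    """Length of one GFF feature, treating start and end as 1-based, inclusive."""
    # Real GFF is tab-delimited; fall back to whitespace for pasted text.
    fields = gff_line.split("\t") if "\t" in gff_line else gff_line.split()
    start, end = int(fields[3]), int(fields[4])
    # Both endpoints belong to the feature, hence the +1.
    return end - start + 1


line = "Chromosome Prodigal_v2.6.2 CDS 686 1828 131.5 + 0 ID=1_1;partial=00"
length = gff_feature_length(line)
print(length, length % 3)  # prints: 1143 0
```

So the CDS from the question is 1143 bases long, i.e. 381 codons, and the apparent remainder of 2 comes only from leaving out the +1.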