I've recently noticed weird entries in hg37.63 from Ensembl. As an example, here are the exon and CDS records for the first exon of transcript ENST00000310701:
1 protein_coding exon 148025761 148025848 . - . gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001";
1 protein_coding CDS 148025761 148025848 . - 2 gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001"; protein_id "ENSP00000309907";
This seems to be a protein coding transcript. Exon and CDS start and end at the same position, which means there is no UTR.
Here is the weird part: If you query Ensembl for variants at the start position and one base before, you get
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
1_148025849_A 1:148025849 A ENSG00000122497 ENST00000310701 Transcript UPSTREAM - - - - - - -
1_148025848_A 1:148025848 A ENSG00000122497 ENST00000310701 Transcript SYNONYMOUS_CODING 1 2 1 X nAa/nTa - -
So the start base (148025848, the 5' end of the exon because the transcript is on the minus strand) is the SECOND base of the first codon. If you look closely at the GTF records, you'll notice a '2' in the 'frame' column of the CDS line.
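To make the reading of the 'frame' column explicit, here is a minimal sketch of how I understand the GTF 2.2 definition: frame is the number of extra bases before the first whole codon in the feature. The function name is my own and only for illustration; it is not part of any Ensembl tool.

```python
def codon_position_of_first_base(frame):
    """Return the codon position (1, 2 or 3) occupied by the 5'-most base
    of a CDS feature, given its GTF 'frame' value.

    Assumption (my reading of GTF 2.2): frame = number of bases to skip
    before the first complete codon starts.
      frame 0 -> feature starts on codon base 1
      frame 1 -> one extra base first (codon base 3)
      frame 2 -> two extra bases first (codon bases 2 and 3)
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    return (3 - frame) % 3 + 1


# The CDS record above has frame 2, so its first base is codon base 2,
# which matches the "Position in CDS = 2" reported for 1:148025848.
print(codon_position_of_first_base(2))
```

Under that reading, the '2' in the GTF and the variant annotation agree with each other; what remains unexplained is where codon base 1 is supposed to be.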
The question is: Considering that the transcript has no UTR, is there a valid reason for the first base of the first exon to be the second base of the CDS?
I guess an alternative question is: am I misinterpreting this data, or does this look like a bug?
According to my interpretation of the GTF 2.2 specification (http://mblab.wustl.edu/GTF22.html), the "frame" calculation on these transcripts seems to be incorrect.
It looks like there are around 5000 transcripts in hg37.63 that may have a similar problem.
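For anyone who wants to reproduce the count, this is roughly the kind of check involved: a sketch, not the exact script. It assumes a tab-separated, 9-column GTF with a `transcript_id` attribute, and it picks the 5'-most CDS of each transcript by coordinates and strand rather than trusting `exon_number`. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: transcript_id "ENST00000310701";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one GTF line into a dict, or return None if it is not 9 columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    _seq, _src, feature, start, end, _score, strand, frame, attrs = cols
    return {
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "frame": frame,
        "attrs": dict(ATTR_RE.findall(attrs)),
    }


def transcripts_with_nonzero_first_frame(lines):
    """Map transcript_id -> frame for transcripts whose 5'-most CDS has frame != 0."""
    cds_by_tx = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue
        rec = parse_gtf_line(line)
        if rec and rec["feature"] == "CDS":
            cds_by_tx[rec["attrs"]["transcript_id"]].append(rec)

    flagged = {}
    for tx, recs in cds_by_tx.items():
        # On the minus strand the 5'-most CDS has the largest end coordinate.
        if recs[0]["strand"] == "-":
            first = max(recs, key=lambda r: r["end"])
        else:
            first = min(recs, key=lambda r: r["start"])
        if first["frame"] not in ("0", "."):
            flagged[tx] = int(first["frame"])
    return flagged


# The two records quoted in this post, rebuilt with tab separators.
ATTRS = ('gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; '
         'exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001";')
sample = [
    "\t".join(["1", "protein_coding", "exon", "148025761", "148025848",
               ".", "-", ".", ATTRS]),
    "\t".join(["1", "protein_coding", "CDS", "148025761", "148025848",
               ".", "-", "2", ATTRS + ' protein_id "ENSP00000309907";']),
]
print(transcripts_with_nonzero_first_frame(sample))
```

To run it on a whole annotation, pass an open file handle for the GTF instead of `sample`. Note that this flags every transcript whose first CDS has a non-zero frame, whether or not it also lacks a UTR.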