Hg37: Bad Annotations From Ensembl?
13.3 years ago
Pablo ★ 1.9k

I've recently noticed weird entries in hg37.63 from Ensembl. As an example, here is the first exon of transcript ENST00000310701:

1       protein_coding  exon    148025761       148025848       .       -       .        gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001";
1       protein_coding  CDS     148025761       148025848       .       -       2        gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001"; protein_id "ENSP00000309907";

This seems to be a protein coding transcript. Exon and CDS start and end at the same position, which means there is no UTR.

Here is the weird part: If you query Ensembl for variants at the start position and one base before, you get

Uploaded Variation  Location    Allele  Gene    Feature Feature type    Consequence Position in cDNA    Position in CDS Position in protein Amino acid change   Codon change    Co-located Variation    Extra
1_148025849_A   1:148025849 A   ENSG00000122497 ENST00000310701 Transcript  UPSTREAM    -   -   -   -   -   -   -
1_148025848_A   1:148025848 A   ENSG00000122497 ENST00000310701 Transcript  SYNONYMOUS_CODING   1   2   1   X   nAa/nTa -   -

So, the start base (148025848) is the SECOND base of the first codon. If you take a detailed look at the GTF definition, you'll notice a '2' in the 'frame' column.
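
To make the 'frame' reading concrete, here is a minimal Python sketch that parses the CDS line above and reports which codon position its 5'-most base falls on. It assumes the GTF 2.2 meaning of frame (0 = the feature starts on the first base of a codon, 1 = one extra base precedes the first whole codon, 2 = two extra bases precede it). The names `GtfFeature`, `parse_gtf_line` and `first_base_codon_position` are made up for this illustration and are not part of any Ensembl tool.

```python
import re
from typing import NamedTuple


class GtfFeature(NamedTuple):
    seqname: str
    feature: str
    start: int
    end: int
    strand: str
    frame: str
    attributes: dict


def parse_gtf_line(line: str) -> GtfFeature:
    """Parse one GTF line; the ninth field holds the key "value"; pairs."""
    # Real GTF is tab-separated; fall back to whitespace for pasted text
    # (assumption: the first eight fields contain no spaces).
    fields = line.rstrip("\n").split("\t") if "\t" in line else line.split(None, 8)
    seqname, _source, feature, start, end, _score, strand, frame, attrs = fields
    attributes = dict(re.findall(r'(\S+) "([^"]*)"', attrs))
    return GtfFeature(seqname, feature, int(start), int(end), strand, frame, attributes)


def first_base_codon_position(cds: GtfFeature) -> tuple:
    """Return (coordinate of the 5'-most CDS base, its position within a codon).

    Only meaningful for CDS features, whose frame is 0, 1 or 2.
    """
    # On the minus strand the 5' end of the feature is its 'end' coordinate.
    five_prime = cds.end if cds.strand == "-" else cds.start
    frame = int(cds.frame)
    # GTF 2.2: frame 0 -> base 1 of a codon, frame 1 -> base 3, frame 2 -> base 2.
    return five_prime, (3 - frame) % 3 + 1


line = ('1\tprotein_coding\tCDS\t148025761\t148025848\t.\t-\t2\t'
        'gene_id "ENSG00000122497"; transcript_id "ENST00000310701"; '
        'exon_number "1"; gene_name "NBPF14"; transcript_name "NBPF14-001"; '
        'protein_id "ENSP00000309907";')
cds = parse_gtf_line(line)
print(first_base_codon_position(cds))  # (148025848, 2)
```

Under that reading of the frame column, base 148025848 is codon position 2, which agrees with the "Position in CDS" value of 2 in the Ensembl output above.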

The question is: Considering that the transcript has no UTR, is there a valid reason for the first base of the first exon to be the second base of the CDS?

I guess an alternative question is: Am I misinterpreting this data, or does this look like a bug?

genome snp • 4.1k views

According to my interpretation of this GTF 2.2 specification (http://mblab.wustl.edu/GTF22.html), the "frame" calculation on these transcripts seems to be incorrect.


It looks like there are around 5000 transcripts in hg37.63 that may have a similar problem.
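
One way such a count could be reproduced is sketched below: group the CDS lines by transcript, pick the 5'-most CDS segment according to strand, and flag the transcript when that segment's frame is 1 or 2. This is only a sketch under stated assumptions: the input is tab-separated GTF, `five_prime_incomplete` is a hypothetical helper name, and the `TX_COMPLETE` / `GENE_X` lines are invented test data rather than real Ensembl records.

```python
import re
from collections import defaultdict


def five_prime_incomplete(gtf_lines):
    """Return {transcript_id: frame} for transcripts whose 5'-most CDS has frame 1 or 2."""
    cds_by_transcript = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        start, end, strand, frame = int(fields[3]), int(fields[4]), fields[6], fields[7]
        cds_by_transcript[match.group(1)].append((start, end, strand, frame))

    flagged = {}
    for transcript_id, segments in cds_by_transcript.items():
        strand = segments[0][2]
        # The 5'-most CDS segment has the highest end on '-', the lowest start on '+'.
        if strand == "-":
            first = max(segments, key=lambda s: s[1])
        else:
            first = min(segments, key=lambda s: s[0])
        if first[3] in ("1", "2"):
            flagged[transcript_id] = int(first[3])
    return flagged


example = [
    # CDS line quoted in the question (attributes shortened).
    '1\tprotein_coding\tCDS\t148025761\t148025848\t.\t-\t2\t'
    'gene_id "ENSG00000122497"; transcript_id "ENST00000310701";',
    # Hypothetical complete transcript, invented for this illustration.
    '1\tprotein_coding\tCDS\t1000\t1099\t.\t+\t0\t'
    'gene_id "GENE_X"; transcript_id "TX_COMPLETE";',
    '1\tprotein_coding\tCDS\t2000\t2050\t.\t+\t2\t'
    'gene_id "GENE_X"; transcript_id "TX_COMPLETE";',
]
print(five_prime_incomplete(example))  # {'ENST00000310701': 2}
```

Passing an open file handle for a whole GTF file and taking `len()` of the result would give a transcript count, though whether it matches the figure above depends on exactly how "similar problem" is defined.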

13.3 years ago
Bert Overduin ★ 3.7k

Pablo,

If you have a look at http://www.ensembl.org/Homo_sapiens/Transcript/Sequence_cDNA?g=ENSG00000122497;r=1:148003642-148025848;t=ENST00000310701, you can see that this transcript (annotated by the Havana team) is incomplete at the 5' end and starts at the second base of a codon. So, that should explain your observation.

Hope this helps.

By the way, it's either GRCh37 or hg19, but not hg37 .... ;)

Cheers, Bert


Unfortunately, this causes a lot of trouble for people parsing and trying to understand these incomplete annotations.

I always alienate people by saying hg37 instead of GRCh37. I guess I'm too lazy to write 2 extra letters :-)

13.3 years ago
Sander Timmer ▴ 710

This doesn't answer your question, but I have one piece of advice for you. Ensembl has a dedicated Helpdesk team that you can email with questions or possible bugs. Just tell them what you did and what kind of result you expected.

You can contact them at http://www.ensembl.org/info/about/contact/index.html
