Find stop codon for coding transcript in gencode gtf if not given?
7.1 years ago

I'm trying to determine how to find the stop codon position for a protein coding transcript if there is no stop codon feature listed for it in a gencode gtf file.

For example, an insulin transcript (ENST00000421783.1) is listed as a protein coding transcript in the gencode GRCh37 gtf, and has start codon, CDS, exon features listed, but no stop codon:

chr11   HAVANA  transcript  2181013 2182388 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr11   HAVANA  exon    2182015 2182388 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1_1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160785-2161158"; remap_status "full_contig";
chr11   HAVANA  CDS 2182015 2182201 .   -   0   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160785-2160971"; remap_status "full_contig";
chr11   HAVANA  start_codon 2182199 2182201 .   -   0   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160969-2160971"; remap_status "full_contig";
chr11   HAVANA  exon    2181013 2181102 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 2; exon_id "ENSE00001623769.1_1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2159783-2159872"; remap_status "full_contig";
chr11   HAVANA  CDS 2181013 2181102 .   -   2   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 2; exon_id "ENSE00001623769.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2159783-2159872"; remap_status "full_contig";
chr11   HAVANA  UTR 2182202 2182388 .   -   .   gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160972-2161158"; remap_status "full_contig";

If you look at the transcript sequence in ensembl (http://grch37.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000254647;r=11:2181013-2182388;t=ENST00000421783 ), there does not appear to be an in-frame stop codon. Is this truly a protein-coding transcript?

In general, it also seems like there aren't stop codon features given for a large proportion of transcripts in the gtf file. How can you determine the stop codon positions for these sequences without having to search through nucleotide sequence for each transcript?

gtf gencode ensembl stop codon • 4.0k views

Any specific reason you are still using GRCh37? In both GRCh38 and GRCh37 this transcript is annotated as having an incomplete 3' CDS.


This was just one example - but this means the annotation is incomplete, right? And this seems to be the case for a lot of transcripts. Why do these annotations end up incomplete, and why so often? Is resolving a stop codon fairly difficult?


Since the transcript has been retained over time, there must be enough evidence of its presence, but clearly the full sequence is lacking. That may be the case with many rare/alternate transcripts.


You're quite right, genomax. The 'CDS 3' incomplete' flag in the transcript table is also present in Ensembl GRCh37 and indicates that this information is missing. There is also protein evidence for this transcript from UniProtKB, as you can see in the transcript table. You can look at the 'Supporting Evidence' section in the transcript tab to see what evidence has been used to support the transcript structure. So there is evidence of a protein product, but the cDNA or EST evidence is not present to support the full length of the non-coding sections of the transcript.
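The same flag is visible in the GTF itself: the records in the question carry `tag "cds_end_NF"` and `tag "mRNA_end_NF"`. A minimal sketch of reading those tags from the attribute column follows. It assumes tab-separated GENCODE-style lines and the tag names `cds_start_NF` / `cds_end_NF`; `parse_attributes` and `incomplete_ends` are hypothetical helper names, not part of any library.

```python
import re

# One GTF attribute: a key followed by a quoted string or a bare value, then ';'.
_ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_attributes(column9):
    """Return {key: [values]} so that repeated keys such as `tag` are all kept."""
    attrs = {}
    for key, quoted, bare in _ATTR_RE.findall(column9):
        attrs.setdefault(key, []).append(quoted if quoted else bare)
    return attrs


def incomplete_ends(gtf_line):
    """Report whether a GTF record is tagged as having an unconfirmed CDS start or end."""
    fields = gtf_line.rstrip("\n").split("\t")
    tags = set(parse_attributes(fields[8]).get("tag", []))
    return {
        "cds_start_incomplete": "cds_start_NF" in tags,  # assumed GENCODE tag name
        "cds_end_incomplete": "cds_end_NF" in tags,
    }


# Shortened version of the INS transcript record from the question.
example = "\t".join([
    "chr11", "HAVANA", "transcript", "2181013", "2182388", ".", "-", ".",
    'gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; '
    'level 2; tag "mRNA_end_NF"; tag "cds_end_NF";',
])
print(incomplete_ends(example))
```

A transcript tagged this way has no stop codon to find in the annotated sequence, so a missing stop_codon feature is expected rather than an error.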

7.1 years ago

By definition, the start and stop codons should be included in features named CDS:

CDS: A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

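Hence, under that definition, the end coordinate of a CDS should indicate the last base of the stop codon. Note, however, that the GTF convention followed by GENCODE and Ensembl excludes the stop codon from the CDS features and reports it as a separate stop_codon feature.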

In practice, and surprisingly enough, there are inconsistencies, and not all data sources obey the standard when naming the features. The easiest way to check is to load your GFF into IGV and then visually verify where the start/stop codons are (use the Show Translations feature on the sequence track).
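If you would rather check this programmatically than by eye, a rough sketch is to walk the spliced transcript sequence codon by codon from the annotated start codon and report the first in-frame stop. It assumes you already have the spliced sequence in transcript orientation (reverse-complemented for minus-strand genes) and the standard nuclear genetic code; `first_in_frame_stop` is a hypothetical helper name.

```python
# Stop codons of the standard genetic code (assumption: not mitochondrial).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(transcript_seq, cds_start=0):
    """Return the 0-based offset of the first in-frame stop codon, or None.

    transcript_seq: spliced transcript sequence, 5'->3' in transcript orientation.
    cds_start: 0-based offset of the first base of the start codon.
    """
    seq = transcript_seq.upper()
    # Step through complete codons only, staying in the frame set by cds_start.
    for i in range(cds_start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Toy sequences, not real transcripts.
print(first_in_frame_stop("ATGAAATGACC"))   # stop found in frame
print(first_in_frame_stop("ATGATAGTT"))     # TAG is present but out of frame
```

A result of `None` means the reading frame runs off the end of the annotated sequence, which is what you would expect for a transcript with an incomplete 3' CDS.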


I'm curious why so many protein-coding transcripts are poorly annotated (i.e. lack an annotated start or stop codon). Around 1/3 of the transcripts described as protein coding in the GENCODE release 27 GTF file lack a start codon feature, a stop codon feature, or both. Why are these features so inconsistently available?
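One way to put numbers on this is to count, per protein-coding transcript, which codon features are missing and how many of those cases carry an incomplete-CDS tag. The sketch below is only illustrative: it assumes tab-separated GENCODE-style lines with a `transcript_type` attribute (Ensembl GTFs use `transcript_biotype` instead) and the tag names `cds_start_NF` / `cds_end_NF`; `summarize_codon_features` is a hypothetical helper name.

```python
import re
from collections import Counter, defaultdict


def summarize_codon_features(gtf_lines):
    """Count protein-coding transcripts lacking start_codon/stop_codon features."""
    features = defaultdict(set)  # transcript_id -> feature types seen (column 3)
    tags = defaultdict(set)      # transcript_id -> tag values seen
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attr = fields[8]
        # Gene-level lines have no transcript_type and are skipped here.
        if 'transcript_type "protein_coding"' not in attr:
            continue
        match = re.search(r'transcript_id "([^"]+)"', attr)
        if not match:
            continue
        tid = match.group(1)
        features[tid].add(fields[2])
        tags[tid].update(re.findall(r'tag "([^"]+)"', attr))

    counts = Counter()
    for tid, feats in features.items():
        counts["protein_coding_transcripts"] += 1
        no_start = "start_codon" not in feats
        no_stop = "stop_codon" not in feats
        counts["missing_start_codon"] += no_start
        counts["missing_stop_codon"] += no_stop
        # Missing features that are explained by an incomplete-CDS tag.
        counts["missing_start_with_cds_start_NF"] += no_start and "cds_start_NF" in tags[tid]
        counts["missing_stop_with_cds_end_NF"] += no_stop and "cds_end_NF" in tags[tid]
    return counts
```

For a real file this could be called as `summarize_codon_features(open(path))`, where `path` points at an uncompressed GTF. If most of the missing features coincide with the `_NF` tags, the gaps reflect transcripts whose CDS ends could not be confirmed rather than inconsistent annotation.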
