I'm trying to determine how to find the stop codon position for a protein coding transcript if there is no stop codon feature listed for it in a gencode gtf file.
For example, an insulin transcript (ENST00000421783.1) is listed as a protein coding transcript in the gencode GRCh37 gtf, and has start codon, CDS, exon features listed, but no stop codon:
chr11 HAVANA transcript 2181013 2182388 . - . gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr11 HAVANA exon 2182015 2182388 . - . gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1_1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160785-2161158"; remap_status "full_contig";
chr11 HAVANA CDS 2182015 2182201 . - 0 gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160785-2160971"; remap_status "full_contig";
chr11 HAVANA start_codon 2182199 2182201 . - 0 gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160969-2160971"; remap_status "full_contig";
chr11 HAVANA exon 2181013 2181102 . - . gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 2; exon_id "ENSE00001623769.1_1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2159783-2159872"; remap_status "full_contig";
chr11 HAVANA CDS 2181013 2181102 . - 2 gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 2; exon_id "ENSE00001623769.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2159783-2159872"; remap_status "full_contig";
chr11 HAVANA UTR 2182202 2182388 . - . gene_id "ENSG00000254647.6_3"; transcript_id "ENST00000421783.1_2"; gene_type "protein_coding"; gene_name "INS"; transcript_type "protein_coding"; transcript_name "INS-005"; exon_number 1; exon_id "ENSE00001725765.1"; level 2; protein_id "ENSP00000408400.1"; transcript_support_level 2; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000009558.9_3"; havana_transcript "OTTHUMT00000315845.2_2"; remap_original_location "chr11:-:2160972-2161158"; remap_status "full_contig";
If you look at the transcript sequence in Ensembl (http://grch37.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000254647;r=11:2181013-2182388;t=ENST00000421783), there does not appear to be an in-frame stop codon. Is this truly a protein-coding transcript?
In general, it also seems like there aren't stop codon features given for a large proportion of transcripts in the gtf file. How can you determine the stop codon positions for these sequences without having to search through nucleotide sequence for each transcript?
Any specific reason you are still using GRCh37? In both GRCh38 and GRCh37 this transcript is annotated as having an "incomplete 3' CDS".
This was just one example - but this means the annotation is incomplete, right? And this seems to be the case for a lot of transcripts. Why do these annotations end up incomplete, and why so often? Is resolving a stop codon fairly difficult?
Since the transcript has been retained over time, there must be enough evidence of its presence, but clearly the full sequence is lacking. That may be the case with many rare/alternate transcripts.
You're quite right genomax, and the 'CDS 3'incomplete' flag in the transcript table is also present in Ensembl GRCh37 and indicates that this information is missing. There is also protein evidence for this transcript from UniProtKB as you can see in the transcript table. You can look at the 'Supporting Evidence' section in the transcript tab to see what evidence has been used to support the transcript structure. Therefore there is evidence of a protein product, but the cDNA or EST evidence is not present to support the full length of the non-coding sections of the transcript.
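To see how widespread this is in a given GTF without searching any nucleotide sequence, one option is to tally, per transcript, whether a stop_codon feature exists and whether the cds_end_NF tag (visible in the records quoted in the question) is set. The following is only a rough sketch, not an official GENCODE tool: the function name summarize_cds_ends and the last_cds_base field are made up for illustration, and it assumes a standard tab-delimited GTF with attributes in the ninth column. Note that last_cds_base is just the 3'-most annotated coding base; for a cds_end_NF transcript it is where the annotation stops, not a stop codon position.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
TAGS = re.compile(r'tag "([^"]+)"')


def summarize_cds_ends(gtf_lines):
    """Summarize, per transcript_id, stop codon and CDS-end information.

    Hypothetical helper: returns a dict mapping transcript_id to a dict with
    strand, has_stop_codon, cds_end_NF and last_cds_base (the 3'-most
    annotated coding base, or None if the transcript has no CDS records).
    """
    summary = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        feature, strand, attributes = fields[2], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])

        match = TRANSCRIPT_ID.search(attributes)
        if match is None:
            continue  # records without a transcript_id are skipped
        record = summary.setdefault(
            match.group(1),
            {
                "strand": strand,
                "has_stop_codon": False,
                "cds_end_NF": False,
                "last_cds_base": None,
            },
        )

        if "cds_end_NF" in TAGS.findall(attributes):
            record["cds_end_NF"] = True

        if feature == "stop_codon":
            record["has_stop_codon"] = True
        elif feature == "CDS":
            # The 3' end of the CDS is the lowest coordinate on the minus
            # strand and the highest coordinate on the plus strand.
            edge = start if strand == "-" else end
            previous = record["last_cds_base"]
            if previous is None:
                record["last_cds_base"] = edge
            elif strand == "-":
                record["last_cds_base"] = min(previous, edge)
            else:
                record["last_cds_base"] = max(previous, edge)
    return summary
```

Passing an open file handle for the GTF to summarize_cds_ends and filtering for entries where has_stop_codon is False should show whether those transcripts are mostly the ones tagged cds_end_NF, i.e. cases where the stop codon is simply not annotated rather than something that can be derived from the file.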