Question

Ensembl Perl API translateable_seq Returns Sequences That Aren't Multiples of 3 Nucleotides Long

14.2 years ago

Jeff Hussmann ▴ 120

I am using Ensembl's Perl API to retrieve (nucleotide) coding sequences from the Ensembl databases. The relevant part of my Perl code is

my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
my $genes = $gene_adaptor->fetch_all();
for my $gene (@$genes) {
    my $transcript = $gene->canonical_transcript;
    if ($transcript->translation) {
        my $sequence = $transcript->translateable_seq;
        # Sanity check 1: the coding sequence should consist of whole codons.
        unless (length($sequence) % 3 == 0) {
            print $gene->stable_id . " translateable sequence not divisible by 3\n";
            next;
        }
        # Sanity check 2: only unambiguous nucleotides are expected.
        if ($sequence =~ /[^TCAG]/) {
            print $gene->stable_id . " translateable sequence has a non-TCAG character\n";
            next;
        }
        # ENS is an output filehandle, assumed to be opened earlier in the script.
        print ENS $gene->stable_id . "\t" . $transcript->stable_id . "\n";
        print ENS $sequence . "\n";
        $succeeded++;
    }
}

The two sanity checks on the sequences returned by translateable_seq (that they are a multiple of 3 nucleotides long and contain no non-TCAG characters) are each triggered a substantial number of times when the script is run on the human or mouse genomes. If I correctly understand what translateable_seq claims to return, this should not be the case. Is there a problem with my understanding, or is there a problem with Ensembl's API and/or databases?

ensembl api perl • 4.6k views

Updated 14.2 years ago by Ian Longden • written 14.2 years ago by Jeff Hussmann ▴ 120

Which gene is it?

14.2 years ago by Pierre Lindenbaum 165k

Many, many genes - ~10000 out of the ~30000 for mouse, for example.

14.2 years ago by Jeff Hussmann ▴ 120

Can you give one example, please?

14.2 years ago by Pierre Lindenbaum 165k

ENSMUSG00000064363, canonical transcript ENSMUST00000082414, returns a translateable_seq that is 1378 nt long.

14.2 years ago by Jeff Hussmann ▴ 120

Ram · Answer 1 · 2011-01-24

Hi Jeff,

A portion of Ensembl genes and transcripts come from manual annotation (by the VEGA/Havana project). The manual annotators use EST and cDNA evidence to determine the transcript set, and they do annotate partial codons. Basically, they aim for the longest sequence the evidence supports: if the cDNA is incomplete, they annotate the transcript through a partial codon rather than leaving that codon off altogether. It sounds like most of your examples are manual annotations by Havana. The Ensembl automatic annotation pipeline does not allow partial codons, so transcripts it produces will not have them.

As for the documentation saying "all defined RNA edits": these are edits such as selenocysteines and other non-standard amino acids. They do not include extending a partial codon to three nucleotides.

The particular mRNA being discussed comes from the mouse mitochondrial genome (mitochondrial genes are also manually annotated):

http://www.ensembl.org/Mus_musculus/Transcript/Exons?db=core;g=ENSMUSG00000064363;r=MT:10167-11544;t=ENSMUST00000082414

The partial codon comes straight from the original record pointed out by Pierre:

http://www.ncbi.nlm.nih.gov/nuccore/34538597

Although two A nucleotides are expected to complete the last codon (and form a TAA stop codon), Ensembl can only show the first T. The reason is that Ensembl translates cDNAs directly off the genome, and the genome shows that the last T of the coding sequence is followed by a G and a T, not two As. You can see this in the Transcript/Exons view (link above).
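To see exactly what is left over at the end of such a sequence, the coding sequence can be split into its complete codons plus any trailing partial codon. The following is only a minimal sketch in plain Perl: split_codons is a hypothetical helper (it is not part of the Ensembl API) and the input is a made-up toy sequence rather than a real transcript.

```perl
use strict;
use warnings;

# Hypothetical helper (not an Ensembl API method): split a coding sequence
# into its complete codons and whatever nucleotides are left over at the 3' end.
sub split_codons {
    my ($seq) = @_;
    my $leftover = length($seq) % 3;                       # 0, 1 or 2 trailing nt
    my $complete = substr($seq, 0, length($seq) - $leftover);
    my $partial  = substr($seq, length($seq) - $leftover); # '' when nothing is left
    my @codons   = $complete =~ /(.{3})/g;                 # non-overlapping triplets
    return (\@codons, $partial);
}

# Toy sequence: ATG GCC AAA followed by a lone T, mimicking an incomplete stop codon.
my ($codons, $partial) = split_codons('ATGGCCAAAT');
printf "%d complete codons, trailing partial codon '%s'\n",
    scalar @$codons, $partial;
```

In the loop from the question, the string returned by translateable_seq could be passed to such a helper instead of the toy sequence. For the example transcript the arithmetic is 1378 = 3 × 459 + 1, so a single leftover nucleotide is expected, consistent with the lone T described above.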

I hope this helps.

By the way, questions like this can be sent to helpdesk@ensembl.org, or you could consider joining the dev mailing list for discussion.

There is quite a lot of API discussion on the dev list.

score 1 · Answer 2 · 2011-01-22

14.2 years ago

Pierre Lindenbaum 165k

In the case of ENSMUSG00000064363, as far as I can see, the Ensembl transcript corresponds to NP_904337 (459 aa), whose record states that

"TAA stop codon is completed by the addition of 3' A residues to the mRNA"

(See also the sequence of the mitochondrial genome, which carries the same comment: http://www.ncbi.nlm.nih.gov/nuccore/34538597.)

I don't know why there is an error in GenBank (a sequencing error?), but the Ensembl API just used this information without modifying it.

The translateable_seq documentation says that it applies "all defined RNA edits" to sequences before returning them - so essentially in the case of this (randomly chosen and of no special interest to me) gene there appears to be a known RNA edit that hasn't made its way into the EnsEMBL database. Do I conclude that there are 10,000 other cases of this and lose faith in the EnsEMBL database's ability to get these sequences right?

14.2 years ago by Jeff Hussmann ▴ 120

Ram · Answer 3 · 2011-01-24

In addition to the answer already given:

If your script runs very slowly, you may want to change it in one of the following ways.

The script in the original question uses a lot of memory because all the genes are loaded at once, and the transcripts and sequences fetched for each gene are then kept for the rest of the script. To reduce the memory overhead, do either of the following:

1)

while (my $gene = shift @$genes) {
    ...
}
#
# Each gene is removed from the array as it is processed, so the gene, its
# transcript and its sequence can be freed after each iteration.
#

2)

my $gene_ids = $gene_adaptor->list_dbIDs();
foreach my $gene_id (@$gene_ids) {
    my $gene = $gene_adaptor->fetch_by_dbID($gene_id);
    ...
}
#
# Only one gene object exists at once, and we just have an array of the internal
# identifiers.
#

-Ian.