I am using Ensembl's Perl API to retrieve (nucleotide) coding sequences from the Ensembl databases. The relevant part of my Perl code is:
my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
my $genes = $gene_adaptor->fetch_all();
for my $gene (@$genes) {
    my $transcript = $gene->canonical_transcript;
    # Only transcripts with a translation are protein-coding.
    if ($transcript->translation) {
        my $sequence = $transcript->translateable_seq;
        # Sanity check 1: the sequence should consist of whole codons.
        unless (length($sequence) % 3 == 0) {
            print $gene->stable_id . " translateable sequence not divisible by 3\n";
            next;
        }
        # Sanity check 2: the sequence should contain only T, C, A and G.
        if ($sequence =~ /[^TCAG]/) {
            print $gene->stable_id . " translateable sequence has a non-TCAG character\n";
            next;
        }
        # Write the gene and transcript IDs, then the sequence already fetched above.
        print ENS $gene->stable_id . "\t" . $transcript->stable_id . "\n";
        print ENS $sequence . "\n";
        $succeeded++;
    }
}
The two sanity checks on the sequences returned by translateable_seq (that each be a multiple of 3 nucleotides long and contain no non-TCAG characters) are each triggered a substantial number of times when the script is run on the human or mouse genomes. If I understand correctly what translateable_seq claims to return, this should not be the case. Is there a problem with my understanding, or is there a problem with Ensembl's API and/or databases?
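To narrow this down, the two checks above could report why a sequence fails rather than only that it fails. The following is a minimal sketch, not part of the Ensembl API: `diagnose_cds` is a hypothetical helper that works on a plain string, and the input is a made-up toy sequence rather than real Ensembl output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): describe how a
# sequence fails the two sanity checks instead of only flagging it.
sub diagnose_cds {
    my ($sequence) = @_;
    my %report = (
        length    => length($sequence),
        remainder => length($sequence) % 3,   # 0 means whole codons only
        bad_chars => {},                       # non-TCAG character => count
    );
    # Count every character outside T, C, A, G (for example N).
    $report{bad_chars}{$1}++ while $sequence =~ /([^TCAG])/g;
    return \%report;
}

# Toy input, made up for illustration: 7 nt long with one ambiguous base.
my $report = diagnose_cds("ATGNCCT");
my $bad = join(",",
    map { $_ . ":" . $report->{bad_chars}{$_} }
    sort keys %{ $report->{bad_chars} });
printf "length=%d remainder=%d bad=%s\n",
    $report->{length}, $report->{remainder}, $bad;
```

For the toy input this prints `length=7 remainder=1 bad=N:1`. In the loop above, the same helper could be called on `$sequence` so that each warning line also carries the remainder and the offending characters, which makes it easier to see whether the failures share a pattern.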
Which gene is it?
Many, many genes - ~10000 out of the ~30000 for mouse, for example.
Can you give one example, please?
ENSMUSG00000064363, canonical transcript ENSMUST00000082414, returns a translateable_seq that is 1378 nt long.