Question

How to get the protein sequence of a transcript variant (Ensembl Perl API)?

9.5 years ago

nathan.alary • 0

Hello,

I'm using the Ensembl Perl API to collect information about a set of genes (ENSG IDs), all the transcripts from those genes (ENST IDs), and all the variants from those transcripts (rs/cos/... IDs).

I'm currently looking for an efficient way to get the protein sequence translated from each transcript variant (variations included), i.e. one protein sequence per transcript variation.

Since I couldn't find a direct way to get either the whole protein sequence or the nucleotide sequence of a transcript variation (to translate it myself; maybe I missed something), my idea was to fetch a Slice between the start and end positions of each transcript, then translate the returned sequence according to the location and type of each transcript variation. But I realized that handling frameshift variants and some other variation cases this way is tedious.

That's why I'm looking for a more direct and efficient way to get those protein sequences.

If someone has the answer or any suggestion, I would greatly appreciate it.

Regards,
Nathan

Ensembl Perl-API • 2.4k views

updated 22 months ago by Ram 44k • written 9.5 years ago by nathan.alary • 0

Ram · Answer 1 · 2016-07-08

8.4 years ago

Emily 24k

There's no single API call that does what you need, but there are a couple of VEP plugins that you can cannibalise. The ProteinSeqs plugin creates a new protein sequence based on a missense variant, while the Downstream plugin gives you the sequence downstream of a frameshift.

updated 22 months ago by Ram 44k • written 8.4 years ago by Emily 24k

Ram · Answer 2 · 2016-07-07

Hi Nathan, I'm also looking for a function in the Ensembl Perl API that performs this task directly, but in the meantime I'm using the following functions:

# param 0 -> TranscriptVariationAllele object.
# Returns the coding sequence of the transcript with the mutation applied,
# or undef if the variant has no CDS coordinates.
sub get_variation_cds_seq {
    my ($tva) = @_;
    my $tv = $tva->transcript_variation;
    # cds_start/cds_end are undefined when the variant does not map to the
    # coding sequence; bail out instead of doing arithmetic on undef.
    if (!defined($tv->cds_start) || !defined($tv->cds_end)) {
        warn "ERROR: no CDS coordinates for " . $tv->variation_feature->variation_name . " in " . $tva->transcript->display_id . "\n";
        return;
    }
    # translateable_seq returns the coding part of the transcript
    # (introns and the 5' and 3' UTRs are removed).
    my $seq = $tva->transcript->translateable_seq;
    # Convert the 1-based CDS coordinates to 0-based string offsets.
    my $variation_start = $tv->cds_start - 1;
    my $variation_end   = $tv->cds_end - 1;
    # For a deletion, feature_seq is '-', so we use '' instead
    # to build the final sequence.
    my $feature_seq = $tva->feature_seq eq "-" ? "" : $tva->feature_seq;
    substr($seq, $variation_start, $variation_end - $variation_start + 1) = $feature_seq;

    return $seq;
}

# param 0 -> TranscriptVariationAllele object.
# Returns the cDNA sequence of the transcript with the mutation applied,
# including the 5' and 3' UTRs, or undef if the variant has no cDNA coordinates.
sub get_variation_cdna_seq {
    my ($tva) = @_;
    my $tv = $tva->transcript_variation;
    # cdna_start/cdna_end are undefined when the variant does not map to the
    # spliced transcript; bail out instead of doing arithmetic on undef.
    if (!defined($tv->cdna_start) || !defined($tv->cdna_end)) {
        warn "ERROR: no cDNA coordinates for " . $tv->variation_feature->variation_name . " in " . $tva->transcript->display_id . "\n";
        return;
    }
    # seq contains the 5' and 3' UTRs.
    my $seq = $tva->transcript->seq->seq;
    # Convert the 1-based cDNA coordinates (UTRs included) to 0-based string offsets.
    my $variation_start = $tv->cdna_start - 1;
    my $variation_end   = $tv->cdna_end - 1;
    # For a deletion, feature_seq is '-', so we use '' instead
    # to build the final sequence.
    my $feature_seq = $tva->feature_seq eq "-" ? "" : $tva->feature_seq;
    substr($seq, $variation_start, $variation_end - $variation_start + 1) = $feature_seq;

    return $seq;
}
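
Both functions above return nucleotide sequences, so getting the protein still needs a translation step. Here is a minimal sketch of that step in plain Perl. translate_cds is a hypothetical helper of my own, not part of the Ensembl API, and it assumes the standard genetic code (so it would be wrong for mitochondrial transcripts, for example). It stops at the first stop codon, which also copes with frameshifts, because the mutated CDS is simply read in its new frame until a stop appears or the sequence runs out.

```perl
use strict;
use warnings;

# Standard genetic code, built from the conventional T, C, A, G ordering.
# ASSUMPTION: standard code only; other codon tables are not handled.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Hypothetical helper (not an Ensembl API call): translate a coding
# sequence up to, but not including, the first stop codon. Trailing bases
# that do not fill a codon are ignored; unknown codons (e.g. containing N)
# are translated as 'X'.
sub translate_cds {
    my ($cds) = @_;
    $cds = uc $cds;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($cds); $pos += 3) {
        my $aa = $codon_table{ substr($cds, $pos, 3) } // 'X';
        last if $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

# Toy example: a 1-base deletion at CDS position 4 shifts the frame.
my $ref = 'ATGGCCAAATTTTAA';
my $mut = $ref;
substr($mut, 3, 1) = '';
print translate_cds($ref), "\n";    # MAKF
print translate_cds($mut), "\n";    # MPNF
```

With the real data you would pass in the string returned by get_variation_cds_seq after checking that it is defined. Since the Ensembl API already depends on BioPerl, the translate method of a Bio::Seq object may be a better choice than a hand-rolled codon table.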

Suggestions would be appreciated. It would be great if this functionality were implemented properly in the Perl API, because my functions sometimes fail when the start or end of the variation is not defined.

Regards,
Fran.