Question

How To Retrieve A Protein Sequence Given An Ensembl Gene Id Using Perl

1

13.6 years ago

Saad Murtaza Khan ▴ 80

Hi, I have a list of Ensembl gene IDs and I need to get their corresponding protein sequences using Perl. Kindly suggest how to achieve this using the Ensembl API.

perl ensembl homework • 11k views

ADD COMMENT • link updated 11.2 years ago by Kanhu charan Moharana ▴ 10 • written 13.6 years ago by Saad Murtaza Khan ▴ 80

7

Answered largely here.

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.6 years ago by Michael Kuhn 5.0k

3

Have you tried anything so far? Is there a specific thing you are stuck on?

ADD REPLY • link 13.6 years ago by Simon Cockell 7.4k

2

I suggest reading the documentation and trying the examples. Then ask again if you have difficulty.

ADD REPLY • link 13.6 years ago by Neilfws 49k

0

I suggest reading the documentation and trying the examples ;-)

ADD REPLY • link 13.6 years ago by Neilfws 49k

0

I can say from my own experience with Ensembl that this task isn't as easy as it is normally perceived to be. A gene ID can be linked to many transcripts, and each coding transcript is linked to its own protein ID. However, many of those protein IDs represent exactly the same protein, because the transcripts differ only outside the coding sequence. Many genes also don't have the is_canonical attribute set, so you cannot always rely on it to pick one transcript per gene. So I suggest stating your question more precisely: do you want every protein per gene, only distinct sequences, or one representative? Otherwise, go to BioMart, paste your gene ID list as a filter and download the data as a CSV file.

ADD REPLY • link 13.6 years ago by Jarretinha 3.4k
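To illustrate the point about several protein IDs carrying the same sequence, here is a minimal sketch of how identical peptides could be collapsed once they have been fetched. The helper name `group_by_peptide` and the IDs and sequences in the example are made up for illustration; it assumes you already have [protein_id, peptide] pairs from whatever retrieval method you use.

```perl
use strict;
use warnings;

# Collapse [protein_id, peptide] pairs that share an identical peptide.
# Returns one [peptide, [protein_ids...]] entry per distinct sequence,
# in order of first appearance.
sub group_by_peptide {
    my @records = @_;
    my ( %ids_for, @order );
    for my $record (@records) {
        my ( $protein_id, $peptide ) = @$record;
        push @order, $peptide unless exists $ids_for{$peptide};
        push @{ $ids_for{$peptide} }, $protein_id;
    }
    return map { [ $_, $ids_for{$_} ] } @order;
}

# Hypothetical input: three translations, two of which encode the same peptide.
my @groups = group_by_peptide(
    [ 'ENSP_A', 'MKVLA' ],
    [ 'ENSP_B', 'MKVLA' ],
    [ 'ENSP_C', 'MKT' ],
);
for my $group (@groups) {
    my ( $peptide, $ids ) = @$group;
    print join( ',', @$ids ), "\t$peptide\n";
}
```

Whether you then keep all IDs per sequence or just the first one depends on what "the protein of a gene" should mean for your analysis.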

score 6 · Answer 1 · 2011-04-04

It's clearly stated in the Ensembl core API tutorial that you can get the protein sequence from a Transcript object:

Translation objects and protein sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object not a Translation object as might be expected. The following example obtains the protein sequence of a Transcript and the Translation's stable identifier

my $stable_id = 'ENST00000044768';

my $transcript_adaptor =
  $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);

print $transcript->translation()->stable_id(), "\n";
print $transcript->translate()->seq(),         "\n";

Is it that hard to go through the documentation?
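The tutorial snippet above starts from a transcript ID and assumes `$registry` is already set up, while the question starts from gene IDs. A minimal sketch of the extra steps, assuming the public server `ensembldb.ensembl.org` with the `anonymous` user, human genes, and an installed Ensembl Perl API; the sub names `protein_records_for_gene` and `run` are my own and not part of the API:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return one [translation_id, peptide] pair per coding transcript of a gene.
# $gene is expected to behave like a Bio::EnsEMBL::Gene object.
sub protein_records_for_gene {
    my ($gene) = @_;
    my @records;
    for my $transcript ( @{ $gene->get_all_Transcripts() } ) {
        # Non-coding transcripts have no Translation object; skip them.
        my $translation = $transcript->translation() or next;
        push @records,
            [ $translation->stable_id(), $transcript->translate()->seq() ];
    }
    return @records;
}

# Connect to the public Ensembl server and print FASTA for each gene ID.
sub run {
    my @gene_ids = @_;
    require Bio::EnsEMBL::Registry;    # loaded lazily; needs the Ensembl Perl API
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',    # assumption: public MySQL server
        -user => 'anonymous',
    );
    my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

    for my $gene_id (@gene_ids) {
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
        if ( !$gene ) {
            warn "No gene found for $gene_id\n";
            next;
        }
        for my $record ( protein_records_for_gene($gene) ) {
            my ( $protein_id, $peptide ) = @$record;
            print ">$protein_id gene=$gene_id\n$peptide\n";
        }
    }
    return;
}

# Usage: perl SCRIPT_NAME.pl GENE_ID [GENE_ID ...]
run(@ARGV) if @ARGV;
```

This prints every translation of every coding transcript, so a gene with several transcripts yields several records, some of which may be identical.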

score 3 · Answer 2 · 2011-04-05

3

13.6 years ago

Giulietta - Ensembl Helpdesk ★ 1.2k

This is in the Ensembl documentation, as has been pointed out. You say you need to go through the Perl API, but this would actually be easier in BioMart. If that's an option for you, watch this tutorial video.

There is a BioMart web interface you can use. Filters would be your IDs, and Attributes would be the sequences page, protein sequences.

ADD COMMENT • link 13.6 years ago by Giulietta - Ensembl Helpdesk ★ 1.2k

0

BioMart works well! Additionally, you can use an R package called biomaRt to achieve this!

ADD REPLY • link 13.6 years ago by ct586 • 0

Neilfws · Answer 3 · 2011-04-04

0

13.6 years ago

Panagiotis Alexiou ▴ 220

I believe you could use the Ensembl API, which is provided by Ensembl and can be found on their site. It allows Perl programs to access their database.

If your question is more specific than that, it would help to know the details.

ADD COMMENT • link updated 13.6 years ago by Neilfws 49k • written 13.6 years ago by Panagiotis Alexiou ▴ 220

score 0 · Answer 4 · 2013-09-05

Here is a Perl script, using the LWP::Simple module, to retrieve any kind of sequence linked to an Ensembl transcript ID. It worked for me; I hope others can use it with simple modifications.

Usage: perl SCRIPT_NAME.pl FILE_CONTAINING_ENSEMBL_ID

You can edit the script to fetch specific sequence types, such as CDS, cDNA, peptide, exons or introns.

+++++++++++++++++++++PERL CODE+++++++++++++++++++++

### Script to retrieve Ensembl sequences using Ensembl transcript IDs

use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;


my $input_file=shift|| die "Insufficient Parameters!!!\n Usage: perl $0 <FILE_CONTAINING_ENSEMBL_IDS>\n File must have one id per line.\n";

open(my $in, '<', $input_file) or die "$! $input_file\n";
my @inputs = grep { /\S/ } <$in>;    ## ignore blank lines
close $in;
s/\s+\z// for @inputs;               ## strip line endings and trailing whitespace
print STDERR "You have entered ".scalar(@inputs)." IDs\n\n";
my $ensmbl_ids=join "\t",@inputs;    ## tab-separated ID list for the UniProt query
#print "$ensmbl_ids\n";

my $flank3_display=0;            ## 3' flanking sequence (downstream)
my $flank5_display=0;            ## 5' flanking sequence (upstream)
my $strand='strand';             ## strand parameter as sent to Ensembl; 1 = forward, -1 = reverse
my $output='fasta';              ## output format; alternatives: bed, csv, tab, gtf, gff, gff3, embl, genbank

my $fasta_genomic='off';         ## alternatives: unmasked, soft_masked, hard_masked, 5_flanking, 3_flanking, 5_3_flanking

########################EDIT TYPE OF SEQUENCE TO FETCH######################################
#use 0 to turn a sequence type off and 1 to turn it on; by default all are on
my $cdna='1';
my $coding='1';
my $peptide='1';
my $utr5='1';
my $utr3='1';
my $exon='1';
my $intron='1';
#############################################################################################




#===================UNIPROT BOT=====================source: UniProt site
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';

my $params = {
  to => 'ACC',
  from => 'ENSEMBL_TRS_ID',                    
  format => 'tab',
  query =>  $ensmbl_ids,
};

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/", $params);

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}


my %ensemle_id_acc_id;            ## maps Ensembl transcript ID => UniProt accession
if ($response->is_success) {
    foreach my $line (split /\n/, $response->content) {
        my ($k, $v) = split /\s+/, $line;
        $ensemle_id_acc_id{$k} = $v if $k ne 'From';    ## skip the header row
    }
}
else {
    die 'Failed, got ' . $response->status_line . ' for ' . $response->request->uri . "\n";
}


foreach my $ensembl_id (sort keys %ensemle_id_acc_id)
{
    print "#Ensembl_ID=$ensembl_id\tUniprot_ACC_ID: $ensemle_id_acc_id{$ensembl_id}\n";
    my $uniprot_url='http://www.uniprot.org/uniprot/'.$ensemle_id_acc_id{$ensembl_id}.'.txt';
    my $content_uniprot = get $uniprot_url;        ## undef if the download fails

    ## The OS line of the UniProt flat file holds the organism name.
    ## Anchoring at the line start avoids matching "os" inside other fields.
    if(defined $content_uniprot && $content_uniprot=~ m/^OS\s+(\S+)\s+(\S+)/m) {
        my $org_code=lc($1)."_".lc($2);            ## genus_species, as used in Ensembl URLs
        print "#Uniprot Organism: $1 $2\n";

        ## constructing ensembl URL
        my $ensembl_url='http://www.ensembl.org/'.$org_code.'/Export/Output/Transcript?db=core;'.'flank3_display='.$flank3_display.';flank5_display='.$flank5_display.';output='.$output.';strand='.$strand.';t='.$ensembl_id.';';

        $ensembl_url.="param=cdna;" if $cdna;
        $ensembl_url.="param=coding;" if $coding;
        $ensembl_url.="param=peptide;" if $peptide;
        $ensembl_url.="param=utr5;" if $utr5;
        $ensembl_url.="param=utr3;" if $utr3;
        $ensembl_url.="param=exon;" if $exon;
        $ensembl_url.="param=intron;" if $intron;

        $ensembl_url.='genomic='.$fasta_genomic.';_format=Text';
        #print "$ensembl_url\n";
        my $content_ensembl_seq = get $ensembl_url;
        if(defined $content_ensembl_seq) {
            print "$content_ensembl_seq\n";
        }
        else {
            print STDERR "Could not fetch sequence for $ensembl_id\n";
        }
    }
    else {
        print "!!!ORG CODE ERROR!!! : $ensembl_id\n";
    }
    print "//\n";
}
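
The script above depends on UniProt and Ensembl web URLs as they existed in 2013, and those may have changed since. As a hedged alternative, here is a minimal sketch that asks the Ensembl REST service for the sequence directly. It assumes the `sequence/id` endpoint at rest.ensembl.org with a `type` parameter, and that HTTP::Tiny has TLS support available (IO::Socket::SSL installed). The helper names `sequence_url` and `fetch_fasta` are mine, not part of any library.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the REST URL for one stable ID; $type is e.g. 'protein', 'cds' or 'cdna'.
sub sequence_url {
    my ( $id, $type ) = @_;
    return "https://rest.ensembl.org/sequence/id/$id?type=$type";
}

# Fetch the sequence as FASTA text; returns undef if the request fails.
sub fetch_fasta {
    my ( $id, $type ) = @_;
    my $response = HTTP::Tiny->new->get(
        sequence_url( $id, $type ),
        { headers => { 'Content-Type' => 'text/x-fasta' } },
    );
    return $response->{success} ? $response->{content} : undef;
}

# Usage: perl SCRIPT_NAME.pl TRANSCRIPT_ID [TRANSCRIPT_ID ...]
for my $id (@ARGV) {
    my $fasta = fetch_fasta( $id, 'protein' );
    if ( defined $fasta ) {
        print $fasta;
    }
    else {
        warn "No protein sequence returned for $id\n";
    }
}
```

Changing the second argument of `fetch_fasta` to 'cds' or 'cdna' should return the other sequence types the original script offered, without the UniProt lookup for the organism name.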