Question

From Ensembl protein ID to sequence

9.6 years ago

Joseph Hughes ★ 3.0k

Hi,

I have a list of protein IDs from ensembl:

ENSMUSP00000137272
ENSMUSP00000137602
ENSRNOP00000057248
ENSRNOP00000057253
ENSMICP00000006596
ENSMICP00000013787
ENSTBEP00000002813
ENSTBEP00000003741
ENSTBEP00000004212

From Uniprot, you can do it using the following URL:

http://www.uniprot.org/uniprot/?query=ENSMUSP00000137272&format=fasta

Is there a way to retrieve the corresponding protein sequences from Ensembl without knowing which species they come from? Or using Biomart?

Thanks

ensembl protein-sequence • 3.6k views

The simplest approach is probably to work out which species each ID comes from. That is easy, because Ensembl protein IDs follow the pattern ENS<Species Code>P<ID> (P stands for protein; human IDs omit the species code). Split each ID in your list on that pattern, collect the species codes that occur, and then use BioMart to download the sequences for each corresponding species.

Some examples are:

MUS = mouse
RNO = Rat
TBE = Tupaia belangeri (Tree Shrew)

You can find the information here.
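
The tokenising step described above could be sketched in Python roughly as follows. This is only an illustration: the function name group_by_species and the regular expression are my own assumptions, and the lookup table holds only the three codes listed in this comment, so any other code (MIC, for instance) is reported as "unknown".

```python
import re
from collections import defaultdict

# Only the codes listed in the comment above; extend this from the Ensembl species list.
SPECIES_CODES = {
    "MUS": "mouse",
    "RNO": "rat",
    "TBE": "Tupaia belangeri (tree shrew)",
}

# ENS + optional species code + P (protein) + numeric part.
# Assumed pattern for illustration; human IDs (ENSP...) give an empty code.
PROTEIN_ID = re.compile(r"^ENS([A-Z]*)P(\d+)$")


def group_by_species(protein_ids):
    """Group Ensembl protein IDs by the species code embedded in each ID."""
    groups = defaultdict(list)
    for raw_id in protein_ids:
        protein_id = raw_id.strip()
        match = PROTEIN_ID.match(protein_id)
        if match is None:
            raise ValueError(f"Not an Ensembl protein ID: {protein_id!r}")
        groups[match.group(1)].append(protein_id)
    return dict(groups)


def species_name(code):
    """Look up a species code, falling back to 'unknown' for unlisted codes."""
    return SPECIES_CODES.get(code, "unknown")
```

Calling group_by_species on the list from the question gives one group per species code, which tells you which BioMart datasets to query.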

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Sam ★ 4.8k

Ram · Answer 1 · 2015-04-09

9.6 years ago

Emily 24k

Have you tried the REST API? The GET sequence/id endpoint returns a sequence from the ID alone, so you do not need to know the species. For example:

http://rest.ensembl.org/sequence/id/ENSMUSP00000137272?content-type=text/x-fasta;type=protein
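
For a whole list of IDs, the same endpoint can be called in a loop. Here is a minimal Python sketch using only the standard library. The helper names (sequence_url, fetch_fasta, fetch_all) are made up for illustration, the URL has the same form as the example above but over https, and the error handling is deliberately crude.

```python
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def sequence_url(protein_id):
    """Build the sequence/id URL in the same form as the example above."""
    return (f"{REST_SERVER}/sequence/id/{protein_id}"
            "?content-type=text/x-fasta;type=protein")


def fetch_fasta(protein_id, timeout=30):
    """Download one protein record as FASTA text (needs network access)."""
    with urllib.request.urlopen(sequence_url(protein_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_all(protein_ids, fetch=fetch_fasta):
    """Return {id: FASTA text}; IDs that fail to download map to None."""
    records = {}
    for protein_id in protein_ids:
        try:
            records[protein_id] = fetch(protein_id)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            records[protein_id] = None
    return records

# Usage (performs real HTTP requests):
#   records = fetch_all(["ENSMUSP00000137272", "ENSRNOP00000057248"])
```

The fetch argument is only there so the loop can be tried out with a stand-in function, without hitting the server. For long lists, it is worth pausing between requests to stay within the server's rate limits.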

Ram · Answer 2 · 2015-04-09

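pyGeno is also your friend. It works from a local copy of the data instead of a REST API, so it is more reliable and faster if you have a lot of proteins. The genome you load has to match the species of the ID.

from pyGeno.Genome import *
# GRCh37.75 is a human assembly; a mouse ID such as ENSMUSP00000137272
# needs the matching mouse genome loaded instead.
ref = Genome(name = 'GRCh37.75')
# get() returns a list of matches; take the first one
prot = ref.get(Protein, id = 'ENSMUSP00000137272')[0]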

And you also get all the information supplied by Ensembl for free: prot.gene.biotype, prot.transcript.sequence, prot.transcript.exons etc.

Ram · Answer 3 · 2015-04-09

Actually, once you know which adaptor to use, it is quite simple. Here's a Perl script that does it for an input text file with one protein identifier per line:

#!/usr/bin/env perl

########################################################################### 
# script to download all the protein sequences from a list of identifiers

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;
printf( "The API version used is %s\n", software_version() );

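my $list = $ARGV[0] or die "Usage: $0 <file with one protein ID per line>\n";
print "Parsing IDs from $list\n";
# Three-argument open with a lexical filehandle
open(my $list_fh, '<', $list) or die "Can't open $list: $!\n";
my @IDs;
while (my $line = <$list_fh>) {
  chomp($line);
  next unless $line =~ /\S/;   # skip blank lines
  push(@IDs, $line);
}
close($list_fh);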

# Load the registry automatically
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host=>'ensembldb.ensembl.org',
  -user=>'anonymous', 
);

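my $out_file = "${list}_out.fa";
open(my $prot_fh, '>', $out_file) or die "Can't open $out_file: $!\n";

# Fetch the adaptor once, outside the loop
my $seqmember_adaptor = $registry->get_adaptor('Multi', 'compara', 'SeqMember');

foreach my $ID (@IDs) {
  # fetch a Member; skip IDs for which nothing comes back
  my $seqmember = $seqmember_adaptor->fetch_by_stable_id($ID);
  unless (defined $seqmember) {
    warn "No sequence found for $ID\n";
    next;
  }
  print {$prot_fh} ">$ID\n", $seqmember->sequence(), "\n";
}
close($prot_fh);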