From Ensembl protein ID to sequence
9.6 years ago
Joseph Hughes ★ 3.0k

Hi,

I have a list of protein IDs from Ensembl:

ENSMUSP00000137272
ENSMUSP00000137602
ENSRNOP00000057248
ENSRNOP00000057253
ENSMICP00000006596
ENSMICP00000013787
ENSTBEP00000002813
ENSTBEP00000003741
ENSTBEP00000004212

From Uniprot, you can do it using the following URL:

http://www.uniprot.org/uniprot/?query=ENSMUSP00000137272&format=fasta

Is there a way to retrieve the corresponding protein sequences from Ensembl without knowing which species they come from? Or using Biomart?

Thanks

ensembl protein-sequence • 3.6k views

The simplest way might be to work out which species they come from. That is fairly easy, because the prefix of these IDs has the form ENS<Species Code>P<ID> (where P stands for protein). So you can tokenize your list, find out which species it contains, and then use BioMart to download the sequences for each of those species.

Some examples are:

  • MUS = mouse
  • RNO = Rat
  • TBE = Tupaia belangeri (Tree Shrew)

You can find the information here.
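
The tokenizing step could be sketched in Python roughly as follows. This is only an illustration: the regular expression is an assumption based on the ENS<Species Code>P<ID> pattern described above, the function names are hypothetical, and the lookup table holds only the three codes listed in this answer.

```python
import re
from collections import defaultdict

# Assumed pattern: ENS + optional species code + P + digits.
# An ID without a species code (ENSP...) yields an empty string.
PROTEIN_ID = re.compile(r"^ENS([A-Z]*)P(\d+)$")

# Only the codes named in this thread; extend it from the Ensembl prefix list.
SPECIES = {
    "MUS": "mouse",
    "RNO": "rat",
    "TBE": "Tupaia belangeri (tree shrew)",
}


def species_code(protein_id):
    """Return the species code embedded in a protein ID, or None if it does not match."""
    match = PROTEIN_ID.match(protein_id.strip())
    return match.group(1) if match else None


def group_by_species(protein_ids):
    """Group protein IDs by species code so each group can go to BioMart separately."""
    groups = defaultdict(list)
    for pid in protein_ids:
        groups[species_code(pid)].append(pid.strip())
    return dict(groups)
```

Calling group_by_species on the list from the question would give one entry per species code (MUS, RNO, MIC, TBE); codes missing from SPECIES still need to be looked up by hand.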

9.6 years ago
Emily 24k

Have you tried the REST API? The GET sequence/id endpoint pulls out a sequence with just the ID. For example, http://rest.ensembl.org/sequence/id/ENSMUSP00000137272?content-type=text/x-fasta;type=protein.
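
For a whole list of IDs, the same request could be scripted. The sketch below uses only the Python standard library and the URL pattern from the example above; the https base URL, the function names, and the timeout value are assumptions for illustration, not part of the original answer.

```python
import urllib.parse
import urllib.request

# Assumed base URL; the example above uses the http:// form of the same host.
REST_SERVER = "https://rest.ensembl.org"


def sequence_url(protein_id, server=REST_SERVER):
    """Build the GET sequence/id URL that asks for a protein sequence in FASTA format."""
    quoted = urllib.parse.quote(protein_id.strip(), safe="")
    return f"{server}/sequence/id/{quoted}?content-type=text/x-fasta;type=protein"


def fetch_protein_fasta(protein_id, timeout=30):
    """Download one FASTA record; urlopen raises HTTPError if the server rejects the ID."""
    with urllib.request.urlopen(sequence_url(protein_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Looping fetch_protein_fasta over the IDs and writing each result to a file would cover the list in the question without needing to know the species. For long lists it is probably wise to pause between requests, since the public server limits request rates.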

9.6 years ago
Tariq Daouda ▴ 220

pyGeno is also your friend. It does not require access to a REST API, so it is more reliable and faster if you have a lot of proteins.

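from pyGeno.Genome import *
# Note: 'GRCh37.75' is a human genome, while ENSMUSP IDs are mouse proteins;
# load the genome you have imported for the species the ID belongs to.
ref = Genome(name = 'GRCh37.75')
prot = ref.get(Protein, id = 'ENSMUSP00000137272')[0]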

And you also get all the information supplied by Ensembl for free: prot.gene.biotype, prot.transcript.sequence, prot.transcript.exons etc.

9.6 years ago
Joseph Hughes ★ 3.0k

Actually, once you know which adaptor to use, it is quite simple. Here's a Perl script that does it for an input text file with one protein identifier per line:

#!/usr/bin/env perl

########################################################################### 
# script to download all the protein sequences from a list of identifiers

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;
printf( "The API version used is %s\n", software_version() );

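my $list = $ARGV[0] or die "Usage: $0 <file_with_protein_ids>\n";
print "Parsing IDs from $list\n";
open(my $list_fh, '<', $list) or die "Can't open $list: $!\n";
my @IDs;
while (my $line = <$list_fh>) {
  chomp($line);
  next unless $line =~ /\S/;   # skip blank lines
  push(@IDs, $line);
}
close($list_fh);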

# Load the registry automatically
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host=>'ensembldb.ensembl.org',
  -user=>'anonymous', 
);

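my $out = "${list}_out.fa";
open(my $prot_fh, '>', $out) or die "Can't open $out: $!\n";

# Fetch the adaptor once; it is the same for every ID
my $seqmember_adaptor = $registry->get_adaptor('Multi', 'compara', 'SeqMember');

foreach my $ID (@IDs) {
  # fetch a SeqMember; skip the ID if nothing comes back for it
  my $seqmember = $seqmember_adaptor->fetch_by_stable_id($ID);
  unless (defined $seqmember) {
    warn "No sequence found for $ID\n";
    next;
  }
  print $prot_fh ">$ID\n", $seqmember->sequence(), "\n";
}
close($prot_fh);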