Retrieve Amino Acid Sequence From mRNA Accession Number
12.2 years ago
prajwalnj ▴ 10

Hello,

I have a set of 500 mRNA accession numbers

NM_000247
NM_000500
NM_000694
NM_000947
...

and I would like to retrieve the corresponding amino acid (aa) sequences, for example:

>NM_000247 
MGLGPVFLLLAGIFPFAPPGAAAEPHSLRYNLTVLSWDGSVQSG
FLTEVHLDGQPFLRCDRQKCRAKPQGQWAEDVLGNKTWDRETRDLTGNGKDLRMTLAH
IKDQKEGLHSLQEIRVCEIHEDNSTRSSQHFYYDGELFLSQNLETKEWTMPQSSRAQT
LAMNVRNFLKEDAMKTKTHYHAMHADCLQELRRYLKSGVVLRRTVPPMVNVTRSEASE....

I used this script:

use LWP::Simple;
use URI::URL;

if(@ARGV != 3) {
    print "Usage: perl test.pl < database > < id > < your e-mail >\n";
    exit(0);
}

$database = $ARGV[0];

$id = $ARGV[1];
$email = $ARGV[2];
$address = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

$parameter = {"db" => $database,
            "id" => $id,
            "retmode" => "text",
            "rettype" => "gp",
            "email" => $email};

$url = url($address);
$url->query_form($parameter);

$result = get($url);
print $result;

But this works for only one ID at a time and returns a lot more information than I need. How can I upload a list of IDs, retrieve only the aa sequences, and store the results in a file?
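Ideally I'd like something along these lines, building on the script above. This is just a rough, untested sketch; it assumes the accessions are stored one per line in a file called ids.txt, and that efetch accepts rettype=fasta_cds_aa to return the translated CDS as FASTA (both are assumptions on my part):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use URI::URL;

# Accession numbers, one per line, in ids.txt (hypothetical file name).
open(my $in, '<', 'ids.txt') or die "Cannot open ids.txt: $!";
chomp(my @ids = grep { /\S/ } <$in>);
close $in;

my $efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

open(my $out, '>', 'proteins.fa') or die "Cannot open proteins.fa: $!";

# Send the accessions in batches of 100 per request instead of one at a time.
while (my @batch = splice(@ids, 0, 100)) {
    my $url = url($efetch);
    $url->query_form({
        db      => 'nuccore',
        id      => join(',', @batch),
        rettype => 'fasta_cds_aa',   # assumed rettype for "CDS protein FASTA"
        retmode => 'text',
        email   => 'mymail@foo.bar',
    });
    my $result = get($url);
    print $out $result if defined $result;
    sleep 1;   # be polite to the NCBI servers
}

close $out;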

Thank you in advance,

Prajwal

12.2 years ago

Try using XSLT to extract the protein sequence from the GenBank XML record:

$ echo -e "NM_000500\nNM_000694\nNM_000947" | while read ACN ; do curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${ACN}&retmode=xml" | xsltproc --novalid stylesheet.xsl -  ; done
>NP_000491.4
MLLLGLLLLLPLLAGARLLWNWWKLRSLHLPPLAPGFLHLLQPDLPIYLLGLTQKFGPIYRLHLGLQDVVVLNSKRTIEEAMVKKWADFAGRPEPLTYKLVSRNYPDLSLGDYSLLWKAHKKLTRSALLLGIRDSMEPVVEQLTQEFCERMRAQPGTPVAIEEEFSLLTCSIICYLTFGDKIKDDNLMPAYYKCIQEVLKTWSHWSIQIVDVIPFLRFFPNPGLRRLKQAIEKRDHIVEMQLRQHKESLVAGQWRDMMDYMLQGVAQPSMEEGSGQLLEGHVHMAAVDLLIGGTETTANTLSWAVVFLLHHPEIQQRLQEELDHELGPGASSSRVPYKDRARLPLLNATIAEVLRLRPVVPLALPHRTTRPSSISGYDIPEGTVIIPNLQGAHLDETVWERPHEFWPDRFLEPGKNSRALAFGCGARVCLGEPLARLELFVVLTRLLQAFTLLPSGDALPSLQPLPHCSVILKMQPFQVRLQPRGMGAHSPGQSQ
>NP_000685.1
MDPLGDTLRRLREAFHAGRTRPAEFRAAQLQGLGRFLQENKQLLHDALAQDLHKSAFESEVSEVAISQGEVTLALRNLRAWMKDERVPKNLATQLDSAFIRKEPFGLVLIIAPWNYPLNLTLVPLVGALAAGNCVVLKPSEISKNVEKILAEVLPQYVDQSCFAVVLGGPQETGQLLEHRFDYIFFTGSPRVGKIVMTAAAKHLTPVTLELGGKNPCYVDDNCDPQTVANRVAWFRYFNAGQTCVAPDYVLCSPEMQERLLPALQSTITRFYGDDPQSSPNLGRIINQKQFQRLRALLGCGRVAIGGQSDESDRYIAPTVLVDVQEMEPVMQEEIFGPILPIVNVQSLDEAIEFINRREKPLALYAFSNSSQVVKRVLTQTSSGGFCGNDGFMHMTLASLPFGGVGASGMGRYHGKFSFDTFSHHRACLLRSPGMEKLNALRYPPQSPRRLRMLLVAMEAQGCSCTLL
>NP_000938.2
MEFSGRKWRKLRLAGDQRNASYPHCLQFYLQPPSENISLIEFENLAIDRVKLLKSVENLGVSYVKGTEQYQSKLESELRKLKFSYRENLEDEYEPRRRDHISHFILRLAYCQSEELRRWFIQQEMDLLRFRFSILPKDKIQDFLKDSQLQFEAISDEEKTLREQEIVASSPSLSGLKLGFESIYKIPFADALDLFRGRKVYLEDGFAYVPLKDIVAIILNEFRAKLSKALALTARSLPAVQSDERLQPLLNHLSHSYTGQDYSTQGNVGKISLDQIDLLSTKSFPPCMRQLHKALRENHHLRHGGRMQYGLFLKGIGLTLEQALQFWKQEFIKGKMDPDKFDKGYSYNIRHSFGKEGKRTDYTPFSCLKIILSNPPSQGDYHGCPFRHSDPELLKQKLQSYKISPGGISQILDLVKGTHYQVACQKYFEMIHNVDDCGFSLNHPNQFFCESQRILNGGKDIKKEPIQPETPQPKPSVQKTKDASSALASLNSSLEMDMEGLEDYFSEDS

with stylesheet.xsl:

<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>


<xsl:output method="text" encoding="UTF-8"/>


<xsl:template match="/">
<xsl:for-each select="//GBQualifier[GBQualifier_name='translation'][GBQualifier_value]">
<xsl:text>></xsl:text>
<xsl:choose>
  <xsl:when test="../GBQualifier[GBQualifier_name='protein_id']">
    <xsl:value-of select="../GBQualifier[GBQualifier_name='protein_id']/GBQualifier_value"/>
  </xsl:when>
  <xsl:when test="../GBQualifier[GBQualifier_name='product']">
    <xsl:value-of select="../GBQualifier[GBQualifier_name='product']/GBQualifier_value"/>
  </xsl:when>
  <xsl:otherwise>
    <xsl:value-of select="generate-id(.)"/>
  </xsl:otherwise>
</xsl:choose>
<xsl:text>
</xsl:text>
<xsl:value-of select="GBQualifier_value"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>


</xsl:stylesheet>
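If xsltproc is not available, essentially the same extraction can be scripted in Perl with XML::LibXML, applying the same XPath to the efetch XML. This is only a sketch; it assumes the XML::LibXML and LWP::Simple modules are installed and that the accessions are listed one per line in a file called ids.txt:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use XML::LibXML;

# Accession numbers, one per line, in ids.txt (hypothetical file name).
open(my $in, '<', 'ids.txt') or die "Cannot open ids.txt: $!";
chomp(my @acns = grep { /\S/ } <$in>);
close $in;

for my $acn (@acns) {
    my $xml = get("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                . "?db=nucleotide&id=$acn&retmode=xml");
    next unless defined $xml;
    my $dom = XML::LibXML->load_xml(string => $xml);
    # Same idea as the stylesheet: grab every 'translation' qualifier in the record.
    for my $q ($dom->findnodes(q{//GBQualifier[GBQualifier_name='translation']})) {
        my $id = $q->findvalue(q{../GBQualifier[GBQualifier_name='protein_id']/GBQualifier_value})
                 || $acn;
        print ">", $id, "\n", $q->findvalue('GBQualifier_value'), "\n";
    }
    sleep 1;   # be polite to the NCBI servers
}

Redirect stdout to a file to keep the sequences, exactly as you would with the xsltproc loop above.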
Thanks a lot Pierre, this has worked for me!

12.2 years ago
shane.neeley ▴ 50

I know your question is already answered, but I thought I would post this script for others who are interested in doing iterative sequence retrievals. You can try this BioPerl E-utilities script. This example searches the protein database for 'crab' and saves all the sequences it finds.

########## <http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook> #########

#!/usr/bin/perl -w

BEGIN {push @INC,"path/to/BioPerl";}
use Bio::DB::EUtilities;
# set optional history queue
my $factory = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -email      => 'mymail@foo.bar',
                                       -db         => 'protein',
                                       -term       => 'crab',
                                       -usehistory => 'y');

my $count = $factory->get_count;
# get history from queue
my $hist  = $factory->next_History || die 'No history data returned';
print "History returned\n";
# note db carries over from above
$factory->set_parameters(-eutil   => 'efetch',
                         -rettype => 'fasta',
                         -history => $hist);

my $retry = 0;
my ($retmax, $retstart) = (500,0);

open (my $out, '>', 'lots_of_crab_sequences.fa') || die "Can't open file:$!";

RETRIEVE_SEQS:
while ($retstart < $count) {
    $factory->set_parameters(-retmax   => $retmax,
                             -retstart => $retstart);
    eval{
        $factory->get_Response(-cb => sub {my ($data) = @_; print $out $data} );
    };
    if ($@) {
        die "Server error: $@.  Try again later" if $retry == 5;
        print STDERR "Server error, redo #$retry\n";
        $retry++;
        redo RETRIEVE_SEQS;   # retry this batch instead of skipping it
    }
    #say "Retrieved $retstart";
    $retstart += $retmax;
}

close $out;
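For the original question (a fixed list of mRNA accessions rather than a search term) the same module can also fetch by ID directly. A minimal sketch, assuming the accessions sit one per line in ids.txt and that NCBI's fasta_cds_aa rettype returns the CDS translations:

#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::EUtilities;

# Accession numbers, one per line, in ids.txt (hypothetical file name).
open(my $in, '<', 'ids.txt') or die "Can't open ids.txt: $!";
chomp(my @ids = grep { /\S/ } <$in>);
close $in;

my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -email   => 'mymail@foo.bar',
                                       -db      => 'nuccore',
                                       -id      => \@ids,
                                       -rettype => 'fasta_cds_aa');  # assumed: "CDS protein FASTA"

# Dump the HTTP response straight into a file instead of keeping it in memory.
$factory->get_Response(-file => 'aa_sequences.fa');

With around 500 accessions a single request URL can get long, so it may be safer to split the list into batches, or to go through epost and the history server as the esearch example above does.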