I have completed a blastx run on my samples and have obtained the following result (example):
$ head blastx_result.txt
NS500162:172:HG5CJBGXX:1:11101:2522 ZWIP2_ARATH 52.500 40 19 0 2 121 25 64 8.26e-07 44.3
I would like to take the ACC_ID, in this case ZWIP2_ARATH, and find its taxonomic information. I have many thousands of such ACC_IDs to map to taxonomy. This website: Programmatic access - Mapping database identifiers explains, with example code, how to use Retrieve/ID mapping programmatically.
On the web page, you enter your ACC_ID, choose From: UniProtKB AC/ID and To: UniProtKB under 'Select options', and it returns a downloadable file with columns such as EntryName, ProteinName, Organism, etc.
I am trying to obtain the exact same result programmatically (from the command line), because my number of ACC_IDs is well above the website limit. Even if I split my ACC_IDs into many small files, I would have hundreds of files to submit manually, which isn't practical.
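For context, the `ACC_*` list files could be produced from the blastx output along these lines. This is only a sketch: the function name `extract_ids`, the default chunk size of 500, and the assumption that the subject ID sits in the second whitespace-separated column are illustrative choices, not anything prescribed by UniProt.

```shell
#!/bin/sh
# Hypothetical helper: pull the subject ID (column 2) out of tabular blastx
# output, de-duplicate it, and split the list into chunk files.
extract_ids() {
    infile=$1          # tabular blastx output
    chunk=${2:-500}    # IDs per chunk file (assumed value, adjust as needed)
    prefix=${3:-ACC_}  # chunk files are named ACC_aa, ACC_ab, ...
    awk '{ print $2 }' "$infile" | sort -u | split -l "$chunk" - "$prefix"
}
```

Calling `extract_ids blastx_result.txt 500 ACC_` in the working directory would leave one ID per line in each chunk file, which a `glob 'ACC_*'` can then pick up.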
Here is my script:
use strict;
use warnings 'all';
use LWP::UserAgent;

my @files = glob 'ACC_*'; # Files containing lists of UniProt IDs.
my $base = 'http://www.uniprot.org';
my $tool = 'uploadlists';
my $contact = ''; # Please set your email address here
                  # to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

for my $file ( @files ) {
    # Upload one ID list per request.
    my $response = $agent->post(
        "$base/$tool/",
        Content_Type => 'form-data',
        Content => [
            file    => [ $file ],
            format  => 'tab',
            from    => 'ACC+ID',
            to      => 'ID',
            columns => 'entryname,proteinname,genename,organism',
        ],
    );

    # Wait and re-fetch for as long as the server sends a Retry-After header.
    while ( my $wait = $response->header('Retry-After') ) {
        print STDERR "Waiting ($wait)...\n";
        sleep $wait;
        $response = $agent->get($response->base);
    }

    if ( $response->is_success ) {
        print $response->content;
    }
    else {
        # Status line first, then the URI, to match the message wording.
        die sprintf "Failed. Got %s for %s\n",
            $response->status_line,
            $response->request->uri;
    }
}
According to the link above (Mapping database identifiers), the code for ACC_ID is ACC+ID, and that is what I put in the 'from' field in the code above. However, the 'to' field on the Retrieve/ID mapping website is UniProtKB, for which there is no corresponding code in the Mapping database identifiers link. So I put in 'ID', and all this code does is return two columns, both containing my ACC_IDs side by side.
Any suggestions or clues on what I may be missing to get this to return actual taxonomy information? I'm hoping my final result would look something like this:
Entry Entry name Protein names Gene names Organism
Q9SVY1 ZWIP2_ARATH Zinc finger protein WIP-domain protein 2 Arabidopsis thaliana
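Once a table like this exists, attaching the organism to each blastx hit could look like the sketch below. It assumes both files are tab-separated, that the mapping table has a header line with the entry name in column 2 and the organism in column 5 (as in the layout above), and that the blastx subject ID is in column 2. The function name `annotate_hits` and the `NA` placeholder are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append the organism to every blastx hit by looking up
# the subject ID (column 2) in a mapping table keyed by entry name.
annotate_hits() {
    mapping=$1   # tab-separated: Entry, Entry name, Protein names, Gene names, Organism
    hits=$2      # tab-separated blastx output, subject ID in column 2
    awk -F '\t' -v OFS='\t' '
        # First file (mapping): skip the header, remember organism per entry name.
        NR == FNR { if (FNR > 1) org[$2] = $5; next }
        # Second file (hits): print the hit plus its organism, or NA if unmapped.
        { print $0, (($2 in org) ? org[$2] : "NA") }
    ' "$mapping" "$hits"
}
```

For example, `annotate_hits mapping.tsv blastx_result.txt > blastx_with_taxonomy.txt`, where both file names are placeholders.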
Thank you TONS!
Added comment: I have asked a similar question here: Retrieving Taxonomy from Uniprot/Swissprot ACC_ID From Blastx Results. Its answer shows how to retrieve taxonomic information for one ACC_ID at a time.
Please validate (green mark on the left) or comment on your previous question: Retrieving Taxonomy from Uniprot/Swissprot ACC_ID From Blastx Results
Pierre: My apologies. I have just added the link to that particular question to this question.