Using UniProt's Retrieve/ID mapping service programmatically
7.6 years ago
ladypurrsia ▴ 60

I have just completed a blastx run on my samples and have obtained the following result (example):

$head blastx_result.txt

NS500162:172:HG5CJBGXX:1:11101:25222 Y052L_FRG3G 52.500 40 19 0 2 121 25 64 8.26e-07 44.3

The second column holds a UniProtKB AC/ID that I need to convert to its respective KO number. I am aware of the Retrieve/ID mapping tool, where I can manually select From: UniProtKB AC/ID, To: KO and get the KO associated with an ID. The tool also accepts an uploaded text file of AC/IDs, but with a limit of 100,000 IDs per upload. I have 16 files with several million AC/IDs each that need converting to KOs. Splitting these 16 files into 100k-ID pieces gives me over 2,000 files to submit manually, which is overwhelming and impractical.
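Since the same AC/ID usually appears on many blastx lines, it may help to reduce each result file to its distinct second-column values before mapping anything. A minimal Python sketch, assuming the blastx output is whitespace- or tab-delimited as in the example line above; the function name is illustrative:

```python
def unique_subject_ids(lines):
    """Return the distinct second-column values, in first-seen order."""
    seen = {}
    for line in lines:
        fields = line.split()  # splits on tabs or runs of spaces
        if len(fields) >= 2:
            seen.setdefault(fields[1], None)  # dicts keep insertion order
    return list(seen)


# The example line from the post, repeated to show de-duplication.
example = [
    "NS500162:172:HG5CJBGXX:1:11101:25222 Y052L_FRG3G 52.500 40 19 0 2 121 25 64 8.26e-07 44.3",
    "NS500162:172:HG5CJBGXX:1:11101:25222 Y052L_FRG3G 52.500 40 19 0 2 121 25 64 8.26e-07 44.3",
]
ids = unique_subject_ids(example)
```

For a real file, the function can be given an open file handle instead of a list, for example `unique_subject_ids(open("blastx_result.txt"))`.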

UniProt also has a help page, "How can I access resources on this web site programmatically?", with sample scripts for accessing the site programmatically. I am not a coder, but I chose the Perl script they provide (under "Mapping database identifiers" on that page) to attempt the ID conversion. Here is the bit of code I am trying to work with:

$cat uniprot.py

import urllib,urllib2
url = 'http://www.uniprot.org/uploadlists/'
params = {
'from':'ACC+ID',
'to':'KO_ID',
'format':'tab',
'query':'052L_FRG3G 14332_ORYSJ 1A111_ARATH 1A13_SOLLC 1A16_ARATH'
 }
data = urllib.urlencode(params)
request = urllib2.Request(url, data)
contact = "myemail@gmail.com" # Please set your email address here to help us debug in case of problems.
request.add_header('User-Agent', 'Python %s' % contact)
response = urllib2.urlopen(request)
page = response.read(200000)

Under 'query' I used test AC/IDs that I know give back KO numbers. However, running this script in my terminal:

perl ./uniprot.py

produced zero results.

My inquiry is this:

1) What am I doing wrong with this code?

2) How can I put in a .txt with millions of AC/IDs (one for each line) within this code so that it returns the KO numbers for those IDs?

A million thanks!

blast uniprot KEGG Retrieve/ID mapping • 7.8k views

UniProt provides ID mappings in a single text file (you can download it from here: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/).
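With the mapping file downloaded, the lookup can be done locally without any upload limit. The sketch below is only that: it assumes the file is `idmapping.dat.gz` from that directory, that each line is `accession<TAB>ID type<TAB>value`, that all rows for one accession are contiguous, and that entry names and KOs carry the type labels `UniProtKB-ID` and `KO`. The function name and the sample rows are made up for illustration.

```python
from itertools import groupby


def ko_for_entry_names(lines, wanted):
    """Map each wanted entry name (e.g. 'Y052L_FRG3G') to its list of KO IDs."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    rows = (row for row in rows if len(row) == 3)  # skip malformed lines
    result = {}
    # Process one accession at a time so memory use stays small.
    for _accession, group in groupby(rows, key=lambda row: row[0]):
        name, kos = None, []
        for _, id_type, value in group:
            if id_type == "UniProtKB-ID":
                name = value
            elif id_type == "KO":
                kos.append(value)
        if name in wanted and kos:
            result[name] = kos
    return result


# Placeholder rows in the assumed layout (not real UniProt records).
sample = [
    "P00001\tUniProtKB-ID\tABC_HUMAN\n",
    "P00001\tKO\tK00001\n",
    "P00002\tUniProtKB-ID\tDEF_HUMAN\n",
    "P00002\tGeneID\t123\n",
]
mapping = ko_for_entry_names(sample, {"ABC_HUMAN", "DEF_HUMAN"})

# On the real file (assumed name), stream it instead of loading it:
#     import gzip
#     with gzip.open("idmapping.dat.gz", "rt") as handle:
#         mapping = ko_for_entry_names(handle, wanted_ids)
```

Entries with no KO row are simply absent from the result, which mirrors how the web tool omits unmapped IDs.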

7.6 years ago

Hello (are you happy with the reply you obtained from the UniProt helpdesk?)

It seems you used the Python script rather than the Perl one.
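The pasted uniprot.py is Python 2, so it has to be run with python rather than perl, and as written it reads the response into `page` without ever printing it. For comparison, a Python 3 sketch of the same request could look like this. The URL and parameters are copied from the post; `build_mapping_request` is a hypothetical helper name, and whether the endpoint still answers is not something this sketch checks.

```python
import urllib.parse
import urllib.request

URL = "http://www.uniprot.org/uploadlists/"  # endpoint as given in the post


def build_mapping_request(ids, contact=""):
    """Build (but do not send) the POST request for an AC/ID -> KO mapping."""
    params = {
        "from": "ACC+ID",
        "to": "KO_ID",
        "format": "tab",
        "query": " ".join(ids),
    }
    # In Python 3 the POST body must be bytes, not str.
    data = urllib.parse.urlencode(params).encode("ascii")
    request = urllib.request.Request(URL, data)
    request.add_header("User-Agent", "Python %s" % contact)
    return request


request = build_mapping_request(["052L_FRG3G", "14332_ORYSJ"])

# To send it and show the result (needs network access):
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode())
```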

Here is the Perl script, modified for your use case:

use strict;
use warnings;
use LWP::UserAgent;

my $base = 'http://www.uniprot.org';
my $tool = 'uploadlists';

my $params = {
  from   => 'ACC+ID',
  to     => 'KO_ID',
  format => 'tab',
  query  => '052L_FRG3G 14332_ORYSJ 1A111_ARATH 1A13_SOLLC 1A16_ARATH'
};

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/", $params);

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
    die 'Failed, got ' . $response->status_line .
        ' for ' . $response->request->uri . "\n";

and here is what it returns

perl ./idmapping_to_ko.pl
From    To
052L_FRG3G  K12408
14332_ORYSJ K06630
1A111_ARATH K01762
1A13_SOLLC  K01762
1A16_ARATH  K20772

This code example takes a file as input:

http://www.uniprot.org/help/programmatic_access#batch_retrieval_of_entries

And you can actually also use the batch retrieval service programmatically, and have it return tab-separated output with a KO column:

use strict;
use warnings;
use LWP::UserAgent;

my $list = $ARGV[0]; # File containing list of UniProt identifiers.

my $base = 'http://www.uniprot.org';
my $tool = 'uploadlists';

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/",
      [ 'file' => [$list],
        'format' => 'tab',
        'from' => 'ACC+ID',
        'to' => 'ACC',
        'columns' => 'id,database(ko)',
      ],
      'Content_Type' => 'form-data');

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
    die 'Failed, got ' . $response->status_line .
        ' for ' . $response->request->uri . "\n";

Dear Elisabeth:

You are a savior!! Yes, I'm poring over that e-mail too. Insanely informative.

This script worked for a small subset. For some reason, if I have a file that has > 40,000 blastx Identifiers, it kicks back this error:

Failed, got 500 Server closed connection without sending any data back for http://www.uniprot.org/uploadlists/

So I have to split the $list input file for this script into chunks of ~20k identifiers. No problem, but I want to avoid editing the script for each file name (and getting carpal tunnel after a week, because I would have to type > 2,000 file names into the my $list = line). I have dedicated my day to figuring out a way to have this script read all of the split files at once (kept in one directory) and name each output differently once it's done finding the KOs. Something like spl_1KO.txt, spl_2KO.txt (so I can cat them all later by name and have one complete set for that sample).

I have found numerous websites on looping perl scripts, and have imagined how far this computer can be thrown out the window today by trying multiple subsets of these functions:

for f in file_*.txt; do script.pl "$f" > "${f/file_/output_}";
my @files = <*.txt>; for $file (@files) {

but each of these returns word salad or errors, or runs indefinitely. In my infinite Perl naivety, I cannot figure out how to do this. Any help with this dark rabbit hole of a paradox I'm in would be super duper awesome!!

Sincerely, Joany
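One way to avoid typing file names at all is to let a small driver do the splitting and naming. The Python sketch below only plans the batches: it cuts a list of identifiers into chunks of 20,000 and pairs each chunk with an output name in the spl_1KO.txt, spl_2KO.txt pattern described above. The function names are illustrative, and how each chunk is then submitted (for example by writing it to a file and calling the Perl script on it) is left open.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def output_name(index):
    """Name for the index-th result file, e.g. spl_1KO.txt."""
    return "spl_%dKO.txt" % index


def plan_batches(ids, size=20000):
    """Pair every chunk of identifiers with the file its KOs should go to."""
    return [
        (output_name(number), chunk)
        for number, chunk in enumerate(chunked(ids, size), start=1)
    ]


# Tiny demonstration with five placeholder IDs and a chunk size of two.
plan = plan_batches(["id1", "id2", "id3", "id4", "id5"], size=2)
for name, chunk in plan:
    print(name, len(chunk))
```

As for the shell attempt quoted above, a bash for loop needs a closing `done` after the command, and script.pl has to be either executable or invoked as `perl script.pl`; without the `done` the shell keeps waiting for more input, which can look like it is running indefinitely.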
