I am trying to download FASTA sequences for a list of protein GIs (~100,000). I planned to first upload the list of GIs with EPost over HTTP POST and then use EFetch to download the FASTA.
When I upload the list of GIs with EPost using HTTP POST, I get no response from the server (WebEnv and query_key are not generated).
The code that I am using for EPost is:
#!/usr/bin/perl
use LWP::Simple;
use LWP::UserAgent;
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "epost.fcgi";
$url_params = "db=protein&id=830003112&id=830003110&id=830003108&id=830003106&id=830003104&id=830003102&id=830003100&id=830003098&id=830003096&id=830003094&id=830003092&id=830003090";
#create HTTP user agent
$ua = new LWP::UserAgent;
#create HTTP request object
$req = new HTTP::Request POST => "$url";
$req->content_type('application/x-www-form-urlencoded');
$req->content("$url_params");
#post the HTTP request
$response = $ua->request($req);
print $response->content;
This prints nothing after the code is run. Ideally it should print the WebEnv and the query_key. The HTTP status is OK and the code is 200.
If I change url_params and remove all GIs from it:
$url_params = "db=protein";
I get the following output:
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
<ERROR>Empty ID list; Nothing to store</ERROR>
</ePostResult>
I have no idea what the problem is and why the server isn't generating WebEnv and query_key.
If anyone knows the solution please help me out.
The comma-separated list is for EPost using HTTP GET, which has a limit of 200 GIs per query. I want to query a million GIs, and for that HTTP POST should be used. Please read: http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_ (in required parameters - id).
In an HTTP POST query each id is separated by &id=. This is mentioned in sample application 4: http://www.ncbi.nlm.nih.gov/books/NBK25498/#_chapter3_Application_4_Finding_unique_se_

There is no difference in the format of the data between a GET and a POST request. For some reason the NCBI server expects a comma-separated list for id, but fails if &id= is given multiple times. Maybe the NCBI server is not in accordance with the CGI standard.

I guess that the problem is with Perl. Please pass your comma-separated GI list as a plain string to the $req instance. If you pass them as a list or an associative array/hash, Perl will presumably mangle them into multiple &id= parameters.

You are right, NCBI does not accept repeated &id= parameters. It worked as soon as I converted the IDs to a comma-separated list. It seems they have given incorrect code in their sample application 4 for HTTP POST.
Thanks a lot!
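For reference, the fix described above (a single id parameter holding a comma-separated list, sent in the POST body) can be sketched roughly as follows. This is only an illustration, not NCBI's sample code: the helper names build_epost_body, parse_epost_result and epost_ids are made up for this example, the https base URL is an assumption (the original script used http), and the QueryKey/WebEnv regexes assume the usual ePostResult XML layout.

```perl
use strict;
use warnings;

# Assumption: https base URL; the script in the question used http.
# With LWP, https also needs the LWP::Protocol::https module installed.
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Build the form body with ONE id parameter holding a comma-separated list,
# instead of repeating &id= for every GI.
sub build_epost_body {
    my ($db, @ids) = @_;
    return "db=$db&id=" . join(',', @ids);
}

# Pull QueryKey and WebEnv out of the ePostResult XML (undef if missing).
sub parse_epost_result {
    my ($xml) = @_;
    my ($key) = $xml =~ m{<QueryKey>(\d+)</QueryKey>};
    my ($web) = $xml =~ m{<WebEnv>(\S+)</WebEnv>};
    return ($key, $web);
}

# Send the POST request. LWP is loaded lazily, so the two helpers above
# can be used (and checked) without it.
sub epost_ids {
    my ($db, @ids) = @_;
    require LWP::UserAgent;
    require HTTP::Request;
    my $req = HTTP::Request->new(POST => $base . 'epost.fcgi');
    $req->content_type('application/x-www-form-urlencoded');
    $req->content(build_epost_body($db, @ids));
    my $response = LWP::UserAgent->new->request($req);
    die 'EPost failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return parse_epost_result($response->decoded_content);
}

# Example call (needs network access), using GIs from the question:
# my ($key, $web) = epost_ids('protein', 830003112, 830003110, 830003108);
# print "query_key=$key WebEnv=$web\n";
```

Keeping the body construction in its own function makes it easy to print and inspect exactly what is sent before involving the server.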
I'd be curious to know whether the system itself will allow you to pass 1 million GIs in a single Entrez POST request (once you figure out how to generate the right format). Let us know.
I have successfully passed 100,000 GIs with one POST request. Will try 1 million soon.
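Once EPost has returned a WebEnv and query_key for a list this large, the FASTA records would typically be pulled from the history server with EFetch in batches rather than in one request. Here is a rough sketch of generating the batch URLs with retstart and retmax. The helper name efetch_batch_urls, the batch size of 500, the https base URL and the placeholder WebEnv value are all assumptions for illustration.

```perl
use strict;
use warnings;

# Build one EFetch URL per batch of records stored on the history server.
# $count is the number of uploaded GIs; $batch_size is an arbitrary choice.
sub efetch_batch_urls {
    my ($web_env, $query_key, $count, $batch_size) = @_;
    # Assumption: https base URL (the thread's script used http).
    my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    my @urls;
    for (my $retstart = 0; $retstart < $count; $retstart += $batch_size) {
        push @urls, "$base?db=protein&query_key=$query_key&WebEnv=$web_env"
                  . "&retstart=$retstart&retmax=$batch_size"
                  . "&rettype=fasta&retmode=text";
    }
    return @urls;
}

# Example with placeholder history values (no network access needed):
my @urls = efetch_batch_urls('WEBENV_PLACEHOLDER', 1, 100_000, 500);
print scalar(@urls), " batch requests\n";
print "$urls[0]\n";
```

Each URL can then be fetched in turn (for example with LWP::UserAgent) and the output appended to a single FASTA file.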
I'm kinda surprised to (still) see this working ... didn't NCBI switch to the https protocol some time ago? I would have thought the http approach would have been phased out by now.