If I go to PubMed and enter this query, which currently returns around 2290 results:
"Retraction of Publication"[Publication Type]
then select "Send to File", format = XML, and click "Create File", the download generally takes a few seconds and returns a file with only one DOCTYPE line, as expected:
grep -c DOCTYPE ~/Downloads/pubmed_result.xml
# 1
grep DOCTYPE ~/Downloads/pubmed_result.xml
# <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2012//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_120101.dtd">
If I perform an equivalent query using the BioRuby implementation of EUtils:
require "rubygems"
require "bio"
ncbi = Bio::NCBI::REST.new
Bio::NCBI.default_email = "me@me.com"
retmax = ncbi.esearch_count("Retraction of Publication[ptyp]", {"db" => "pubmed"})
search = ncbi.esearch("Retraction of Publication[ptyp]", {"db" => "pubmed", "retmax" => retmax})
result = ncbi.efetch(search, {"db" => "pubmed", "retmode" => "xml"})
File.open("pubmed_result.xml", "w") do |f|
f.write(result)
end
it takes significantly longer to return the XML, the file is a slightly different size, and it contains multiple DOCTYPE lines, which breaks XML parsing:
grep -c DOCTYPE pubmed_result.xml
# 23
It appears that efetch returns separate, complete XML documents ("chunks") and concatenates them into one file. This does not occur if a smaller subset of the search variable, e.g. search[0..4], is passed to efetch (a rough workaround for parsing the concatenated output is sketched after the questions below). So:
- Is this issue due to passing too many IDs to efetch?
- Have other people observed it with other implementations of EUtils?
- Can it be resolved using, e.g., a POST request, as suggested in the EUtils documentation?
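As a stopgap, the concatenated output can be split back into individual documents before parsing. This is just a rough sketch of my own (the split pattern and the REXML usage are not part of BioRuby), assuming each chunk begins with its own XML declaration:

require "rexml/document"

raw = File.read("pubmed_result.xml")

# Split the concatenated efetch output at each XML declaration, so that
# every chunk is a complete, well-formed document on its own.
chunks = raw.split(/(?=<\?xml )/).reject { |c| c.strip.empty? }

# Parse each chunk separately and collect the PubmedArticle elements.
articles = chunks.flat_map do |chunk|
  REXML::Document.new(chunk).elements.to_a("//PubmedArticle")
end

puts "#{chunks.size} chunks, #{articles.size} articles"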
On further investigation: I forgot to pass retmax as a parameter to efetch. However, even when that is done, results still seem to be returned in batches of 100. This does not happen with the BioPerl EUtils library, so it may be a bug in (my version of?) BioRuby.
On even further investigation: it seems that adding the parameter "step = retmax" to efetch solves the problem.
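For anyone hitting the same thing: if I read the BioRuby source correctly, step is a third positional argument to efetch, defaulting to 100 (which would explain the batches), so the working call looks roughly like this:

# Pass retmax as the step so all IDs go out in a single EFetch request,
# which returns one XML document with a single DOCTYPE line.
result = ncbi.efetch(search, {"db" => "pubmed", "retmode" => "xml"}, retmax)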