If I go to PubMed and enter this query, which currently returns around 2290 results:
"Retraction of Publication"[Publication Type]
then select "Send to File", format = XML, and click "Create File", the download generally takes a few seconds and returns a file with only one DOCTYPE line, as expected:
grep -c DOCTYPE ~/Downloads/pubmed_result.xml
# 1
grep DOCTYPE ~/Downloads/pubmed_result.xml
# <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2012//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_120101.dtd">
If I perform an equivalent query using the BioRuby implementation of EUtils:
require "rubygems"
require "bio"
ncbi = Bio::NCBI::REST.new
Bio::NCBI.default_email = "me@me.com"
retmax = ncbi.esearch_count("Retraction of Publication[ptyp]", {"db" => "pubmed"})
search = ncbi.esearch("Retraction of Publication[ptyp]", {"db" => "pubmed", "retmax" => retmax})
result = ncbi.efetch(search, {"db" => "pubmed", "retmode" => "xml"})
File.open("pubmed_result.xml", "w") do |f|
f.write(result)
end
it takes significantly longer to return the XML, the file is a slightly different size, and it contains multiple DOCTYPE lines, which breaks XML parsing:
grep -c DOCTYPE pubmed_result.xml
# 23
It appears that efetch returns separate, complete XML documents ("chunks") and concatenates them into one file. This does not occur if a smaller subset of the search variable, e.g. search[0..4], is passed to efetch (a rough workaround for parsing the concatenated output is sketched after the questions below). So:
- Is this issue due to passing too many IDs to efetch?
- Have other people observed it with other implementations of EUtils?
- Can it be resolved using, e.g., a POST request, as suggested in the EUtils documentation?
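As a stopgap, the concatenated output can be split back into individual documents before parsing. This is just a rough sketch of my own (the split pattern and the REXML usage are not part of BioRuby), assuming each chunk begins with its own XML declaration:

require "rexml/document"

raw = File.read("pubmed_result.xml")

# Split the concatenated efetch output at each XML declaration, so that
# every chunk is a complete, well-formed document on its own.
chunks = raw.split(/(?=<\?xml )/).reject { |c| c.strip.empty? }

# Parse each chunk separately and collect the PubmedArticle elements.
articles = chunks.flat_map do |chunk|
  REXML::Document.new(chunk).elements.to_a("//PubmedArticle")
end

puts "#{chunks.size} chunks, #{articles.size} articles"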
On further investigation: I forgot to pass retmax as a parameter to efetch. However, even when that is done, results still seem to be returned in batches of 100. This does not happen with the BioPerl EUtils library, so it may be a bug in (my version of?) BioRuby.
On even further investigation: it seems that adding the parameter "step = retmax" to efetch solves the problem.
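For anyone hitting the same thing: if I read the BioRuby source correctly, step is a third positional argument to efetch, defaulting to 100 (which would explain the batches), so the working call looks roughly like this:

# Pass retmax as the step so all IDs go out in a single EFetch request,
# which returns one XML document with a single DOCTYPE line.
result = ncbi.efetch(search, {"db" => "pubmed", "retmode" => "xml"}, retmax)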