This is a lot of genes! The results XML file is about 1.8 GB. I found that if you pass all of the IDs to efetch at once, you get a server error.
So here's the strategy I came up with: a shell script that uses elink to fetch the full list of IDs as an XML file, and a Perl script that extracts just the result IDs and splits them into a set of text files. Each text file, holding a limited number (1,000) of gene IDs, can then be sent to efetch, one at a time, to get the gene XML files. This results in about 42 separate XML files, each with info about 1,000 genes. efetch may accept more than 1,000 IDs at once -- I didn't experiment to find the practical limit.
Here's the shell script:
#!/usr/bin/bash
# Fetch the list of UIDs linked to genome ID 5:
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=genome&linkname=genome_gene&id=5" \
    > ids.xml
# Mung with Perl to extract the list of gene IDs.
# This produces a bunch of ids-nnn.txt text files.
perl getids.pl ids.xml
# For each of these text files, invoke efetch.
for file in ids-*.txt ; do
    outfile=${file%%.*}.xml
    echo "Getting gene info for $file"
    curl -X POST -d "@$file" \
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml" \
        > "$outfile"
done
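Note that the IDs are sent with POST (-d) rather than in the URL: a query string with 1,000 IDs would blow past typical URL length limits. If you'd rather end up with one big XML file than ~42 of them, the results can be stitched together afterwards. Here's a minimal sketch, assuming (as efetch's pretty-printed output does) that each result file puts its XML declaration, DOCTYPE, and <Entrezgene-Set> root tags on their own lines; the script name merge-xml.pl and the output name genes.xml are just placeholders:
#!/usr/bin/perl
# merge-xml.pl: combine ids-*.xml into genes.xml, keeping the XML
# declaration, DOCTYPE, and <Entrezgene-Set> wrapper only once.
use strict;
use warnings;

# Sort the input files numerically (ids-0.xml, ids-1.xml, ..., ids-41.xml).
my @files = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } glob "ids-*.xml";
open my $out, '>', 'genes.xml' or die "Can't write genes.xml: $!";
for my $i (0 .. $#files) {
    open my $in, '<', $files[$i] or die "Can't read $files[$i]: $!";
    while (<$in>) {
        # After the first file, skip the prolog and the opening root tag.
        next if $i > 0 && (/^<\?xml/ || /^<!DOCTYPE/ || /^<Entrezgene-Set>/);
        # Before the last file, skip the closing root tag.
        next if $i < $#files && /^<\/Entrezgene-Set>/;
        print $out $_;
    }
    close $in;
}
close $out;
This is deliberately line-oriented and fragile; for anything load-bearing, a real XML parser is the safer route.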
Here's the Perl script that munges the ID results:
#!/usr/bin/perl
use strict;
use warnings;

# The output will be split into separate files with this number of gene IDs in
# each.
my $genesPerFile = 1000;

# We'll discard the first <Id> in the input file -- it's the genome ID we
# queried with, not a gene ID; this flag keeps track of that.
my $first = 1;

# Number the output files starting with 0, and keep track of the number of
# gene IDs written to the current file so far.
my $outFileNum = 0;
my $geneNum = 0;

open my $out, '>', "ids-$outFileNum.txt" or die "Can't open output file: $!";
# First print out the query parameter name and the equals sign.
print $out "id=";

# For each line of input:
while (<>) {
    # We're only interested in lines that have an <Id> element; extract just
    # the numeric value.
    next unless /<Id>(\d+)<\/Id>/;
    my $id = $1;
    # Discard the first one.
    if ($first) {
        $first = 0;
        next;
    }
    # Print it out, comma-separated from the previous ID in this file.
    print $out "," if $geneNum > 0;
    print $out $id;
    $geneNum++;
    # If we've reached the limit for this output file, close it and open the
    # next one.
    if ($geneNum == $genesPerFile) {
        $geneNum = 0;
        print $out "\n";
        close $out;
        $outFileNum++;
        open $out, '>', "ids-$outFileNum.txt" or die "Can't open output file: $!";
        print $out "id=";
    }
}
print $out "\n";
close $out;

# If the total was an exact multiple of $genesPerFile, the last file got no
# IDs; remove it so the shell loop doesn't POST an empty query.
unlink "ids-$outFileNum.txt" if $geneNum == 0 && $outFileNum > 0;
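Before firing off ~42 POSTs, it's worth sanity-checking the split. A quick one-liner (assuming the ids-*.txt names produced above) counts the IDs in each file; every count should be 1,000 except possibly the last, and they should sum to the expected total:
perl -ne 'my $n = () = /\d+/g; print "$ARGV: $n IDs\n"' ids-*.txt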
Is there a way to request 42,876 genes with a single query? As the question suggests, I'm looking for a single file, not one for each gene.
I've updated my answer.
Can you explain what this means: "put each XML in a database geneid2xml"? Which DB would you suggest? "geneid2xml" does not seem to be a database vendor.
That could be any SQL database with primary-key=gene-id and value=xml, or an XML-oriented database like eXist.
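In other words, "geneid2xml" is just a name for a lookup table mapping each gene ID to its XML blob, not a product. A minimal sketch of the SQL variant, using SQLite through Perl's DBI and DBD::SQLite (the file name genes.db and the sample values are arbitrary placeholders):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# One table: the gene ID is the primary key, the value is the raw XML.
my $dbh = DBI->connect("dbi:SQLite:dbname=genes.db", "", "",
                       { RaiseError => 1 });
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS geneid2xml (
        gene_id INTEGER PRIMARY KEY,
        xml     TEXT NOT NULL
    )
});

# Store one gene's XML, replacing any previous copy.
my $sth = $dbh->prepare(
    "INSERT OR REPLACE INTO geneid2xml (gene_id, xml) VALUES (?, ?)");
$sth->execute(672, "<Entrezgene>...</Entrezgene>");  # placeholder values

# Look a record back up by its gene ID.
my ($xml) = $dbh->selectrow_array(
    "SELECT xml FROM geneid2xml WHERE gene_id = ?", undef, 672);
print "$xml\n";
$dbh->disconnect;
With the primary key on gene_id, fetching any one gene's record becomes an index lookup instead of a scan through 1.8 GB of XML.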