Hi everyone!
I am trying to do something similar to Nick: I want to download the GenBank results for a 12S eukaryote query. Since downloading through the website does not work (I tried several times but only ever get about 66% of the sequences), I wanted to try the script Leszek posted. This is my code, slightly adjusted from Leszek's:
#!/usr/bin/env python
import sys
from Bio import Entrez
from datetime import datetime

faa_output = "12S_eukaryotesGB.txt"
output_handle = open(faa_output, "w")

Entrez.email = "email@blabla.mail"
query = '12S[All Fields] AND ("Eukaryota"[Organism] OR eukaryota[All Fields])'
db = "nucleotide"
retmax = 10**9
retmode = 'text'
rettype = 'gb'
batchSize = 1000

# get the list of GIs for the given query
sys.stderr.write("Getting list of GIs for term=%s ...\n" % query)
handle = Entrez.esearch(db=db, term=query, retmax=retmax)
giList = Entrez.read(handle)['IdList']

# print info about the number of entries
sys.stderr.write("Downloading %s entries from NCBI %s database in batches of %s entries...\n" % (len(giList), db, batchSize))

# post the NCBI query so the results can be fetched from the history server
search_handle = Entrez.epost(db, id=",".join(giList))
search_results = Entrez.read(search_handle)
webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]

# fetch all results in batches of batchSize entries at once
for start in range(0, len(giList), batchSize):
    # print progress info
    tnow = datetime.now()
    sys.stderr.write("\t%s\t%s / %s\n" % (datetime.ctime(tnow), start, len(giList)))
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype, retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
    # sys.stdout.write(handle.read())
    output_handle.write(handle.read())
output_handle.close()
print "Saved"
The script works great, but after retrieving 4000 of the 186000 results I get this error, which I do not understand:
Traceback (most recent call last):
  File "script_epost_fetch_sequences.py", line 41, in <module>
    output_handle.write(handle.read())
  File "/usr/lib64/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib64/python2.7/httplib.py", line 578, in read
    return self._read_chunked(amt)
  File "/usr/lib64/python2.7/httplib.py", line 632, in _read_chunked
    raise IncompleteRead(''.join(value))
httplib.IncompleteRead: IncompleteRead(2450 bytes read)
Can anyone help me with this error?
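In case it helps, here is a minimal sketch of one way to work around this kind of dropped connection: wrap the fetch and read of each batch in a retry loop that catches httplib.IncompleteRead and tries the same batch again after a short pause. The retry count and sleep time below are arbitrary choices, not anything NCBI documents:

import sys
import time
from httplib import IncompleteRead  # Python 2; in Python 3 this is http.client.IncompleteRead
from Bio import Entrez

def fetch_batch(db, start, batch_size, webenv, query_key, retries=3):
    # fetch one efetch batch from the posted history, retrying on IncompleteRead
    for attempt in range(retries):
        try:
            handle = Entrez.efetch(db=db, retmode='text', rettype='gb',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)
            return handle.read()  # the dropped connection surfaces here
        except IncompleteRead:
            sys.stderr.write("\tIncompleteRead at %s, retry %s/%s\n" % (start, attempt + 1, retries))
            time.sleep(5)  # arbitrary pause before retrying the same batch
    raise RuntimeError("Batch starting at %s failed %s times" % (start, retries))

Inside the for loop you would then call output_handle.write(fetch_batch(db, start, batchSize, webenv, query_key)) instead of fetching and reading directly, so a single bad batch does not kill the whole download.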
Did you try iterating over the result-set IDs and fetching them one by one?
I am not sure NCBI would tolerate 1.3 million individual requests.
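A possible middle ground (just a sketch, reusing the variables from the script above; the batch size and pause are guesses, not documented NCBI limits) would be to keep the epost/WebEnv history but shrink the batches and pause between requests:

import time

batchSize = 200  # smaller batches are less likely to be cut off mid-transfer
for start in range(0, len(giList), batchSize):
    handle = Entrez.efetch(db=db, retmode=retmode, rettype=rettype,
                           retstart=start, retmax=batchSize,
                           webenv=webenv, query_key=query_key)
    output_handle.write(handle.read())
    time.sleep(1)  # stay polite to the NCBI servers between batches
output_handle.close()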