Hi!
This is a problem I keep coming back to and I would really like to finally solve it.
I'm having some trouble retrieving nucleotide sequence files from NCBI. More specifically, my queries don't seem to return all the relevant data that's available.
I am using BioPython to submit Entrez queries for genes of interest (e.g. rbcL) for taxa under a phylum (gymnosperms), and to retrieve all relevant files for species for which a sequence of the gene of interest is available. These files may be partial sequences, whole chromosome sequences, whole genome sequences, etc. (I process these afterwards, using each file's annotations to extract the part I'm interested in.)
Example of code for Entrez query to retrieve data for cox1:
    from Bio import Entrez, SeqIO

    Entrez.email = 'you@example.org'  # placeholder: NCBI asks for a contact address
    db = 'nucleotide'
    families = ['Cycadaceae', 'Zamiaceae', 'Ginkgoaceae', 'Welwitschiaceae', 'Gnetaceae', 'Ephedraceae', 'Pinaceae', 'Araucariaceae', 'Podocarpaceae', 'Sciadopityaceae', 'Cupressaceae', 'Taxaceae']
    for family in families:
        # generate query to Entrez eSearch (three spellings of the gene name)
        eSearch = Entrez.esearch(db=db, term='(' + family + '[Organism] AND cox1) OR '
                                 + '(' + family + '[Organism] AND coxI) OR '
                                 + '(' + family + '[Organism] AND coI)')
        res = Entrez.read(eSearch)
        eSearch.close()
        for id in res["IdList"]:
            # fetch each hit as a GenBank flat file and save it under ./partial_mt/
            handle = Entrez.efetch(db="nucleotide", id=id, rettype="gb", retmode="text")
            record = SeqIO.read(handle, "genbank")
            handle.close()
            SeqIO.write(record, './partial_mt/' + id + "_" + family + "_" + record.annotations['organism'] + ".gb", 'genbank')
I thought I was getting pretty good results until I started to notice I wasn't getting data for species that I would expect to be available. Searching manually on NCBI's website, I confirmed that relevant files were available for these species, but they would only appear in the results if the query was restricted to that species.
For example, there is a complete chloroplast genome sequence file for Juniperus formosana [KX832625.1]. This file is only returned when the full binomial name is used in the query, e.g.:
("Juniperus formosana"[Organism] OR juniperus formosana[All Fields]) AND chloroplast[All Fields]
Querying broader taxonomic terms does not return KX832625.1 or any similar complete chloroplast genome sequences for this species. For example:
("juniperus"[Organism] OR juniperus[All Fields]) AND chloroplast[All Fields]
("cupressaceae"[Organism] OR cupressaceae[All Fields]) AND chloroplast[All Fields]
The same applies to querying individual genes (e.g. rbcL) rather than "chloroplast".
I am having similar trouble with species in other groups as well (e.g. Pinaceae, which should be very well represented).
That said, I am still retrieving a large number of relevant files for many species, but these elusive ones are proving to be important.
Is there some way I can get more exhaustive data retrieval covering more of the relevant files in the database without an exhaustive list of all possible species?
Thank you!
In my experience, once a result set grows above a certain size, the returned results can become flaky: what you get at the command line doesn't match the web results exactly. I'm not sure why that is or how to fix it, but many people have observed it and reported it here.
I would email NCBI support and see if they have a recommendation.
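One thing worth checking first: ESearch returns only the first 20 IDs unless retmax is set, so a family-level query with hundreds of hits will silently drop most of them, while a species-level query with a handful of hits looks complete. Below is a minimal sketch of paging through a result set by comparing "Count" with the number of IDs collected. The names fetch_all_ids and entrez_search are hypothetical helpers written for this example, not part of Biopython, and the batch size of 500 is an arbitrary choice. NCBI also caps how far a single search can be paged, so very large result sets may still need the history server or EDirect.

```python
def fetch_all_ids(search, term, db="nucleotide", batch_size=500):
    """Collect every ID for `term` by paging with retstart/retmax.

    `search` is any callable that accepts db, term, retstart and retmax
    and returns a dict-like result with "Count" and "IdList" keys
    (the shape of a parsed ESearch result).
    """
    ids = []
    start = 0
    while True:
        result = search(db=db, term=term, retstart=start, retmax=batch_size)
        batch = list(result["IdList"])
        ids.extend(batch)
        total = int(result["Count"])  # total hits reported by the server
        start += batch_size
        # Stop once we have paged past the reported total, or if the
        # server returns an empty page (guards against an endless loop).
        if start >= total or not batch:
            break
    return ids


def entrez_search(**kwargs):
    """Hypothetical thin wrapper: run Entrez.esearch and parse the result."""
    from Bio import Entrez  # imported here so fetch_all_ids has no hard dependency

    handle = Entrez.esearch(**kwargs)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()
```

With Entrez.email set, something like fetch_all_ids(entrez_search, "Cupressaceae[Organism] AND chloroplast[All Fields]") should then return the full ID list rather than just the first page. If the length of that list still disagrees with the count shown on the website, that would be a concrete discrepancy to send to NCBI support.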