Question

NCBI Relevant files not being returned as query results. Trying to build exhaustive dataset.

19 months ago

chordata • 0

Hi!

This is a problem I keep coming back to and I would really like to finally solve it.

I'm having some trouble retrieving nucleotide sequence files from NCBI. More specifically, the query results I'm getting don't seem to be returning all the relevant data that's available.

I am using BioPython to submit Entrez queries for genes of interest (e.g. rbcL) for taxa within a phylum (gymnosperms), and to retrieve all relevant files for species for which a sequence of the gene of interest is available. These files may be partial sequences, whole chromosome sequences, whole genome sequences, etc. (I process them afterwards, using each file's annotations to extract the part I'm interested in.)

Example of code for Entrez query to retrieve data for cox1:

from Bio import Entrez, SeqIO

Entrez.email = 'you@example.org'  # placeholder: NCBI asks for a contact address

db = 'nucleotide'

families = ['Cycadaceae', 'Zamiaceae', 'Ginkgoaceae', 'Welwitschiaceae', 'Gnetaceae', 'Ephedraceae', 'Pinaceae', 'Araucariaceae', 'Podocarpaceae', 'Sciadopityaceae', 'Cupressaceae', 'Taxaceae']

for family in families:
    # build the eSearch term: the family name paired with each spelling of the gene
    term = ' OR '.join('(' + family + '[Organism] AND ' + gene + ')' for gene in ('cox1', 'coxI', 'coI'))
    search_handle = Entrez.esearch(db=db, term=term)
    res = Entrez.read(search_handle)
    search_handle.close()

    for seq_id in res["IdList"]:
        # fetch each hit as a GenBank flat file and save it under ./partial_mt/
        handle = Entrez.efetch(db=db, id=seq_id, rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        SeqIO.write(record, './partial_mt/' + seq_id + "_" + family + "_" + record.annotations['organism'] + ".gb", 'genbank')

I thought I was getting pretty good results until I noticed I wasn't getting data for species that I would expect to be available. Searching manually on NCBI's website, I confirmed that relevant files were available for these species, but they appeared in the results only when the query named the species explicitly.

For example, there is a complete chloroplast genome sequence file for Juniperus formosana [KX832625.1]. This file is only returned when the full binomial name is used in the query, e.g. ("Juniperus formosana"[Organism] OR juniperus formosana[All Fields]) AND chloroplast[All Fields].

Querying broader taxonomic terms does not return KX832625.1 or any similar complete chloroplast genome sequences for this species. For example:

("juniperus"[Organism] OR juniperus[All Fields]) AND chloroplast[All Fields]

("cupressaceae"[Organism] OR cupressaceae[All Fields]) AND chloroplast[All Fields]

The same applies to querying individual genes (e.g. rbcL), rather than "chloroplast".

I am having similar trouble with species in other groups as well (e.g. Pinaceae, which should be very well represented).

That said, I am still retrieving a large number of relevant files for many species, but these elusive ones are proving to be important.

Is there some way I can get more exhaustive data retrieval covering more of the relevant files in the database without an exhaustive list of all possible species?

Thank you!

NCBI Entrez BioPython • 766 views

updated 19 months ago by GenoMax 147k • written 19 months ago by chordata • 0

In my observations, once the result sets are above a certain size, the returned results can become flaky; that is, the results at the command line don't match the web results exactly. I'm not sure why that is or how to fix it, but it is something that many people have observed and reported on here.

I would reach out to NCBI support by email and see if they have a recommendation.
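
One mismatch that is easy to rule out from Python before contacting support: ESearch returns only the first 20 IDs unless `retmax` is set, while the `Count` field reports the total number of hits. A minimal sketch of that check follows; `is_truncated` is a hypothetical helper name, and its argument is assumed to be the dictionary-like object that `Entrez.read` returns for an ESearch call. Whether this explains the missing records in this thread is an assumption, not a confirmed diagnosis.

```python
def is_truncated(search_result):
    """Return True when ESearch reported more hits than it returned IDs.

    `search_result` is assumed to look like the parsed ESearch result from
    Bio.Entrez.read: a mapping with a "Count" string and an "IdList" list.
    """
    total_hits = int(search_result["Count"])
    returned_ids = len(search_result["IdList"])
    return total_hits > returned_ids


# Hand-written stand-in for a parsed result (illustrative, not real NCBI data)
example = {"Count": "57", "IdList": ["1"] * 20}
print(is_truncated(example))  # True: 57 hits reported, only 20 IDs returned
```

If the check comes back True, looping over `IdList` alone will silently skip records.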

19 months ago by Istvan Albert 102k

score 2 · Answer 1 · 2023-04-18

Using the minimum number of filters seems to work better in general. Keep the original searches broad and do any filtering you need on the results, rather than depending on the searches to do it.
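
In Python, the "search broadly, filter afterwards" idea could be sketched roughly as below. This is an illustration under assumptions, not a tested pipeline: `broad_term` and `features_for_gene` are hypothetical helper names, and records are assumed to be Biopython `SeqRecord` objects whose features carry a `gene` qualifier.

```python
def broad_term(taxon, keyword=None):
    """Build a deliberately loose Entrez query: one taxon, at most one keyword."""
    term = f"{taxon}[Organism]"
    if keyword:
        term += f" AND {keyword}"
    return term


def features_for_gene(record, synonyms):
    """Return the features of `record` whose /gene qualifier matches a synonym.

    Matching is case-insensitive, so spellings such as "COX1", "cox1" and
    "coxI" can be caught locally instead of being spelled out in the query.
    """
    wanted = {name.lower() for name in synonyms}
    hits = []
    for feature in record.features:
        genes = feature.qualifiers.get("gene", [])
        if any(gene.lower() in wanted for gene in genes):
            hits.append(feature)
    return hits


print(broad_term("Cupressaceae", "chloroplast"))
# Cupressaceae[Organism] AND chloroplast
```

Each downloaded record would then be passed through `features_for_gene(record, ["cox1", "coxI", "coI"])`, and records with no matching feature discarded.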

If I do the following search I see the chloroplast genome as one of the two results.

$ esearch -db nuccore -query "Juniperus formosana [ORGN] AND chloroplast genome" 
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>MCID_643eee91a6e1fc584a61f37d</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>2</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>


$ esearch -db nuccore -query "Juniperus formosana [ORGN] AND chloroplast genome" | esummary
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE DocumentSummarySet>
    <DocumentSummarySet status="OK">
      <DocumentSummary>
        <Id>1143272258</Id>
        <Caption>KX832625</Caption>
        <Title>Juniperus formosana chloroplast, complete genome</Title>
        <Extra>gi|1143272258|gb|KX832625.1|</Extra>

Here are some additional examples (results truncated to save space):

$ esearch -db nuccore -query "juniperus [ORGN] AND chloroplast genome" | esummary | xtract -pattern DocumentSummary -element Caption,Title
NC_065034       Juniperus pingii chloroplast, complete genome
NC_062328       Juniperus przewalskii chloroplast, complete genome
NC_065032       Juniperus chinensis chloroplast, complete genome
NC_068784       Juniperus osteosperma chloroplast, complete genome
NC_065035       Juniperus procumbens chloroplast, complete genome
NC_065033       Juniperus gaussenii chloroplast, complete genome
NC_062329       Juniperus przewalskii subsp. pendula chloroplast, complete genome
NC_062083       Juniperus rigida chloroplast, complete genome
NC_061760       Juniperus seravschanica chloroplast, complete genome

One more

$ esearch -db nuccore -query "pinaceae AND chloroplast genome" | esummary | xtract -pattern DocumentSummary -element Caption,Title | head -10
NC_065459       Pinus rigida chloroplast, complete genome
NC_064363       Picea purpurea chloroplast, complete genome
NC_063591       Picea brachytyla chloroplast, complete genome
NC_062404       Pinus tabuliformis var. henryi voucher Ph_36 chloroplast, complete genome
NC_061650       Larix griffithii var. speciosa voucher 2017LYSLs01 chloroplast, complete genome
NC_061649       Larix potaninii var. australis voucher 2017LYSLp01 chloroplast, complete genome
NC_061647       Larix himalaica voucher 2017LYSLh01 chloroplast, complete genome
NC_065458       Pinus echinata chloroplast, complete genome
NC_061646       Larix griffithii voucher 2017LYSLg01 chloroplast, complete genome
NC_067769       Pinus caribaea chloroplast, complete genome
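
For the BioPython script in the question, a rough counterpart to these EDirect pipelines is to keep the search on the NCBI history server (the `WebEnv` and `QueryKey` values visible in the `esearch` output above) and page through it, rather than relying on the IDs returned by a single `esearch` call. The sketch below assumes Biopython is installed and network access is available; `batch_starts`, `fetch_all_records`, the batch size of 200 and the e-mail address are illustrative assumptions, not values from this thread.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to cover `total` hits in batches."""
    return list(range(0, total, batch_size))


def fetch_all_records(term, db="nuccore", batch_size=200, email="you@example.org"):
    """Yield every GenBank record matching `term`, one batch at a time.

    Needs Biopython and network access; the import sits inside the function
    so that batch_starts stays usable on its own.
    """
    from Bio import Entrez, SeqIO

    Entrez.email = email  # placeholder address; replace with your own

    # usehistory="y" stores the hit list server-side and returns WebEnv/QueryKey
    search_handle = Entrez.esearch(db=db, term=term, usehistory="y")
    search = Entrez.read(search_handle)
    search_handle.close()

    total = int(search["Count"])
    for start in batch_starts(total, batch_size):
        fetch_handle = Entrez.efetch(
            db=db,
            rettype="gb",
            retmode="text",
            retstart=start,
            retmax=batch_size,
            webenv=search["WebEnv"],
            query_key=search["QueryKey"],
        )
        # SeqIO.parse reads the multi-record GenBank text of one batch
        yield from SeqIO.parse(fetch_handle, "genbank")
        fetch_handle.close()


print(batch_starts(450, 200))  # [0, 200, 400]
```

Iterating over `fetch_all_records("Cupressaceae[Organism] AND chloroplast genome")` should then walk through every hit the search reports. Complete genomes are large, so a smaller batch size may be more practical.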