My problem is the following: I have a list of GI identifiers from the NCBI nucleotide database. For instance, take just this one: `76365841`. I want to extract the "isolation source" term from it. The answer here is "Everglades wetlands", which you can see by using `efetch`.
However, when I hit a full chromosome with a huge sequence, my program downloads the full sequence, and Biopython's `Entrez.parse` is unable to handle that. For instance, with `332640072`.
Is there any way to build a request to NCBI that batch-downloads the sequence information (including the isolation source) WITHOUT downloading the actual sequence (the A, G, T, C bases)?
If you want to see the program:
#python
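from Bio import Entrez
Entrez.email = "your.name@example.org"  # placeholder: NCBI asks for a contact address
gis = ['332640072', '76365841', '22506766', '389043336']
# Fetches the full records as XML, sequence included
response = Entrez.efetch(db="nucleotide", id=gis, retmode="xml")
# Parsing fails on the chromosome-sized record (332640072)
records = list(Entrez.parse(response, validate=True))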
Not all sequences have that information, but here is an example using Entrez Direct:
epost -db nuccore -id 22506766,332640072,76365841,389043336 | efetch -format docsum | xtract -element SubName | tr "\t" "\n"
3
PB131|Panama: Panama Province, Las Cumbres Lake|lake water at 5 m depth during dry season|9.0986 N 79.5392 W
F124|USA: Florida|Everglades wetlands
SFD1-19|USA: San Francisco Delta, Mildred Island 2000-07-20
I'm not so sure the SubName element is standard though.
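A rough Biopython equivalent of the docsum request above might look like the sketch below. It rests on a few assumptions that are not confirmed in this thread: that the version 2.0 document summary for `nuccore` carries a `SubType` field listing the keys (for example `strain|country|isolation_source`) in the same order as the `|`-separated values in `SubName`, that each summary exposes its identifier as a `uid` attribute, and that `Entrez.read` accepts the response with `validate=False`. The helper names `isolation_source` and `fetch_isolation_sources` are made up for illustration.

```python
def isolation_source(sub_type, sub_name):
    """Pair '|'-separated SubType keys with SubName values (assumed layout)
    and return the isolation_source value, or None if it is absent."""
    keys = sub_type.split("|") if sub_type else []
    values = sub_name.split("|") if sub_name else []
    return dict(zip(keys, values)).get("isolation_source")


def fetch_isolation_sources(gis, email):
    """Hypothetical helper: request document summaries only (no sequence)
    and map each uid to its isolation source."""
    from Bio import Entrez  # imported here so the pure helper above has no dependency

    Entrez.email = email
    # version="2.0" asks for the newer docsum format; assumed to contain SubType/SubName
    handle = Entrez.esummary(db="nuccore", id=",".join(gis), version="2.0")
    result = Entrez.read(handle, validate=False)
    handle.close()

    summaries = result["DocumentSummarySet"]["DocumentSummary"]
    return {
        doc.attributes["uid"]: isolation_source(
            str(doc.get("SubType", "")), str(doc.get("SubName", ""))
        )
        for doc in summaries
    }
```

Calling `fetch_isolation_sources(['332640072', '76365841'], "your.name@example.org")` (placeholder address) would then return a dictionary keyed by identifier, with `None` for records that have no isolation source.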
Thanks for the answer! But your solution still downloads the whole >10 MB sequence.
It most certainly does not download the whole sequence.
I wanted to get only the meta-information.