Question

Trying to work with NCBI's Entrez API using Python.

11 months ago

gdizzle12 • 0

Hello, I'm currently working with Biopython's 'Entrez' library and finding it very frustrating and lacking proper documentation. I'm just trying to find all the sequence data for lacY in E. coli and load it into SeqIO.

https://www.ncbi.nlm.nih.gov/gene/949083

from Bio import Entrez, SeqIO
# Search for e.coli lacY gene and find id's (GenBank Accession Numbers)
handle = Entrez.esearch(db='nucleotide', retmax=10, term='Escherichia coli[Orgn] AND lacY[Gene]') 
record = Entrez.read(handle)
id_list = record['IdList']
# Efetch genbank data
handle1 = Entrez.efetch(db='nucleotide', id=id_list[0], rettype='gb', retmode='text')
print(handle1.read())
record1 = SeqIO.read(handle1, "genbank")
print(record1.seq)
# Error occurs, "ValueError: No records found in handle"

The library is able to download the ID list, and printing the output of handle1.read() shows that it has found the record from the link provided, but it can't download the actual FASTA data from it. I'm interested to know if anyone else has been able to solve this programmatically. I could always download the FASTA files manually, but this was only a test run for a larger project I'm working on.

Thanks!

biopython python NCBI • 1.5k views

ADD COMMENT • link updated 10 months ago by ninastanisic4 • 0 • written 11 months ago by gdizzle12 • 0

score 1 · Answer 1 · 2024-01-03

11 months ago

Philipp Bayer 8.8k

You throw away your results in this line:

print(handle1.read())

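handle1 is a file-like stream that can only be read once, so after handle1.read() has consumed it, the next reader (here SeqIO.read) gets nothing back (empty string). It's designed this way so that a response with millions of sequences can be processed piece by piece instead of being loaded all at once.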
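The read-once behaviour can be reproduced without touching NCBI at all. This is only a sketch: it uses io.StringIO as a stand-in for the efetch handle, on the assumption that the real handle behaves like any other text stream, and the record text is made up.

```python
from io import StringIO

# Stand-in for the handle returned by Entrez.efetch (assumption: it behaves
# like an ordinary read-once text stream). The content is made-up demo text.
handle = StringIO("LOCUS       DEMO\n//\n")

first = handle.read()   # consumes the whole stream
second = handle.read()  # the stream is now exhausted

print(repr(first))
print(repr(second))

# Keeping the text in a variable lets you print it AND parse it later,
# by wrapping it in a fresh StringIO.
reusable = StringIO(first)
```

The second read comes back empty, which is the situation SeqIO.read runs into after print(handle1.read()).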

Do this if you're sure it's only one sequence:

from io import StringIO

handle1 = Entrez.efetch(db='nucleotide', id=id_list[0], rettype='gb', retmode='text')
res = handle1.read()
record1 = SeqIO.read(StringIO(res), 'genbank')

Edit:

but it's probably multiple sequences:

handle1 = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb', retmode='text')
# SeqIO.parse yields one SeqRecord per GenBank entry in the response
for record1 in SeqIO.parse(handle1, 'genbank'):
    # do stuff like SeqIO.write with the record, or analyse in some other way
    print(record1.id)
handle1.close()
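One thing to watch with multi-record responses: iterating over a text handle directly gives you lines, not records, so the record boundaries have to come from a parser such as SeqIO.parse. A Biopython-free sketch of the difference, using made-up text where each record ends with the GenBank "//" terminator:

```python
from io import StringIO

# Two made-up, minimal GenBank-like records, each terminated by "//".
text = "LOCUS       A\n//\nLOCUS       B\n//\n"

# Iterating over a text stream yields one LINE at a time.
lines = list(StringIO(text))

# Records are delimited by the "//" terminator line; a real parser
# (e.g. SeqIO.parse) does this splitting properly for you.
records = [chunk.strip() for chunk in text.split("//\n") if chunk.strip()]

print(len(lines), "lines,", len(records), "records")
```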

ADD COMMENT • link 11 months ago by Philipp Bayer 8.8k

Hmm, I'm still having issues finding FASTA info related to this link (lacY lactose permease). Is there something weird about the way NCBI accesses its databases? The link contains 'gene', so when I esearch for it I get the correct ID in my list, but once I use efetch with that ID it fails. Is there a super secret ID that efetch actually uses to find FASTA info? My assumptions are that: 1. I've picked a very weird edge case where the data I'm looking for is actually in another database, so trying to access it using its gene ID fails because it's not actually there. 2. I'm missing some crucial info about how the efetch API actually works and I'm using it very wrong. I'll keep looking through the EUtils documentation, but it seems pretty vague on these details. Thanks for the help!

handle = Entrez.efetch(db='gene' ,id='949083', rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
#----> 2 record = SeqIO.read(handle, "gb")
# ValueError: No records found in handle
handle.close()

ADD REPLY • link 11 months ago by gdizzle12 • 0
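A possible explanation, offered with some caution: as far as I know, the gene database holds gene records rather than sequences, so rettype="fasta" is not something efetch supports for db='gene'. One route that may work is to ask elink which nucleotide records are linked to the gene ID and then efetch those from nuccore. The sketch below is untested against NCBI. fetch_gene_linked_records and linked_ids are hypothetical helper names, the email address is something you have to supply yourself, and the linked records may turn out to be whole genomes rather than just the lacY region (efetch has seq_start/seq_stop parameters for trimming, not shown here).

```python
def linked_ids(elink_result, link_name=None):
    """Collect the IDs from a parsed elink result (the list from Entrez.read).

    If link_name is given, only link sets with that LinkName are used.
    """
    ids = []
    for linkset in elink_result:
        for linksetdb in linkset.get("LinkSetDb", []):
            if link_name is None or linksetdb.get("LinkName") == link_name:
                ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids


def fetch_gene_linked_records(gene_id, email, max_records=5):
    """Hypothetical helper: gene ID -> linked nuccore records (SeqRecord list)."""
    # Imported here so linked_ids can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address

    # Step 1: find nucleotide records linked to the gene record.
    handle = Entrez.elink(dbfrom="gene", db="nuccore", id=gene_id)
    result = Entrez.read(handle)
    handle.close()

    ids = linked_ids(result)[:max_records]
    if not ids:
        return []

    # Step 2: fetch those records from nuccore, where sequences live.
    handle = Entrez.efetch(db="nuccore", id=ids, rettype="gb", retmode="text")
    records = list(SeqIO.parse(handle, "genbank"))
    handle.close()
    return records
```

Usage would be something like fetch_gene_linked_records("949083", "you@example.org"), with the placeholder address replaced by your own.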

Hi, I came across the very same problem. Did you maybe find a solution, or do you have a follow-up?

ADD REPLY • link 10 months ago by ninastanisic4 • 0