Trying to work with NCBI's Entrez API using Python
10 months ago
gdizzle12 • 0

Hello, I'm currently working with Biopython's Entrez module and finding it frustrating and poorly documented. I'm just trying to find all the sequence data for lacY in E. coli and read it with SeqIO.

https://www.ncbi.nlm.nih.gov/gene/949083

from Bio import Entrez, SeqIO
# Search for the E. coli lacY gene and collect the matching record IDs (NCBI UIDs)
handle = Entrez.esearch(db='nucleotide', retmax=10, term='Escherichia coli[Orgn] AND lacY[Gene]') 
record = Entrez.read(handle)
id_list = record['IdList']
# Efetch genbank data
handle1 = Entrez.efetch(db='nucleotide', id=id_list[0], rettype='gb', retmode='text')
print(handle1.read())
record1 = SeqIO.read(handle1, "genbank")
print(record1.seq)
# Error occurs, "ValueError: No records found in handle"

The library is able to download the ID list, and printing handle1.read() shows that it found the record for the link above, but SeqIO can't read the actual sequence data from it. I'm interested to know if anyone else has solved this programmatically. I could always download the FASTA files manually, but this is only a test run for a larger project I'm working on.

Thanks!

biopython python NCBI • 1.4k views
10 months ago

You throw away your results in this line:

print(handle1.read())

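handle1 is a file-like stream, not a string, so it can only be read once: the next time something reads from it (here, SeqIO.read), it gets nothing back (an empty string), which is why no records are found. It's designed this way so that large results, potentially millions of sequences, can be streamed rather than loaded all at once.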
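The read-once behaviour can be reproduced offline. This is only a sketch: io.StringIO stands in for the Entrez handle, on the assumption that the handle behaves like any other text stream in this respect, and the FASTA text is made up for illustration.

```python
from io import StringIO

# Stand-in for the handle returned by Entrez.efetch (illustrative data only)
handle = StringIO(">example\nATGC\n")

first = handle.read()   # consumes the whole stream
second = handle.read()  # nothing left to read

print(repr(first))
print(repr(second))

# Keeping the text in a variable lets it be wrapped in a fresh stream later
fresh = StringIO(first)
print(fresh.readline().strip())
```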

Do this if you're sure it's only one sequence:

from io import StringIO

handle1 = Entrez.efetch(db='nucleotide', id=id_list[0], rettype='gb', retmode='text')
res = handle1.read()
record1 = SeqIO.read(StringIO(res), 'genbank')

Edit:

but it's probably multiple sequences:

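handle1 = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb', retmode='text')
# SeqIO.parse yields one record per GenBank entry; looping over the handle itself would yield single lines
for record1 in SeqIO.parse(handle1, 'genbank'):
    # do stuff like SeqIO.write with the record, or analyse in some other way
    print(record1.id)
handle1.close()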
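To see why the multi-record case needs a record-aware parser rather than a plain loop over the handle, here is a small offline sketch. The two-record text is invented, StringIO again stands in for the handle, and split_genbank_records is a hypothetical helper that relies only on the fact that each GenBank flat-file record ends with a "//" line. In practice SeqIO.parse does this job and also builds proper record objects.

```python
from io import StringIO

# Two heavily truncated, made-up GenBank-style records
text = "LOCUS       A\nORIGIN\n//\nLOCUS       B\nORIGIN\n//\n"

# Looping over a text stream yields lines, not records
lines = list(StringIO(text))


def split_genbank_records(stream):
    """Yield the text of each record; a '//' line marks the end of a record."""
    current = []
    for line in stream:
        current.append(line)
        if line.strip() == "//":
            yield "".join(current)
            current = []


records = list(split_genbank_records(StringIO(text)))
print(len(lines), "lines,", len(records), "records")
```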

Hmm, I'm still having issues finding FASTA data related to this link (lacY lactose permease). Is there something unusual about the way NCBI accesses its databases? The link contains 'gene', so when I esearch for it I get the correct ID in my list, but once I use efetch with that ID it fails. Is there some other ID that efetch actually uses to find FASTA data?

My guesses are:

1. I've picked an odd edge case where the data I'm looking for is actually in another database, so trying to access it using its gene ID fails because it isn't there.
2. I'm missing some crucial information about how the efetch API works and I'm using it wrong.

I'll keep looking through the E-utilities documentation, but it seems pretty vague on these details. Thanks for the help!

handle = Entrez.efetch(db='gene' ,id='949083', rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
#----> 2 record = SeqIO.read(handle, "gb")
# ValueError: No records found in handle
handle.close()
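One possible explanation, offered as a sketch rather than a confirmed fix: the Gene database stores gene records rather than sequences, so a FASTA request against db='gene' may come back with nothing SeqIO can parse. A route that may work is to ask ELink which nuccore records are linked to the gene ID and fetch those instead. The helper names linked_ids and fetch_gene_linked_records, the max_records cap and the email argument are all hypothetical, and for a bacterial gene the linked records may be whole genomes, so the results can be large.

```python
def linked_ids(elink_result):
    """Collect target IDs from a parsed ELink result (the list Entrez.read returns)."""
    ids = []
    for linkset in elink_result:
        for linkset_db in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linkset_db["Link"])
    return ids


def fetch_gene_linked_records(gene_id, email, max_records=5):
    """Follow gene -> nuccore links and parse the linked records as GenBank."""
    # Imported here so linked_ids stays usable without Biopython or a network
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks callers to supply a contact address
    link_handle = Entrez.elink(dbfrom="gene", db="nuccore", id=gene_id)
    nuccore_ids = linked_ids(Entrez.read(link_handle))[:max_records]
    link_handle.close()
    if not nuccore_ids:
        return []

    fetch_handle = Entrez.efetch(db="nuccore", id=nuccore_ids,
                                 rettype="gb", retmode="text")
    records = list(SeqIO.parse(fetch_handle, "genbank"))
    fetch_handle.close()
    return records
```

Calling fetch_gene_linked_records('949083', 'you@example.org') needs network access and Biopython; the same ID may appear under more than one link name, so de-duplicating the list first may be worthwhile.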

Hi, I came across the very same problem. Did you maybe find a solution, or do you have a follow-up?
