marine.bergot • 21 months ago
hi! I have an issue with requests I'm making with Entrez. I'm trying to get information by BioProject ID from the bioproject database, but it only seems to work when it wants to. Inside a for loop I get:
Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)305084
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in get_publication_infos
File "/Users/cea/miniconda3/lib/python3.10/site-packages/Bio/Entrez/__init__.py", line 196, in efetch
return _open(request)
File "/Users/cea/miniconda3/lib/python3.10/site-packages/Bio/Entrez/__init__.py", line 586, in _open
handle = urlopen(request)
File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 634, in http_response
response = self.parent.error(
File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 563, in error
return self._call_chain(*args)
File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/Users/cea/miniconda3/lib/python3.10/urllib/request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
and outside the for loop, for the same bio_id:
handle = Entrez.efetch(db="bioproject", retmode="xml", id=305084)
bio_file = handle.read()
soup = BS.BeautifulSoup(bio_file, 'xml')
and I can print soup with no problem. I guess there is a problem with the way I'm managing the API requests? If someone has any idea...
This is my full code :
from Bio import Entrez
import bs4 as BS
import lxml
import ipdb
Entrez.email = "XXXXX"
Entrez.api_key ="XXXXXX"
def get_ids_bioproject(IDs):
    # return a set of all BioProject ids related to the assemblies
    bio_ids = set()
    dict_bioproject_assembly = dict()
    for ID in IDs:
        esummary_handle = Entrez.esummary(db="assembly", id=ID, report="full")
        esummary_record = Entrez.read(esummary_handle)
        bio_id = esummary_record['DocumentSummarySet']['DocumentSummary'][0]['GB_BioProjects'][0]['BioprojectId']
        #dict_bioproject_assembly[ID] = bio_id
        bio_ids.add(bio_id)
    return (dict_bioproject_assembly, bio_ids)
def get_publication_infos(bioproject_ids):
    # return a dict with information about the publication related to each assembly, through its BioProject id
    dict_info_journal = dict()
    list_odd_blank_assembly = list()
    for bio_id in bioproject_ids:
        print(bio_id)
        print('Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)' + bio_id)
        handle = Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)
        bio_file = handle.read()
        handle.close()  # close right after reading, so the continue branches don't skip it
        soup = BS.BeautifulSoup(bio_file, 'xml')
        print(soup)
        if soup.find_all('Publication') == list():
            list_odd_blank_assembly.append(bio_id)
            continue
        elif len(soup.find_all('Title')) < 2:
            list_odd_blank_assembly.append(bio_id)
            continue
        else:
            dict_info_journal[bio_id] = dict()
            dict_info_journal[bio_id]['Title'] = soup.find_all('Title')[1].string
            dict_info_journal[bio_id]['Journal'] = soup.JournalTitle.string
            dict_info_journal[bio_id]['Author'] = soup.Last.string + " et al."
            dict_info_journal[bio_id]['Year'] = soup.Year.string
            dict_info_journal[bio_id]['Pubmed'] = "https://pubmed.ncbi.nlm.nih.gov/" + soup.find("Publication")['id']
    return (dict_info_journal, list_odd_blank_assembly)
query = "Microbacterium[Organism] AND latest_refseq[filter] NOT partial[filter]"
handle = Entrez.esearch(term=query, db="Assembly", retmax=900)
IDs = Entrez.read(handle)["IdList"]
(dict_bioproject_assembly, bioproject_ids) = get_ids_bioproject(IDs)
(dict_info_journal, list_odd_blank_assembly) = get_publication_infos(bioproject_ids)
thanks for your help!
(I'm using Python 3.10.9 and Biopython 1.80.)
NCBI is a public resource, and when running large queries against it please put a pause/sleep section in your code. I see that you are using an NCBI API key, but that also has a limit on queries per unit time.
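For what it's worth, a small helper can enforce that pause automatically. A minimal sketch (the `throttled` decorator and the `fetch_bioproject` stand-in are illustrative, not part of Biopython):

```python
import time
from functools import wraps

def throttled(min_interval):
    """Decorator enforcing a minimum delay between successive calls."""
    def decorator(func):
        last_call = [float("-inf")]
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

# With an API key, NCBI allows up to 10 requests per second,
# so ~0.11 s between calls stays under the limit.
@throttled(0.11)
def fetch_bioproject(bio_id):
    # stand-in for: Entrez.efetch(db="bioproject", retmode="xml", id=bio_id)
    return bio_id
```

Calling `fetch_bioproject` in a loop then never exceeds the rate, no matter how fast the loop itself runs.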
That ID does not seem to have any publication with it.
yeah, but my request is against the bioproject db, not the assembly db; I guess that's why. As you can see in my example, I'm able to find it outside of my loop. Can I just use the sleep() function?
Ok, I have corrected the search database above. There appears to be no publication associated with this ID. I understand that NCBI will add a publication (in case submitters do not do it) if it finds one for a particular SRA accession, but this process is likely not perfect.
Use some way to put a delay between your queries; sleep() would be fine.

Well, I'm not saying I'm finding a paper for this request, just that I don't get error 400. This is my output from a different terminal:
and after parsing that, I can say that yes, there is no publication. My question is: why does it work in one terminal while in the other I get Error 400? I tried time.sleep(5) after each request, and the result is the same :(
Mysteries of the internet. In general, Entrez Direct (EDirect) does not seem to have a good way of handling errors when they occur during piping operations.
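On the Biopython side, one pragmatic workaround for such intermittent failures is to retry the request a few times with an increasing delay. A minimal sketch (`fetch_with_retry` is a hypothetical helper, not part of Biopython or EDirect):

```python
import time
from urllib.error import HTTPError

def fetch_with_retry(fetch, *args, retries=3, delay=1.0, **kwargs):
    """Call fetch(*args, **kwargs), retrying on HTTPError with a growing delay."""
    for attempt in range(retries):
        try:
            return fetch(*args, **kwargs)
        except HTTPError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay * (attempt + 1))

# Against Entrez this would be used as, e.g.:
# handle = fetch_with_retry(Entrez.efetch, db="bioproject", retmode="xml", id=bio_id)
```

If the 400 really is transient, this smooths over it; if it persists after all retries, the original HTTPError is re-raised so the failing ID can still be logged.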