Hi folks!
I have a list of NCBI assembly accession numbers, and I'm trying to return species name (latin and common) plus taxid. A few weeks ago, I wrote a bit of code that appeared to do the trick.
#!/usr/bin/env python
import csv
from Bio import Entrez
Entrez.email = xxx'
def get_organism_taxonomy(accession):
handle = Entrez.efetch(db='assembly', id=accession, rettype='xml')
record = Entrez.read(handle)
organism = record['DocumentSummarySet']['DocumentSummary'][0]['Organism']
taxonomy = record['DocumentSummarySet']['DocumentSummary'][0]['Taxid']
return organism, taxonomy
def update_csv(input_file, output_file):
with open(input_file, 'r') as csvfile:
reader = csv.reader(csvfile)
header = next(reader) # Read the header row
header += ['Organism', 'Taxonomy'] # Add new column headers
rows = []
for row in reader:
accession = row[0] # Assuming accession numbers are in the first column
print(accession)
organism, taxonomy = get_organism_taxonomy(accession)
row += [organism, taxonomy] # Append organism and taxonomy to the row
rows.append(row)
with open(output_file, 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow(header)
writer.writerows(rows)
nput_csv = '/lab/wengpj01/genomes/accessions_pt2.csv'
output_csv = '/lab/wengpj01/genomes/all_genomes_taxa_pt2.csv'
update_csv(input_csv, output_csv)
Now I'm returning to the same code, and it doesn't appear to be working anymore. I keep getting an error that says:
Traceback (most recent call last): File "./get_info.py", line 9, in <module>
handle=Entrez.efetch(db='assembly', id='GCA_000001405.29',rettype='xml') File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 207, in efetch
return _open(cgi, variables, post=post) File "/usr/local/lib/python3.8/dist-packages/Bio/Entrez/__init__.py", line 606, in _open
handle = urlopen(cgi) File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout) File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response) File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error( File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args) File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args) File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 400: Bad Request
This error happens anytime I set rettype="xml" even though I swear this worked 10 days ago. The only thing that I know I've changed is that I got an api key. I've tried setting that in the code, but it doesn't seem to help.
What's weird is that when I try other rettypes, I get trivial responses. For instance:
handle=Entrez.efetch(db='assembly', id='GCA_000001405.29')
record=Entrez.read(handle)
print(record)
Returns only:
['1405', '29']
rather than a full xml about the human genome.
Any advice about how I could get this working would be greatly appreciated! Thanks so much!
There must be some temporary glitch.
Following command works (via EntrezDirect):
You should probably not advertise your personal email. It is not needed to understand the code.
I don't know what exactly is the problem with your attempts, but I do know that many servers temporarily fail to operate properly during the weekends. You know, maintenance and such things. If the script was working before, I'd try it on a working day before concluding that something is wrong.