Question

Getting strain names by searching RefSeq assembly accession numbers (GCF_)

0

7.6 years ago

horsedog ▴ 60

I have a bunch of RefSeq assembly accession numbers, such as GCF_002514765.1, GCF_002485085.1, GCF_002201835.1, GCF_000593305.2, GCF_001887655.1, GCF_000194215.1, GCF_002098145.1, GCF_002807875.1

Now I want to use these to find out which genome each one is. For example, the first one is Escherichia coli strain MOD1-EC3823. I tried to use efetch for this, but it does not work; it fails with "urllib.error.HTTPError: HTTP Error 400: Bad Request". Here is my Python code:

from Bio import Entrez
Entrez.email = "hulala@gmail.com"
ID = open("assembly_ID").read()
handle = Entrez.efetch(db="assembly", id= ID, rettype="gb")
print(handle.read())

Does anyone have any idea?

NCBI efetch python • 2.5k views

score 2 · Answer 1 · 2018-01-23

2

7.6 years ago

Joseph Hughes ★ 3.0k

Rewriting the following EDirect query in Python should get you what you want:

esearch -db assembly -query "GCF_002514765.1" | esummary | xtract -pattern DocumentSummary -element SpeciesName Sub_type Sub_value

The output is:

Escherichia coli    strain  MOD1-EC3823
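
For illustration, here is one way the EDirect pipeline above might be rewritten with Biopython. It is a minimal sketch, not the poster's code: the function names `extract_strain` and `fetch_document_summary` are made up for this example, the e-mail address is a placeholder, and the nesting `Biosource` → `InfraspeciesList` → `Sub_type`/`Sub_value` is an assumption about how the assembly esummary record is laid out, so check it against a real response.

```python
def extract_strain(doc):
    """Return (species, sub_type, sub_value) from one assembly DocumentSummary.

    Assumes the record nests strain information as
    Biosource -> InfraspeciesList -> Sub_type / Sub_value.
    """
    species = str(doc.get("SpeciesName", ""))
    infra = doc.get("Biosource", {}).get("InfraspeciesList", [])
    if infra:
        first = infra[0]
        return species, str(first.get("Sub_type", "")), str(first.get("Sub_value", ""))
    # No strain/isolate information attached to this record.
    return species, "", ""


def fetch_document_summary(accession, email):
    """Look up one accession: esearch for its UID, then esummary for the record."""
    # Imported here so the parsing helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email

    # Step 1 (esearch): translate the GCF_ accession into an internal assembly UID.
    handle = Entrez.esearch(db="assembly", term=accession)
    uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not uids:
        raise LookupError(f"no assembly record found for {accession}")

    # Step 2 (esummary): fetch the document summary for that UID.
    handle = Entrez.esummary(db="assembly", id=uids[0])
    summary = Entrez.read(handle, validate=False)
    handle.close()
    return summary["DocumentSummarySet"]["DocumentSummary"][0]
```

Calling `fetch_document_summary("GCF_002514765.1", "you@example.org")` and passing the result to `extract_strain` should correspond to the three fields that the xtract command prints. Keeping the parsing separate from the network call makes it easy to check against a saved record.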

0

Hi, thanks, but it says "SyntaxError: invalid syntax" at Sub_value. Do you mean replacing

ID = open("assembly_ID").read()
handle = Entrez.efetch(db="assembly", id= ID, rettype="gb")

with your code? But here the -query is not just one ID; there are thousands of them.

Reply • 7.6 years ago by horsedog ▴ 60

0

You will need to loop in your Python code and query each accession one at a time.
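
A rough sketch of such a loop is given below. The helper names `parse_accessions` and `lookup_all` are hypothetical, and `lookup` stands for whatever function performs the esearch and esummary calls for a single accession. The sketch assumes the file holds accessions separated by commas or newlines. The 0.4 second pause is a conservative guess meant to stay under the three requests per second that NCBI asks for without an API key; it may need to be longer if each lookup makes several requests.

```python
import re
import time


def parse_accessions(text):
    """Split file contents on commas and whitespace into clean accession strings."""
    return [token for token in re.split(r"[,\s]+", text) if token]


def lookup_all(accessions, lookup, delay=0.4):
    """Call lookup(accession) once per accession, pausing between calls.

    A failed lookup is recorded as an error string rather than aborting
    the whole run, which matters when there are thousands of accessions.
    """
    results = {}
    for index, accession in enumerate(accessions):
        if index:
            time.sleep(delay)  # pause between requests, not before the first
        try:
            results[accession] = lookup(accession)
        except Exception as exc:
            results[accession] = f"ERROR: {exc}"
    return results
```

With the question's file, this might be used as `lookup_all(parse_accessions(open("assembly_ID").read()), lookup)`. Passing the whole file contents as a single `id`, as in the original snippet, sends every accession plus the newlines in one request, which is a plausible cause of the HTTP 400 error.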

Reply • 7.6 years ago by Joseph Hughes ★ 3.0k