Getting strain names by searching RefSeq assembly accessions (GCF_)
6.8 years ago
horsedog ▴ 60

I have a bunch of RefSeq assembly accessions, like GCF_002514765.1, GCF_002485085.1, GCF_002201835.1, GCF_000593305.2, GCF_001887655.1, GCF_000194215.1, GCF_002098145.1, GCF_002807875.1

Now I want to use these to find out which genome each one is. For example, the first one is Escherichia coli strain MOD1-EC3823. I tried to use efetch for this, but it does not seem to work; it fails with "urllib.error.HTTPError: HTTP Error 400: Bad Request". Here is my Python code:

from Bio import Entrez
Entrez.email = "hulala@gmail.com"
ID = open("assembly_ID").read()
handle = Entrez.efetch(db="assembly", id= ID, rettype="gb")
print(handle.read())

Does anyone have any idea?

NCBI efetch python • 2.3k views
6.8 years ago
Joseph Hughes ★ 3.0k

Rewriting the following query in Python should get you what you want:

esearch -db assembly -query "GCF_002514765.1" | esummary | xtract -pattern DocumentSummary -element SpeciesName Sub_type Sub_value

The output is:

Escherichia coli    strain  MOD1-EC3823
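
For reference, a rough Python sketch of the same esearch then esummary chain with Biopython might look like the following. It assumes the assembly summary parsed by Entrez.read exposes SpeciesName and a Biosource/InfraspeciesList holding Sub_type and Sub_value; the function names extract_name and lookup_accession and the email argument are made up for illustration, so check the layout against a real record before relying on it.

```python
def extract_name(doc):
    """Join SpeciesName with every Sub_type/Sub_value pair of one summary.

    `doc` is one DocumentSummary, assumed to behave like a nested dict.
    """
    parts = [str(doc.get("SpeciesName", ""))]
    biosource = doc.get("Biosource", {})
    for infra in biosource.get("InfraspeciesList", []):
        parts.append(str(infra.get("Sub_type", "")))
        parts.append(str(infra.get("Sub_value", "")))
    return " ".join(p for p in parts if p)


def lookup_accession(accession, email):
    """Return 'species sub_type sub_value' for one GCF_ accession, or None."""
    # Imported here so extract_name() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email

    # Step 1: esearch turns the accession into the numeric UID(s).
    handle = Entrez.esearch(db="assembly", term=accession)
    uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not uids:
        return None

    # Step 2: esummary returns the document summary for the first UID.
    handle = Entrez.esummary(db="assembly", id=uids[0])
    summary = Entrez.read(handle, validate=False)
    handle.close()

    # Assumed layout: DocumentSummarySet -> DocumentSummary (list) -> record.
    doc = summary["DocumentSummarySet"]["DocumentSummary"][0]
    return extract_name(doc)
```

Calling lookup_accession("GCF_002514765.1", "you@example.org") would then stand in for the command line above, one accession per call.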

Hi, thanks, but it says "SyntaxError: invalid syntax" at Sub_value. Do you mean replacing

ID = open("assembly_ID").read()
handle = Entrez.efetch(db="assembly", id= ID, rettype="gb")

by your code? But here the -query is not just one ID; there are thousands of them.


You will need to loop in your Python code and query each accession one at a time.
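
A minimal sketch of such a loop, under some assumptions: the accession file is split on commas or whitespace instead of being passed to Entrez as one string, and `lookup` is a hypothetical caller-supplied function that takes one accession and returns its name (for example, one wrapping Entrez.esearch and Entrez.esummary). The names read_accessions and lookup_all and the default pause are illustrative choices, not part of Biopython.

```python
import re
import time


def read_accessions(text):
    """Split raw file content into accessions (comma or whitespace separated)."""
    return [tok for tok in re.split(r"[,\s]+", text) if tok]


def lookup_all(accessions, lookup, delay=0.5):
    """Call lookup(accession) once per accession and collect the results.

    `delay` is the pause in seconds between lookups. NCBI asks for at most
    3 requests per second without an API key, and one lookup may issue
    more than one request, so the default is deliberately conservative.
    """
    results = {}
    for i, acc in enumerate(accessions):
        if i:
            time.sleep(delay)
        try:
            results[acc] = lookup(acc)
        except Exception as err:
            # Record the failure and continue, so one bad accession
            # does not abort a run over thousands of IDs.
            results[acc] = f"ERROR: {err}"
    return results
```

With the file from the question, usage could be along the lines of `lookup_all(read_accessions(open("assembly_ID").read()), my_lookup)`, where `my_lookup` is whatever single-accession function you have written.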
