Hi biostars!
I'd like to write a program to download the complete genome FASTA files for a list of species from NCBI. However, when I use Biopython, I cannot get the result I want.
from Bio import Entrez
Entrez.email = "thustar@mlp.edu"
search_term = 'Acidaminococcus sp. D21'
handle = Entrez.esearch(db='nucleotide', term=search_term)
record = Entrez.read(handle)
ids = record['IdList']
ids returns a lot of numbers, and some of the IDs do not correspond to FASTA files. Is there a better way to fetch exactly one file containing all contigs of a given genome, or at least to automatically remove the IDs whose records are not sequence files?
Thanks
Unfortunately, after I changed search_term to "Acidaminococcus sp. D21[orgn] AND complete genome", I got ids = ['224815814', '224815813', '224815811', '224815805', '224815803']. The lengths of the fetched FASTA files are 118, 645, 690, 1817, and 3389.
There must be something wrong, because the reference genome is several million bases long. The code for the fetching procedure is:
for i in xrange(len(ids)):
    handle = Entrez.efetch(db="nucleotide", id=ids[i], rettype="fasta", retmode="text")
    record = handle.read()
    print len(record)
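One thing worth checking when comparing these numbers: len() on the text returned by efetch counts every character, including the ">" header line and the newlines, so it is not the sequence length in bases. A minimal sketch of counting bases per record from FASTA text follows; the helper name fasta_sequence_lengths is made up for illustration and is not part of Biopython.

```python
def fasta_sequence_lengths(fasta_text):
    """Map each FASTA header (without the leading '>') to its length in bases.

    Hypothetical helper: plain string handling, no Biopython required.
    """
    lengths = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)  # count only sequence characters
    return lengths
```

Passing handle.read() from efetch to this function would give per-record base counts, which makes it easier to see whether an ID points at a short fragment or at something genome-sized.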
Any further information?
You have quoted the search term in the wrong way. Please try:
https://www.ncbi.nlm.nih.gov/nuccore/?term=%22Acidaminococcus%20sp.%20D21%22%5BOrganism%5D%20AND%20(complete%5BProperties%5D%20or%20%22wgs%20master%22%5BProperties%5D)
This will show you that there is currently no complete genome available for "Acidaminococcus sp. D21", but one WGS set.
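The same query can be sent from Biopython. Below is a rough sketch, assuming the term decoded from the URL above (organism name in double quotes, then the Properties filters). The function names build_genome_term and search_genome_ids and the retmax default are illustrative choices, not anything prescribed by NCBI or Biopython.

```python
def build_genome_term(organism):
    """Build the Entrez query: quoted organism plus complete / WGS-master filters."""
    return (
        '"{}"[Organism] AND '
        '(complete[Properties] OR "wgs master"[Properties])'.format(organism)
    )


def search_genome_ids(organism, email, retmax=20):
    """Run the query against the nucleotide database and return the ID list.

    Hypothetical wrapper; needs network access and Biopython installed.
    """
    # Imported here so build_genome_term can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="nucleotide", term=build_genome_term(organism), retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return record["IdList"]
```

Calling search_genome_ids("Acidaminococcus sp. D21", "you@example.org") should return whatever the URL above lists at the time you run it, so the result can change as NCBI updates its records.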
There is one in RefSeq genomes.
http://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Acidaminococcus_sp._D21/latest_assembly_versions/GCF_000174215.1_ASM17421v1/
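If you want to script the download from that directory, here is a small sketch. It assumes the usual RefSeq layout in which the genomic FASTA is named after the assembly directory with a "_genomic.fna.gz" suffix; please check the directory listing first, since that file name is my assumption and the path may have moved. Both function names are made up for illustration.

```python
import urllib.request


def genomic_fasta_url(assembly_dir_url):
    """Derive the genomic FASTA URL from an assembly directory URL.

    Assumes the file is <directory name>_genomic.fna.gz (not verified here).
    """
    base = assembly_dir_url.rstrip("/")
    assembly_name = base.rsplit("/", 1)[-1]
    return "{}/{}_genomic.fna.gz".format(base, assembly_name)


def download_genome(assembly_dir_url, out_path):
    """Download the gzipped genomic FASTA to out_path (needs network access)."""
    urllib.request.urlretrieve(genomic_fasta_url(assembly_dir_url), out_path)
    return out_path
```

For the directory above, genomic_fasta_url would point at GCF_000174215.1_ASM17421v1_genomic.fna.gz inside that folder, and download_genome would save it locally as a gzipped file containing all contigs of the assembly.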
Thanks for your reply. I think @5heikki provides a good solution, but a few steps are still needed to make it complete. I would really appreciate it if you could tell me where to download the 38 missing genomes mentioned in my reply to @5heikki.