Is there any method to download a complete genome fasta file of a certain species from NCBI automatically
3
0
Entering edit mode
8.4 years ago
thustar ▴ 130

Hi biostars!

I'd like to write a program to download some complete genome fasta files of a list of species from NCBI. However, when I use BioPython, I can not get the result I want.

from Bio import Entrez Entrez.email = "thustar@mlp.edu" search_term = 'Acidaminococcus sp. D21' handle = Entrez.esearch(db='nucleotide', term=search_term) record = Entrez.read(handle) ids = record['IdList']

ids will return lots of numbers and some of ids are not fasta files. Is there a better method to fetch exactly a file contain all contigs in a certain genome or at least automatically remove those ids whose corresponding files are not sequence files?

Thanks

genome • 4.8k views
ADD COMMENT
1
Entering edit mode
8.4 years ago
thustar ▴ 130

Thanks for everybody answering this question!

I would like to share my python codes to solve this problem partially.

import os
import cPickle

database_file = "assembly_summary_genbank.txt"
destination='/data/thustar/composition_explore/genomes/'
miss_list = []
def fetch_genome(genome_name,database_file,destination):
    global miss_list
    exist = False
    temp = os.popen('grep "'+genome_name+'" '+database_file).read()#read bash result in python, if grep did not search, return ''
    if len(temp)>0:
        link = temp.split('\t')[19]
        extra_name = link.split('/')[-1]
        link = link+'/'+extra_name+'_genomic.fna.gz'
        exist = True
        print "HIT "+genome_name
        if not os.path.isfile(destination+genome_name.replace(' ','_')+'.gz'):
            genome_name = genome_name.replace('/','_')
            os.system("wget -c "+link+" -O "+destination+genome_name.replace(' ','_')+'.gz')

        if not exist:
            print genome_name+" not exist"
            miss_list.append(genome_name)

with open('genomes.txt','r') as f:
    for line in f:
        genome_name = line.split('\t')
        genome_name = genome_name[0]
        fetch_genome(genome_name,database_file,destination)

with open('miss_list.txt','wb') as g:
    cPickle.dump(miss_list,g)
ADD COMMENT
0
Entering edit mode
8.4 years ago
piet ★ 1.9k

You should enclose the name of the taxon in double quotes, and you should restrict to sequences of complete replicons, see also A: Assembly Accession Number

search_term = '"Pseudomonas aeruginosa"[Organism] AND complete[Properties]'
ADD COMMENT
0
Entering edit mode

Unfortunately, after I change search_term to "Acidaminococcus sp. D21[orgn] AND complete genome", I got ids = ['224815814', '224815813', '224815811', '224815805', '224815803']. The lengths of the fasta file fetched are 118 645 690 1817 3389

There must be something wrong because the length of reference genome is several million bases. The codes of fetching procejure are these: for i in xrange(len(ids)): handle = Entrez.efetch(db="nucleotide", id=ids[i], rettype="fasta", retmode="text") record = handle.read() print len(record)

Any further information?

ADD REPLY
0
Entering edit mode

You have quoted the search term in the wrong way. Please try:

"Acidaminococcus sp. D21"[orgn] AND (complete[Properties] or "wgs master[Properties]")

https://www.ncbi.nlm.nih.gov/nuccore/?term=%22Acidaminococcus%20sp.%20D21%22%5BOrganism%5D%20AND%20(complete%5BProperties%5D%20or%20%22wgs%20master%22%5BProperties%5D)

This will show you that there is currently no complete genome available for "Acidaminococcus sp. D21", but one WGS set.

ADD REPLY
0
Entering edit mode

Thanks for your reply. I guess @5heikki provides a good solution, but still have some steps to make it perfect. I would really appreciate it if you could tell me where to download the missing 38 genomes in my reply to @5heikki

ADD REPLY
0
Entering edit mode
8.4 years ago

Different genomes have been sequenced by different institutes with different motivations and interests. As such there is no single site where you can find all the genome information that you may want.

Thus said the NCBI is a good place to start as they curate GenBank database whose contents get mirrored and exchanged with other meta-genomic warehouses such as EMBL and DDBJ.

Please have a look at this as well http://www.ncbi.nlm.nih.gov/sites/genome and this to download genome data for various organisms. ftp://ftp.ncbi.nlm.nih.gov/genomes/

I would suggest you refine your question to be more specific.

ADD COMMENT
0
Entering edit mode

OK. Thanks for your advice. I would like to download 290 genomes in this tabel: https://bitbucket.org/berkeleylab/metabat/wiki/genomes_table . Therefore I want to write a program to download them automatically. It is a waste of time to search and download 290 fasta files of complete genome manually.

I did search some genomes on http://www.ncbi.nlm.nih.gov/sites/genome, but I prefer to use some api or something automatical to download all files I need.

Do you have any suggestions?

ADD REPLY
2
Entering edit mode

grep the names from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

Column 20 provides the base ftp address for the assembly data.

ADD REPLY
0
Entering edit mode

Thanks a lot! That file is quite useful. Most of genomes I want could be found in it.

But there are still 38 genomes not found. They are

'Alistipes onderdonkii DSM 19147',
 'Bacteroidales bacterium ph8',
 'Bacteroides sp. 2_1_7',
 'Bacteroides pectinophilus ATCC 43243',
 'Clostridiales bacterium OBRC5-5',
 'Clostridium asparagiforme DSM 15981',
 'Clostridium bartlettii DSM 16795',
 'Clostridium bolteae ATCC BAA-613',
 'Clostridium cf. saccharolyticum K10',
 'Clostridium citroniae WAL-17108',
 'Clostridium clostridioforme 2_1_49FAA',
 'Clostridium hathewayi DSM 13479',
 'Clostridium leptum DSM 753',
 'Clostridium methylpentosum DSM 5476',
 'Clostridium nexile DSM 1787',
 'Clostridium ramosum DSM 1402',
 'Clostridium saccharolyticum WM1',
 'Clostridium spiroforme DSM 1552',
 'Clostridium symbiosum WAL-14163',
 'Clostridium symbiosum WAL-14673',
 'Eubacterium biforme DSM 3989',
 'Eubacterium cylindroides T2-87',
 'Eubacterium dolichum DSM 3991',
 'Eubacterium eligens ATCC 27750',
 'Eubacterium hallii DSM 3353',
 'Eubacterium rectale ATCC 33656',
 'Eubacterium rectale DSM 17629',
 'Eubacterium rectale M104/1',
 'Eubacterium siraeum DSM 15702',
 'Eubacterium siraeum V10Sc8a',
 'Eubacterium siraeum 70/3',
 'Lachnospiraceae bacterium 4_1_37FAA',
 'Lactobacillus acidophilus 30SC',
 'Odoribacter splanchnicus DSM 220712',
 'Ruminococcus obeum ATCC 29174',
 'Ruminococcus obeum A2-162',
 'Streptococcus salivarius JIM8780',
 'butyrate-producing bacterium SSC/2'

Is there any other source to find them?

ADD REPLY
0
Entering edit mode

Have you looked in here: http://ftp.ncbi.nih.gov/genomes/refseq/bacteria/ The names may be different since you may have strain designation from a different repository.

ADD REPLY
0
Entering edit mode

Yeah, I looked in that website. I can not find the genome names which could not be found by my program on that website. It is possible that name designation from different repository leads to name mismatch. If the problem is that the name on NCBI is different from the genome name I want to get, there should be some file like a dictionary to map different names in the same strain. Then the question will be how to translate those names which can not be found directly in the list to genome names corresponding to the same strain in the list?

ADD REPLY
0
Entering edit mode

With one random sample, this is with Entrez direct:

grep "$(esearch -db taxonomy -query "Clostridium nexile DSM 1787" | efetch -format docsum | xtract -element ScientificName)" assembly_summary_genbank.txt | awk 'BEGIN{OFS=FS="\t"}{print $8,$20}'
Tyzzerella nexilis DSM 1787 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000156035.2_ASM15603v2

You can see this info also here. Maybe the other ones are also like this..

ADD REPLY

Login before adding your answer.

Traffic: 2079 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6