The genomes on the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) are listed in alphabetical order with the BioProject ID at the end, and there is no taxonomic information in the folder names. Is there a way to download only the genomes that belong to a specific phylum? For example, how do I download all the genome folders that belong to Actinobacteria?
# This script downloads all genomes of the given organism in RefSeq and puts them in organism.fa
# Script is taken from: http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large
use LWP::Simple;

# organism name comes from the first command-line argument (default: Fungi)
if ($#ARGV + 1 > 0) { $organism = $ARGV[0]; } else { $organism = 'Fungi'; }
$query = $organism . '[orgn]+AND+srcdb_refseq[prop]';
print STDERR "Searching RefSeq for $organism: $query\n";

#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";

#post the esearch URL
$output = get($url);

#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);
print STDERR "Found: $count records for $organism\n";
if ($count == 0) {
    exit(0);
}

#open output file for writing
open(OUT, ">tmp.$organism.fa") || die "Can't open file!\n";

#retrieve data in batches of 500
$retmax = 500;
for ($ret = 0; $ret < $count;) {
    $efetch_url = $base . "efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$ret";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    $efetch_out = get($efetch_url);
    # count number of sequences returned
    $actual_sequences_returned = $efetch_out =~ s/>/\n>/g;
    # an empty or failed batch would leave $ret unchanged and loop forever
    die "Fetch failed after $ret records\n" unless $actual_sequences_returned;
    $ret += $actual_sequences_returned;
    print OUT "$efetch_out";
    print STDERR "Fetched $ret\n";
}
close OUT;

# only give the file its final name once every batch has been written
rename("tmp.$organism.fa", "$organism.fa");
Wow, I didn't realize this. My bad. Thanks. Is there summary information for DRAFT too? I scrolled through the folder but didn't see it.
For planctomycete_KSU_1_uid163683, I found it in the .gbk file ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/planctomycete_KSU_1_uid163683/NZ_BAFH00000000.gbk as the qualifier /db_xref="taxon:247490"
Yes, but that's for an individual genome. A summary like the one on the complete genome FTP would have been better.
Right now I have an rsync set up between the FTP site and my database, but the list of genomes is only semi-automatic. I can use the summary file on the complete genome FTP to build a list of actinos and make it fully automatic, but I am confused about how to do the same for the draft genome FTP. Do I have to read in all the .gbk files for each organism in that folder?
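If it does come down to reading the .gbk files, pulling out the taxon ID is at least cheap. Here is a minimal sketch, assuming each file carries a /db_xref="taxon:NNN" qualifier like the one quoted above. The function name taxon_ids_from_genbank and the sample record are made up for illustration; only the ID 247490 comes from this thread.

```perl
use strict;
use warnings;

# Return the unique NCBI taxonomy IDs found in /db_xref="taxon:NNN"
# qualifiers of a GenBank flat-file record passed in as one string.
# (Hypothetical helper name, not part of any NCBI tool.)
sub taxon_ids_from_genbank {
    my ($text) = @_;
    my %seen;
    # A global match in list context returns every captured ID;
    # grep keeps first-seen order while dropping repeats.
    return grep { !$seen{$_}++ } $text =~ /\/db_xref="taxon:(\d+)"/g;
}

# Made-up excerpt shaped like the source feature of a .gbk file.
my $sample = <<'END';
FEATURES             Location/Qualifiers
     source          1..1000
                     /organism="planctomycete KSU-1"
                     /db_xref="taxon:247490"
END

print join(",", taxon_ids_from_genbank($sample)), "\n";   # prints 247490
```

To run it over a real file you would slurp the file into a string and pass it in. The taxon ID only identifies the organism, so it still has to be mapped to a phylum in a separate step.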