Download All Sequenced Genomes (Draft Or Complete) Within A Phylum From NCBI Or JGI?
11.1 years ago
microbeatic ▴ 80

The genomes on the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) are listed in alphabetical order with the BioProject ID at the end. And, there is no taxonomic information in the name. Is there a way to download only the genomes that belong to a specific phylum? For example, how do I download all the genome folders that belong to Actinobacteria?

ncbi genome bacteria • 9.0k views
11.1 years ago

" And, there is no taxonomic information in the name"

Wrong: you can find the taxon ID and name for each genome in ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt

For example, the entry for the folder Acaryochloris_marina_MBIC11017_uid58167:

Accession    GenbankAcc    Length    Taxid    ProjectID    TaxName    Replicon    Create Date    Update Date
NC_009926.1    CP000838.1    374161    329726    58167    Acaryochloris marina MBIC11017    plasmid pREB1    Oct 17 2007    Jun 10 2013  7:03:09:346PM

and the taxonomy record for Taxid 329726, at http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=329726&retmode=xml, gives the lineage:

 <Lineage>cellular organisms; Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales; Acaryochloris; Acaryochloris marina</Lineage>
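
The two lookups above can be chained to select genomes by phylum. What follows is only a minimal sketch, not a tested pipeline: it assumes summary.txt is tab-delimited with the header row shown above, that the E-utilities endpoint is reached over https, and the function names (`parse_summary`, `lineage_from_xml`, `fetch_lineage`, `taxids_in_phylum`) are made up for illustration.

```python
import re
import urllib.request

# Assumption: the same efetch endpoint as above, reached over https.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_summary(text):
    """Return one dict per replicon row, keyed by the header names.

    Assumes the columns are separated by tabs.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def lineage_from_xml(xml):
    """Extract the <Lineage> element as a list of taxon names."""
    match = re.search(r"<Lineage>(.*?)</Lineage>", xml, re.S)
    if not match:
        return []
    return [part.strip() for part in match.group(1).split(";")]


def fetch_lineage(taxid):
    """Ask the taxonomy database for the lineage of one taxid (network call)."""
    url = f"{EFETCH}?db=taxonomy&id={taxid}&retmode=xml"
    with urllib.request.urlopen(url) as resp:
        return lineage_from_xml(resp.read().decode("utf-8"))


def taxids_in_phylum(rows, phylum, lookup=fetch_lineage):
    """Return the set of taxids whose lineage contains the given phylum."""
    cache = {}  # one lookup per taxid, since each genome has several replicons
    keep = set()
    for row in rows:
        taxid = row["Taxid"]
        if taxid not in cache:
            cache[taxid] = lookup(taxid)
        if phylum in cache[taxid]:
            keep.add(taxid)
    return keep
```

Calling `taxids_in_phylum(parse_summary(open("summary.txt").read()), "Actinobacteria")` would then give the taxids to keep; the ProjectID column of the matching rows corresponds to the `uid` suffix of the folder names.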

Wow, I didn't realize this. My bad. Thanks. Is there a summary file for DRAFT too? I scrolled through the folder but didn't see one.

For planctomycete_KSU_1_uid163683, I found the taxon ID in the gbk file ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/planctomycete_KSU_1_uid163683/NZ_BAFH00000000.gbk, in the line /db_xref="taxon:247490"

Yes, that's for an individual genome. A summary like the one on the complete-genome FTP would have been better.

Right now I have an rsync set up between the FTP site and my database, but the list of genomes is only semi-automatic. I can use the summary file on the complete-genome FTP to build a list of Actinobacteria and make that list fully automatic, but I am not sure how to do it for the draft-genome FTP. Do I have to read in all the .gbk files for each organism in that folder?
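
One way to avoid reading every .gbk file in full is to read a single file per organism folder and stop at the first taxon cross-reference. This is only a sketch under assumptions: it assumes a local mirror with one sub-folder per organism containing uncompressed .gbk files, and that all files in a folder carry the same taxon ID. The function names `taxid_from_gbk` and `taxids_by_folder` are made up for illustration.

```python
import re
from pathlib import Path

# Matches lines such as:  /db_xref="taxon:247490"
TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')


def taxid_from_gbk(path):
    """Return the first taxon ID found in a GenBank file, or None."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = TAXON_RE.search(line)
            if match:
                return match.group(1)  # stop reading as soon as it is found
    return None


def taxids_by_folder(mirror_root):
    """Map each organism folder under mirror_root to a taxon ID."""
    result = {}
    for folder in sorted(Path(mirror_root).iterdir()):
        if not folder.is_dir():
            continue
        for gbk in sorted(folder.glob("*.gbk")):
            taxid = taxid_from_gbk(gbk)
            if taxid is not None:
                result[folder.name] = taxid
                break  # assumption: one file per organism is enough
    return result
```

The resulting folder-to-taxid map plays the role of the missing summary file: each taxid can be checked against the taxonomy lineage, and the matching folder names written out as the rsync list.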

11.1 years ago
Phil S. ▴ 700

Hi, maybe this Perl script solves your problem:

# This script downloads all RefSeq nucleotide records for the given organism
# and puts them in organism.fa
# Script is adapted from: http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large 

use strict;
use warnings;
use LWP::Simple;    # https URLs also need LWP::Protocol::https installed

# organism (any NCBI taxon name) from the command line; defaults to Fungi
my $organism = @ARGV ? $ARGV[0] : 'Fungi';

my $query = $organism . '[orgn]+AND+srcdb_refseq[prop]';
print STDERR "Searching RefSeq for $organism: $query\n";

# assemble the esearch URL (NCBI E-utilities require https)
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url  = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";

# send the esearch request; get() returns undef on failure
my $output = get($url);
die "esearch request failed\n" unless defined $output;

# parse WebEnv, QueryKey and Count (# records found)
my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
die "Could not parse esearch output\n"
    unless defined $web && defined $key && defined $count;

print STDERR "Found: $count records for $organism\n";
exit(0) if $count == 0;

# write to a temporary file first; it is renamed once the download is complete
open(my $out, '>', "tmp.$organism.fa") or die "Can't open tmp.$organism.fa: $!\n";

# retrieve data in batches of 500, advancing by the batch size as in the NCBI example
my $retmax  = 500;
my $fetched = 0;
for (my $ret = 0; $ret < $count; $ret += $retmax) {
    my $efetch_url = $base . "efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$ret";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    my $efetch_out = get($efetch_url);
    # stop instead of looping forever when a batch cannot be fetched
    die "efetch failed at retstart=$ret\n" unless defined $efetch_out;
    # count the FASTA header lines returned in this batch
    my $n = () = $efetch_out =~ /^>/mg;
    $fetched += $n;
    print $out $efetch_out;
    print STDERR "Fetched $fetched\n";
}
close($out) or die "Can't close tmp.$organism.fa: $!\n";

rename("tmp.$organism.fa", "$organism.fa") or die "Can't rename output file: $!\n";

Run it as:

perl scriptname organismname

In your case:

perl scriptname Actinobacteria

With no argument, the default behaviour is to download Fungi.

cheers

ps. see also Ncbi Refseq Viral Genomes


This is downloading sequences inside a category and storing them in one file, not genomes.

AFAIK this downloads only genome sequences; a first look into the file showed headers like 'xxx complete genome....'
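
A quick way to check this on the downloaded file is to tally the FASTA headers. The sketch below is only illustrative: the function name `tally_headers` is made up, and it assumes that complete genomes carry the phrase "complete genome" in their header line, as in the output seen above.

```python
from collections import Counter


def tally_headers(fasta_lines):
    """Count FASTA headers that do or do not mention 'complete genome'."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            if "complete genome" in line.lower():
                counts["complete genome"] += 1
            else:
                counts["other"] += 1
    return counts
```

For example, `with open("Actinobacteria.fa") as fh: print(tally_headers(fh))` reports how many records are labelled as complete genomes and how many are something else (plasmids, contigs, individual genes).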
