Question

Ncbi Refseq Viral Genomes

5

Entering edit mode

13.1 years ago

Lorddoskias ▴ 160

I want to create custom blastdb with all viruses available in the refseq. But I don't know which source files to use. My first point is: ftp://ftp.ncbi.nih.gov/genomes/Viruses/

From research I concluded that I might need the

all.fna.tar.gz

file, since it supposedly contains nucleotide information for all viruses in the refseq, however it turned out that, for example, the Bluetongue_virus_uid14938 is doesn't have an entry in this archive BUT it has a directory and respectively files if I want download the all.gbk.tar.gz archive.

So my question is which archive (file types) should I use in order to create the most complete database of viruses that are in refseq? SHould I used the fna/ffn and just concatenate the files and send them to makeblastdb OR should I manually parse the .gbk files and create fasta files out of them - involving basically extracing the respective fasta sequences from each .gbk and rebuilding the header?

ncbi blast • 14k views

ADD COMMENT • link updated 11.2 years ago by bw. ▴ 260 • written 13.1 years ago by Lorddoskias ▴ 160

score 10 · Answer 1 · 2011-10-29

10

Entering edit mode

13.1 years ago

Pierre Lindenbaum 164k

Search NCBI entrez for

"Viruses"[PORG] AND srcdb_refseq[PROP]

http://www.ncbi.nlm.nih.gov/sites/nuccore?term=%22Viruses%22[PORG]+AND+srcdb_refseq[PROP]

download your result as FASTA and create your blast database.

ADD COMMENT • link 13.1 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

If you follow Pierre's advice you do not need to "go over each of the records" - simply choose "send to file" and save as fasta. Also, if you require sequences from other taxa (fungi, bacteria), you should edit your question, since originally you mentioned only viruses.

ADD REPLY • link 13.1 years ago by Neilfws 49k

1

Entering edit mode

how can i do this with linux. i mean with command line.

ADD REPLY • link 11.9 years ago by rehma.ar ▴ 290

2

Entering edit mode

Do what - just download, or search and download? The former: use wget + link to file in FTP site. The latter: read about EUtils.

ADD REPLY • link 11.9 years ago by Neilfws 49k

1

Entering edit mode

@neilfws the "Send to" part was crucial - up until now I've been working with the FTP files. Thanks

ADD REPLY • link 13.1 years ago by Lorddoskias ▴ 160

0

Entering edit mode

Thanks but this is not really a workable solution because I need to to do this in a fashion which would facilitate bulk additions and not just of 10-15 viruses

ADD REPLY • link 13.1 years ago by Lorddoskias ▴ 160

0

Entering edit mode

what do you mean with "in a fashion which would facilitate bulk additions..." ? 3946 records: what is missing in this entrez dataset ?

ADD REPLY • link 13.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

By "bulk addition" I mean i have to be able to automate it somehow - going over each of those records and manually downloading the entries is just not an option. Programatically working with the NBIC's ftp is easy enough, the thing is there are SO MANY files/releases that I don't really know which one contains the information I need. Is there some place which explains WHAT is held WHERE on the NCBI ftp - I've read the README's for the various directories but it still is not clear to me.

ADD REPLY • link 13.1 years ago by Lorddoskias ▴ 160

0

Entering edit mode

Basically I want to create a single blast db which has ALL viral refseq, all bacterial refseq and all fungi refseq - where can I get those respective db's. I thought that <ftp: ftp.ncbi.nih.gov="" genomes="" <whatever-i'm=""> looking for> would suffice but as it turns out - it doesn't.

ADD REPLY • link 13.1 years ago by Lorddoskias ▴ 160

score 1 · Answer 2 · 2013-09-25

Downloading via "Send To" didn't give the full fasta - the file only contained 640 sequences out of over 4000 (not sure why).

There's a perl script shown in EUtils docs that should do the job (http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large), but it assumes that the number of sequences returned by NCBI servers on every 'get' call is equal to the value of the "retmax" url arg - which turns out not to be the case - sometimes they return less, which causes the script to download less sequences overall.

I fixed it so that it retrieves all 4944 viral sequences:

# This script downloads all viral genomes in RefSeq and puts them in viruses.fa
# Script is taken from: http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large 

use LWP::Simple;


if($#ARGV + 1 > 0) {
    $organism = $ARGV[0];
} else {
    $organism = 'viruses';
}

$query = $organism.'[orgn]+AND+srcdb_refseq[prop]';
print STDERR "Searching RefSeq for $organism: $query\n";
#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";


#post the esearch URL
$output = get($url);


#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);

print STDERR "Found: $count records for $organism\n"; 
if($count == 0) {
    exit(0);
}

#open output file for writing
open(OUT, ">tmp.$organism.fa") || die "Can't open file!\n";


#retrieve data in batches of 500
$retmax = 500;
for ($ret = 0; $ret < $count; ) {
    $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$ret";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    $efetch_out = get($efetch_url);
    $actual_sequences_returned = $efetch_out =~ s/>/\n>/g;  # count number of sequences returned
    $ret += $actual_sequences_returned;
    print OUT "$efetch_out";
    print STDERR "Fetched $ret\n";
}
close OUT;

rename("tmp.$organism.fa", "$organism.fa");