I want to create a custom BLAST db with all viruses available in RefSeq, but I don't know which source files to use. My starting point is: ftp://ftp.ncbi.nih.gov/genomes/Viruses/
From research I concluded that I might need the all.fna.tar.gz file, since it supposedly contains nucleotide information for all viruses in RefSeq. However, it turned out that, for example, Bluetongue_virus_uid14938 doesn't have an entry in this archive, BUT it does have a directory and the corresponding files if I download the all.gbk.tar.gz archive.
So my question is: which archive (file type) should I use in order to create the most complete database of the viruses that are in RefSeq? Should I use the fna/ffn files, just concatenate them and send them to makeblastdb, OR should I manually parse the .gbk files and create FASTA files out of them - basically extracting the respective sequence from each .gbk and rebuilding the header?
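For the "concatenate and send to makeblastdb" route, here is a minimal Python sketch, assuming the archive has already been downloaded locally and that BLAST+ is installed. The function names, the output file names and the database title are hypothetical; only the makeblastdb flags -in, -dbtype, -out and -title are taken from BLAST+ itself. It does not address the missing Bluetongue entries: whatever is absent from the archive will be absent from the db.

```python
import subprocess
import tarfile


def concat_fasta_from_tar(archive_path, out_fasta, suffixes=(".fna", ".ffn")):
    """Append every FASTA member of a .tar.gz archive to one output file.

    Returns the number of FASTA header lines (records) written.
    """
    n_records = 0
    with tarfile.open(archive_path, "r:gz") as tar, open(out_fasta, "wb") as out:
        for member in tar:
            # Skip directories and anything that is not a FASTA file.
            if not member.isfile() or not member.name.endswith(suffixes):
                continue
            data = tar.extractfile(member).read()
            if not data:
                continue
            n_records += sum(1 for line in data.splitlines() if line.startswith(b">"))
            out.write(data)
            # Make sure the next record starts on its own line.
            if not data.endswith(b"\n"):
                out.write(b"\n")
    return n_records


def makeblastdb_command(fasta, db_name, title="custom viral db"):
    """Build the makeblastdb call for a nucleotide database (title is a placeholder)."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl",
            "-out", str(db_name), "-title", title]


def build_db(archive_path, fasta="viruses.fna", db_name="viruses_db"):
    """Concatenate the archive and run makeblastdb (requires BLAST+ on PATH)."""
    n = concat_fasta_from_tar(archive_path, fasta)
    subprocess.run(makeblastdb_command(fasta, db_name), check=True)
    return n
```

Calling build_db("all.fna.tar.gz") would write viruses.fna and then the BLAST db files next to it. Comparing the returned record count against the number of directories in all.gbk.tar.gz is a quick way to see how many entries the fna archive is missing.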
If you follow Pierre's advice you do not need to "go over each of the records" - simply choose "send to file" and save as fasta. Also, if you require sequences from other taxa (fungi, bacteria), you should edit your question, since originally you mentioned only viruses.
How can I do this on Linux? I mean from the command line.
Do what - just download, or search and download? The former: use wget + link to file in FTP site. The latter: read about EUtils.
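For the "search and download" case, a rough Python sketch of the EUtils route, using only the standard library. It runs esearch with the history server switched on and then pulls the hits as FASTA in batches with efetch. The search term in the usage note, the batch size and the function names are assumptions of mine, not something confirmed in this thread; check the term in the Entrez web interface first and compare the record count.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore"):
    """URL for a search that stores its hits on the NCBI history server."""
    params = {"db": db, "term": term, "usehistory": "y",
              "retmode": "json", "retmax": 0}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(webenv, query_key, retstart=0, retmax=500, db="nuccore"):
    """URL for one batch of FASTA records from a stored search."""
    params = {"db": db, "query_key": query_key, "WebEnv": webenv,
              "rettype": "fasta", "retmode": "text",
              "retstart": retstart, "retmax": retmax}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def download_fasta(term, out_path, batch=500, pause=0.4):
    """Search, then append all hits to out_path as FASTA. Returns the hit count."""
    with urlopen(esearch_url(term)) as resp:
        result = json.load(resp)["esearchresult"]
    count = int(result["count"])
    with open(out_path, "wb") as out:
        for start in range(0, count, batch):
            url = efetch_url(result["webenv"], result["querykey"], start, batch)
            with urlopen(url) as resp:
                out.write(resp.read())
            time.sleep(pause)  # stay under NCBI's request rate limit
    return count
```

Something like download_fasta("Viruses[Organism] AND srcdb_refseq[PROP]", "viral_refseq.fna") would then give one FASTA file that can go straight into makeblastdb. The same function should work for bacteria or fungi by changing the organism part of the term, although for sets that large the FTP release files are probably the more practical source.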
@neilfws the "Send to" part was crucial - up until now I've been working with the FTP files. Thanks
Thanks, but this is not really a workable solution, because I need to do this in a fashion which would facilitate bulk additions, and not just of 10-15 viruses.
What do you mean by "in a fashion which would facilitate bulk additions..."? 3946 records: what is missing in this Entrez dataset?
By "bulk addition" I mean I have to be able to automate it somehow - going over each of those records and manually downloading the entries is just not an option. Programmatically working with NCBI's FTP is easy enough; the thing is there are SO MANY files/releases that I don't really know which one contains the information I need. Is there some place which explains WHAT is held WHERE on the NCBI FTP? I've read the READMEs for the various directories but it is still not clear to me.
Basically I want to create a single BLAST db which has ALL viral RefSeq, all bacterial RefSeq and all fungal RefSeq - where can I get those respective dbs? I thought that ftp://ftp.ncbi.nih.gov/genomes/<whatever I'm looking for> would suffice but as it turns out - it doesn't.