Question

How to create a BLAST database of viruses?

2

10.6 years ago

bell ▴ 20

Hi all,

For a metagenomic project I want to make a BLAST database of viruses, because I don't want to blast my reads against the entire nt database. But I don't know how to build it. Downloading the result of a query like viruses[organism] from the NCBI nucleotide database is impossible because of the size of the data. Maybe there is a solution that uses the taxonomy files to extract sequences from the nt FASTA file available on the NCBI FTP site?

Could anyone suggest a solution?

Thank you very much!

blast sequence • 11k views

ADD COMMENT • link updated 10.6 years ago by Carlos Borroto ★ 2.1k • written 10.6 years ago by bell ▴ 20

Ram · Answer 1 · 2014-04-18

4

10.6 years ago

Pierre Lindenbaum 164k

Download the VRL division of GenBank (ftp://ftp.ncbi.nih.gov/genbank/gbvrl*.seq.gz) and index it with BLAST.

ADD COMMENT • link 10.6 years ago by Pierre Lindenbaum 164k
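For readers who want the indexing step spelled out, here is a minimal sketch in Python that only assembles a `makeblastdb` call for a nucleotide FASTA file. It assumes the gbvrl files have already been converted to FASTA; the file name `gbvrl.fasta`, the database name `viruses` and the title are placeholders, not something given in the answer.

```python
import shlex


def makeblastdb_command(fasta_path, db_name, title="GenBank VRL division"):
    """Build the argument list for indexing a nucleotide FASTA file."""
    return [
        "makeblastdb",
        "-in", fasta_path,   # FASTA converted from the gbvrl*.seq.gz files
        "-dbtype", "nucl",   # nucleotide database
        "-out", db_name,     # base name of the database files
        "-title", title,
    ]


# Placeholder names, chosen for illustration only.
cmd = makeblastdb_command("gbvrl.fasta", "viruses")
print(shlex.join(cmd))
# To actually run it (requires BLAST+ on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting problems.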

0

Hi Pierre, I am doing something similar at the moment, and this looks like a very good solution. We should mention that the data is in GenBank format and needs to be converted to FASTA before making a BLAST database. However, when I convert with BioPerl SeqIO, the FASTA headers look like this:

>AB000048 Feline panleukopenia virus gene for nonstructural protein 1, complete cds, isolate: 483.

So there are no GIs here, but they would be needed to assign taxids for metagenomics. Is there a quick fix to keep the GI?

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Michael 55k

2

@Michael use awk?

$ curl -s "ftp://ftp.ncbi.nih.gov/genbank/gbvrl27.seq.gz" | gunzip -c | awk -f jeter.awk

>gi:422089830|Hepatitis C virus isolate V2401 NS5AB replicase gene, partial cds.
tggattaacgaggactgctccacgccatgctccggctcgtggctaaaggatgtttgggac
tggatatgcacggtgctgtctgatttcagaacctggctccagtccaagctcctgccgcgg
ytaccgggagtccctttcttctcgtgtcaacgtggatataagggagtctggcggggygac
ggcatcatgcaaaccacctgttcatgtggggcacagatcaccggacatgtcaaaaacggc

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Pierre Lindenbaum 164k
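The `jeter.awk` script itself is not reproduced in this thread. As a stand-in, here is a rough Python sketch of the same idea: read GenBank flat-file records and emit FASTA with a `gi:<number>|<definition>` header like the one shown in the output above. It is not Pierre's script. It assumes the GI sits on the `VERSION` line as `GI:<number>`, and it falls back to the accession.version when no GI is present, which may be the case in newer GenBank releases. The function names are made up for illustration.

```python
import gzip


def genbank_to_fasta(handle):
    """Yield (header, sequence) pairs from a GenBank flat-file text stream."""
    definition, version, gi, seq, section = [], "", "", [], None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # End of record: prefer the GI, else fall back to accession.version.
            ident = f"gi:{gi}" if gi else version
            yield f"{ident}|{' '.join(definition)}", "".join(seq)
            definition, version, gi, seq, section = [], "", "", [], None
        elif section == "ORIGIN":
            # Sequence lines: a position number, then blocks of bases.
            seq.append("".join(line.split()[1:]))
        elif line.startswith("DEFINITION"):
            section = "DEFINITION"
            definition.append(line[12:].strip())
        elif line.startswith("VERSION"):
            section = "VERSION"
            fields = line.split()
            version = fields[1] if len(fields) > 1 else ""
            for field in fields[2:]:
                if field.startswith("GI:"):
                    gi = field[3:]
        elif line.startswith("ORIGIN"):
            section = "ORIGIN"
        elif section == "DEFINITION" and line.startswith(" "):
            # Continuation of a multi-line DEFINITION.
            definition.append(line.strip())
        else:
            section = None


def write_fasta(records, out, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        out.write(f">{header}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


# Usage with a locally downloaded file (the path is a placeholder):
#   import sys
#   with gzip.open("gbvrl27.seq.gz", "rt") as handle:
#       write_fasta(genbank_to_fasta(handle), sys.stdout)
```

Records that have no `ORIGIN` section come out with an empty sequence in this sketch, so it is worth checking for those before building a database.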

0

+1 for awk regex tricks :-)

ADD REPLY • link 9.1 years ago by biocyberman ▴ 870

0

what is jeter.awk ?!

ADD REPLY • link 8.9 years ago by Quak ▴ 520

0

It's the awk script above. "jeter" in French means "to throw away" (a name I use for temporary files).

ADD REPLY • link 8.9 years ago by Pierre Lindenbaum 164k

0

Hi, it doesn't seem to work with some sequences: after running the script, some sequences just appear empty.

ADD REPLY • link 7.7 years ago by luisitosrt • 0

0

I am just wondering what the correct number of entries is:

the gbvrl files converted to fasta files contain 1584206 entries
the ncbi query 'Viruses[Organism] NOT cellular organisms[ORGN] NOT AC_000001:AC_999999[pacc]' yields 1741019
when I download this query via efetch I retrieve only 1737392 entries

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Michael 55k
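The three counts above differ by 156813 entries (query vs. gbvrl) and 3627 entries (query vs. efetch download). One way to see which records account for a gap is to compare the identifier sets of two FASTA files. The sketch below assumes both files use the same identifier scheme in the first token of each header; the file names in the usage comment are placeholders.

```python
def fasta_ids(handle):
    """Collect the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def compare_ids(ids_a, ids_b):
    """Return identifiers found only in A and only in B."""
    return ids_a - ids_b, ids_b - ids_a


# Usage (placeholder file names):
#   with open("gbvrl.fasta") as a, open("efetch.fasta") as b:
#       only_gbvrl, only_efetch = compare_ids(fasta_ids(a), fasta_ids(b))
#   print(len(only_gbvrl), len(only_efetch))
```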

Ram · Answer 2 · 2014-04-18

3

10.6 years ago

Peter 6.0k

Related to Pierre's answer, if you want complete virus genomes, there are FASTA files available at ftp://ftp.ncbi.nih.gov/genomes/Viruses/

You can also download complete genomes via NCBI Entrez, but that is more problematic; see http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Peter 6.0k

Ram · Answer 3 · 2014-04-18

I was involved in a project where keeping an updated viral database was key to our success. We went the route recommended by Pierre. It was extremely hard. The files linked by Pierre first need to be downloaded and then converted from GenBank to FASTA format, which is easily doable with any Bio* toolkit (Biopython, BioPerl, BioRuby, etc.) but painfully slow. You also need to remove redundancy or your results will be extremely noisy.
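The answer does not say how redundancy was removed. As one very simple illustration, the sketch below drops records whose sequence is an exact (case-insensitive) duplicate of one already seen. This is only an assumption about what "redundancy" could mean here: near-identical sequences are not handled, and a clustering tool would be needed for those.

```python
import hashlib


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep only the first record for each distinct sequence."""
    seen = set()
    for header, seq in records:
        # Hash the upper-cased sequence so memory use stays small.
        digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
        if digest not in seen:
            seen.add(digest)
            yield header, seq


# Usage (placeholder file name):
#   with open("gbvrl.fasta") as handle:
#       for header, seq in drop_exact_duplicates(read_fasta(handle)):
#           print(f">{header}\n{seq}")
```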

If I had to start over, I would do something smarter: keep a list of GIs known to be from viral sequences and use the blastn option -gilist with the nt/nr databases provided by NCBI. See http://www.ncbi.nlm.nih.gov/books/NBK1763/. This option limits results to hits matching GIs in the provided list. Keeping such a list updated will definitely be easier than maintaining a custom BLAST database.
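A minimal sketch of the `-gilist` approach, assuming you already have the viral GIs in hand (how to collect them is not covered here). It writes one GI per line to a text file and assembles a `blastn` call restricted to that list. The file names, the example GIs and the tabular output format are placeholders chosen for illustration.

```python
def write_gi_list(gis, path):
    """Write unique GIs to a text file, one per line."""
    with open(path, "w") as out:
        for gi in sorted(set(gis)):
            out.write(f"{gi}\n")


def blastn_gilist_command(query, gilist_path, db="nt", out="hits.tsv"):
    """Build a blastn call whose hits are limited to the GIs in gilist_path."""
    return [
        "blastn",
        "-db", db,
        "-query", query,
        "-gilist", gilist_path,  # restrict the search to these GIs
        "-outfmt", "6",          # tabular output
        "-out", out,
    ]


# Usage (placeholder values; running it needs BLAST+ and a local nt database):
#   write_gi_list([422089830, 1769756], "viral.gi")
#   import subprocess
#   subprocess.run(blastn_gilist_command("reads.fasta", "viral.gi"), check=True)
```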