Hi all,
this one is a (I guess) tricky question...
RNA virus discovery from metagenome/metatranscriptome datasets (mostly from environmental samples) is particularly difficult because RNA virus genomes are VERY DIVERGENT and often have little detectable similarity to anything in the reference sequence databases.
Can you recommend a "typical" protocol for this?
I have found 2 "versions" so far:
#FIRST PROTOCOL#
- Assemble reads with Trinity or metaSPAdes.
- Do tBLASTn with the generated contigs/scaffolds against a database made of RNA virus proteins (ssRNA and dsRNA viruses). Use an e-value cutoff of <= 1e-3.
- All candidate contigs screened by the previous step are queried against the NCBI RefSeq db using BLASTx.
- Only contigs with topmost hits to viruses are kept.
- Binning to distinct viral groups according to their best BLAST hits.
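The "only contigs with topmost hits to viruses" step could be sketched roughly as below. This is only an illustration, not the protocol's actual code: it assumes the BLASTx results were exported as a tab-separated table with the columns contig, subject, e-value, bit score and superkingdom (that layout and the function name `contigs_with_viral_top_hit` are my assumptions), and it treats the highest bit score as the "topmost" hit.

```python
def contigs_with_viral_top_hit(table_text, max_evalue=1e-3):
    """Return contigs whose best-scoring hit (by bit score) is viral.

    Assumed (hypothetical) column layout, tab-separated:
    contig, subject, evalue, bitscore, superkingdom
    """
    best = {}  # contig -> (bitscore, superkingdom) of the best hit so far
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        contig, _subject, evalue, bitscore, kingdom = line.split("\t")[:5]
        if float(evalue) > max_evalue:
            continue  # hit is weaker than the e-value cutoff
        score = float(bitscore)
        if contig not in best or score > best[contig][0]:
            best[contig] = (score, kingdom)
    return sorted(c for c, (_, k) in best.items() if k == "Viruses")


# Made-up example rows, for illustration only.
example_hits = "\n".join([
    "contig1\tYP_1\t1e-30\t150\tViruses",
    "contig1\tWP_1\t1e-10\t80\tBacteria",
    "contig2\tWP_2\t1e-50\t200\tBacteria",
    "contig2\tYP_2\t1e-5\t60\tViruses",
    "contig3\tYP_3\t0.5\t30\tViruses",
])
viral_contigs = contigs_with_viral_top_hit(example_hits)
```

Here contig2 is dropped because its best hit is bacterial, and contig3 is dropped because its only hit fails the e-value cutoff.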
#SECOND PROTOCOL#
- Assemble reads with Trinity or metaSPAdes.
- Do BLASTx with the generated contigs/scaffolds against a database made of RNA virus proteins (ssRNA and dsRNA viruses). Use an e-value cutoff of <= 1e-5.
- All candidate contigs are converted into proteins with Prodigal.
- The proteins are queried against CDD (0.01 cutoff) to look for conserved domains.
- Keep the contigs containing domains of RNA-dependent RNA polymerases or reverse transcriptases.
- Contigs containing those domains are queried against the NCBI nr db using BLASTx to discard "false positives". Only contigs with hits to viruses are kept.
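The domain-filtering step of this second protocol might look something like the sketch below. It assumes the CDD results were flattened into a tab-separated table of protein ID, domain description and e-value (a hypothetical layout), that the protein IDs follow Prodigal's `<contig>_<n>` naming, and that a simple keyword match on the description is good enough. The keyword list is my own guess and would need checking against the real CDD descriptions.

```python
# Hypothetical keyword list; real CDD descriptions should be inspected first.
POLYMERASE_KEYWORDS = (
    "rna-dependent rna polymerase",
    "rdrp",
    "reverse transcriptase",
)


def contigs_with_polymerase_domain(table_text, max_evalue=0.01,
                                   keywords=POLYMERASE_KEYWORDS):
    """Return contigs with at least one protein carrying an RdRp/RT domain.

    Assumed (hypothetical) column layout, tab-separated:
    protein_id, domain_description, evalue
    """
    keep = set()
    for line in table_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        protein_id, description, evalue = fields[:3]
        if float(evalue) > max_evalue:
            continue  # domain hit is weaker than the cutoff
        if any(k in description.lower() for k in keywords):
            # Prodigal names proteins "<contig>_<n>", so strip the suffix.
            keep.add(protein_id.rsplit("_", 1)[0])
    return sorted(keep)


# Made-up example rows, for illustration only.
example_domains = "\n".join([
    "k141_7_1\tRNA-dependent RNA polymerase; RdRP_1\t1e-20",
    "k141_7_2\tviral coat protein\t1e-8",
    "k141_9_1\tReverse transcriptase (RT) domain\t0.5",
    "k141_12_3\tRT_like: reverse transcriptase family\t0.001",
])
polymerase_contigs = contigs_with_polymerase_domain(example_domains)
```

The surviving contigs would then go into the final BLASTx against nr.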
Thoughts?
Thanks very much in advance!
I am currently trying to clean the set of reads prior to assembly (with Trinity, also trying Oases). I run Centrifuge against nt and keep only the reads that are either classified as viral (very few), unclassified, or not classified as the host. I have found a couple of viral transcripts via BLASTx against nr, but the sequence divergence is a pain in the ass.
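For what it's worth, that read filter boils down to "drop anything assigned to the host", which can be sketched as below. This assumes Centrifuge's per-read tab-separated output with a header line containing `readID` and `taxID` columns, and a caller-supplied set of host taxonomy IDs (`host_taxids` is a hypothetical parameter). Reads assigned only at a higher rank of the host lineage are not caught unless those taxIDs are added to the set too.

```python
def reads_to_keep(centrifuge_text, host_taxids):
    """Return read IDs (in input order) with no assignment to a host taxID.

    Unclassified reads (taxID 0), viral reads and any other non-host reads
    are all kept. A read with several assignments is dropped if any one of
    them is a host taxID.
    """
    lines = centrifuge_text.splitlines()
    if not lines:
        return []
    header = lines[0].split("\t")
    read_col = header.index("readID")
    tax_col = header.index("taxID")

    seen = {}           # dict keeps first-seen order of read IDs
    host_reads = set()  # reads with at least one host assignment
    for line in lines[1:]:
        if not line.strip():
            continue
        fields = line.split("\t")
        read_id, taxid = fields[read_col], int(fields[tax_col])
        seen.setdefault(read_id, None)
        if taxid in host_taxids:
            host_reads.add(read_id)
    return [r for r in seen if r not in host_reads]


# Made-up example; 9606 is used purely as an example host taxID.
example_classification = "\n".join([
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches",
    "r1\tNC_000001.11\t9606\t4225\t0\t80\t100\t1",
    "r2\tunclassified\t0\t0\t0\t0\t100\t1",
    "r3\tNC_002016.1\t11320\t3600\t0\t75\t100\t1",
    "r4\tNC_000913.3\t562\t900\t900\t45\t100\t2",
    "r4\tNC_000001.11\t9606\t900\t900\t45\t100\t2",
])
kept_reads = reads_to_keep(example_classification, host_taxids={9606})
```

The kept IDs can then be used to pull the matching records out of the FASTQ files before assembly.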