I have a large metagenomic RNA-seq dataset that I am trying to assemble to find viral sequences, but it is too large for my hardware (52 GB RAM). When I BLAST reads I can see a lot of bacterial contamination from many different species. I want to filter out all the bacterial reads so that I can assemble. Ideas?
Download all bacterial genomes from RefSeq and try to align to them with Bowtie (it will take a long time; a rough filtering sketch is below). Also, since when have the compressed RefSeq bacterial .fna files reached 72 GB when combined?!? The last all.bacteria.gz file in the RefSeq archive, from 2015, is 2.7 GB...
Somehow condense all the bacterial genomes into a non-redundant set, then align?
Other ideas?
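For the first idea, the filtering step itself is straightforward once an index exists. A minimal sketch, assuming paired-end FASTQ input and a Bowtie2 index already built with bowtie2-build; every file name and the thread count below are placeholders:

```python
# Sketch: keep only read pairs that do NOT align to a bacterial Bowtie2 index.
# Assumes bowtie2 is on PATH; the index prefix comes from something like
#   bowtie2-build refseq_bacteria.fna refseq_bacteria
import subprocess

subprocess.run(
    [
        "bowtie2",
        "-x", "refseq_bacteria",            # index prefix (placeholder)
        "-1", "reads_1.fastq",              # mate 1 (placeholder)
        "-2", "reads_2.fastq",              # mate 2 (placeholder)
        "--un-conc", "nonbacterial.fastq",  # pairs that fail to align concordantly
        "-S", "/dev/null",                  # discard the alignments themselves
        "-p", "8",                          # threads
        "--very-sensitive",
    ],
    check=True,
)
# bowtie2 writes nonbacterial.1.fastq / nonbacterial.2.fastq,
# which then go to the assembler.
```

Whether an index over all of current RefSeq bacteria even fits in 52 GB of RAM is a separate question, which is part of why condensing the reference to a non-redundant set first might matter.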
Have you checked Kraken or Kaiju to do the binning? You can even use MG-RAST to do the taxonomic classification and then download only the viral reads.
Kaiju will work with only 50 GB of RAM. You can also use the Kaiju web server and upload your reads there for taxonomic classification.
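If you run Kaiju locally, a minimal sketch of classifying the reads and then pulling out the names of the unclassified ones (where anything novel would end up) might look like this; the nodes.dmp and .fmi files come from kaiju-makedb, and every path and the thread count are placeholders:

```python
# Sketch: classify reads with Kaiju, then collect names of unclassified reads
# as candidates for novel (non-bacterial) sequences.
import subprocess

subprocess.run(
    [
        "kaiju",
        "-t", "nodes.dmp",            # NCBI taxonomy nodes (from kaiju-makedb)
        "-f", "kaiju_db_refseq.fmi",  # Kaiju database index (placeholder)
        "-i", "reads_1.fastq",
        "-j", "reads_2.fastq",
        "-o", "kaiju.out",
        "-z", "8",                    # threads
    ],
    check=True,
)

# Kaiju output is tab-separated: status (C/U), read name, taxon ID, ...
unclassified = set()
with open("kaiju.out") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0] == "U":
            unclassified.add(fields[1])

with open("unclassified_read_names.txt", "w") as out:
    for name in sorted(unclassified):
        out.write(name + "\n")
# These names can then be pulled from the original FASTQs with e.g. seqtk subseq.
```

Reads classified to viral taxa could be kept as well by checking the taxon ID column against the NCBI taxonomy, but that needs the taxonomy dump files on hand.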
If you are not looking for novel viral sequences then perhaps doing the binning in reverse may be better. Get the RefSeq viral sequences from here and then use BBSplit to bin the reads into viruses and everything else.
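A minimal sketch of that reverse binning, assuming bbsplit.sh (BBTools) is on the PATH and the RefSeq viral FASTA has already been downloaded; all file names are placeholders:

```python
# Sketch: bin reads against RefSeq viral sequences with BBSplit (BBTools).
import subprocess

subprocess.run(
    [
        "bbsplit.sh",
        "ref=refseq_viral.fa",      # RefSeq viral FASTA (placeholder)
        "in1=reads_1.fastq",
        "in2=reads_2.fastq",
        "basename=binned_%.fastq",  # one output per reference; % = reference name
        "outu1=unbinned_1.fastq",   # read pairs matching nothing in the reference
        "outu2=unbinned_2.fastq",
    ],
    check=True,
)
# binned_refseq_viral.fastq holds the virus-like reads; the unbinned files are the rest.
```

BBSplit keeps its reference index in memory, but the RefSeq viral set is small, so that should be fine on a 52 GB machine.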
Thanks, and to clarify: I am looking for novel viral sequences, so I want to filter by close alignment to known bacterial species. Assembling part of the data and then aligning the original reads to the identified contaminants could work, but there are too many different bacterial contaminants for that to be practical.
If novel sequences are the requirement then slogging through multiple rounds of alignment/assembly may be the order of the day. As @Brian noted below, this process is going to be fraught with hurdles and you are likely to hit many false positives along the way. I don't see an easy solution.