Hi Biostars, I'm currently working on a few pipelines to process and perform various analyses on human RNA-seq data. One of the steps in all of the pipelines is removal of contaminating microbial reads from the input fastq files. Based on recommendations here and elsewhere, I'm using the BBSplit program from BBMap.
My question is in regard to which potential sources of contamination I should be mapping to. Currently, I've downloaded essentially all the microbial RefSeq assemblies (bacterial, archaeal, protozoan, viral, fungal) and concatenated them together into a single "contaminants" fasta file.
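A minimal sketch of that concatenation step, assuming the assemblies are plain or gzipped FASTA files already downloaded locally. The function names `open_maybe_gzip` and `concat_fastas` are hypothetical, not part of BBMap or any RefSeq tooling.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a FASTA file for binary reading, decompressing .gz inputs."""
    path = Path(path)
    return gzip.open(path, "rb") if path.suffix == ".gz" else open(path, "rb")


def concat_fastas(inputs, output):
    """Concatenate FASTA files into `output`; return the number of records written."""
    n_records = 0
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open_maybe_gzip(path) as handle:
                for line in handle:
                    if line.startswith(b">"):
                        n_records += 1  # each header line starts one record
                    out.write(line)
                    last = line
            # Guard against a file that lacks a trailing newline, which would
            # otherwise fuse its last sequence line with the next file's header.
            if not last.endswith(b"\n"):
                out.write(b"\n")
    return n_records
```

With a hypothetical layout of one directory per group, a call such as `concat_fastas(sorted(Path("refseq").glob("*/*.fna.gz")), "contaminants.fa")` would build the combined file, and restricting the glob to fewer groups is one way to shrink the reference.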
However, using all of these genomes makes the analysis take a couple of hours for each sample, and more importantly uses a prohibitively large amount of memory (> 500 GB). I'd like to pare down the number of microbial assemblies I use in this analysis, but I'm not sure where to start.
Are there any "standard" sets of genomes that people typically use when decontaminating fastq data? Alternatively, if anybody has performed this sort of analysis before and has suggestions for which species I should (or shouldn't!) include, I'd love to get your advice.
Thanks!
Dave
Is contamination a real concern? If your analysis pipeline includes mapping to the human genome, most or all contaminants would be filtered out at this stage.
That's a good question. Contamination is not of special concern (this isn't ancient DNA!). I mostly just wanted to be thorough. But yes, all pipelines will involve either mapping to the human reference genome or transcriptome.