Hi, below is my "bbduk" command to remove bacterial reads from human DNA reads. The problem is that mapping percentage to the human genome is decreasing as it also removes bacterial reads. Is there a way around to make it more stringent without reducing mapping % to human genome, currently I think it's removing some of the human reads as well. Or any other tools that are more effective in removing contaminating genome from human reads? Thank you!
bbduk.sh -Xmx80g -Xms50g ref=$contaminants ordered=t k=35 hdist=0 ow=t in=fastq1 in1=fastq1 out=clean_fastq1 out2= clean_fastq2 outm=bad.fastq1 outm2=bad.fastq2
Hi Genomax, only a few samples have that issue. I determined it using kraken. There is same amout of contamination in all of them (~30%) However in my experience, removing bacterial reads may give better downstream results. I have used bbsplit in the past, as you suggested, but it gave me even less number of reads aligning to human genome than that were mapping using bwa. My strategy is to remove these contaminating reads and map them to human genome later. Do you suggest I use bbmap to align these reads to human genome and keep those that are aligning to further map using bwa? Could I align to bacterial genome and remove those reads instead? Thanks as always! :)
To me that indicates a slip-up somewhere on the experimental side (either sample collection or some subsequent processing). Not sure how that is going to affect your end result.
As I wrote above:
When you are dealing with short reads there is no way to know for sure if the read came from a particular genome if it matches equally well to human genome and something else.
When using
bbsplit.sh
If a read matches both human genome and bacterial genome then you can either discard it completely (being stringent) or keep it in both pools (ambiguous2=
option). You can use human genome + pseudomonas genome and decide on one of the strategies withbbsplit.sh
. Only reads you could safely discard are ones that bin to the Pseudomonas genome alone.BTW: Results you get from
bbmap
are going to be as good asbwa
so there is no need to do the mapping twice.Thanks GenoMax, would you suggest using the adaptor-trimmed reads (in my case nextera adaptors) or the raw reads (with adaptors) for bbsplit? Also, is it a good idea to use ambiguous=all and ambiguous2=all ? , since ambiguous are the reads with multiple top-scoring mapping locations.
Adapter trimmed reads are fine.
ambiguous2=
decides what happens to reads that multi-map across genomes. As I said above you need to decide if you want to keep them in both or toss them totally or use one best site.ambiguous=
does a similar thing for reads that are multi-mapping within one genome.