I am analyzing NGS (WGS) data, and I am looking for possible ways to reduce the level of "noise" and effectively filter false positives and low level DNA contamination. What I have noticed on alignment and sequence assembly is that small DNA fragments seem to be "over-amplified" and are skewing the overall read-count and identification of the true microbial community. I would like to add a setting to the script that caps the number of reads that may align to a single k-mer, with both values (# of reads and length of k) user manipulative. I'm not sure if there is a feature of BBMap that allows me to do this or one that I am not aware of. Any input is greatly appreciated.
I'm not sure if this addresses your issue, but you could do this:
1) Gather a set of kmers that are problematic. If this is because the count is too high, you can do so with kmercountexact.sh, which will print all kmers occurring in your dataset above some count threshold.
2) Filter out the reads containing those kmers with bbduk.sh, or else mask them in the reference so nothing will map to them.
Alternatively you can normalize the data to strictly downsample reads with an apparent depth above a certain value, based on kmer-counts, without actually getting rid of all of them. For example, it could leave reads with depth up to 100x untouched, but reduce things with 2000x coverage down to 100x (e.g. bbnorm.sh in=reads.fq out=normalized.fq target=100 mindepth=0)
Both of these approaches are kind of brute-force-ish; it would be better if you could identify specific artifacts and remove them instead. Could this be synthetic sequence, or a virus, or ribosomal DNA?
Thanks so much for the suggestions.
I will look into both of these - I am familiar with kmercountexact.sh, but I have not yet attempted to filter them.
I do not believe these are synthetic sequences or viruses - though I have yet to pull the reads and manually BLAST them. Adapter trimming has been applied and the taxonomic ID software and database I am currently using appear to sufficiently call synthetic construct and viruses as well as estimated %rRNA. I will revisit, though.
Again, thank you!