Hi All,
I'm trying to use BBduk to find and filter exact 18-mer matches in a contaminant fasta file from a set of input reads. BBduk reports that matches exist, however when I examine the reads that are supposed to include one or more kmers, I cannot find the matching kmer (or its reverse complement) in many of the reads.
So I'm trying to better understand what BBduk is doing in hopes that I can figure out why more results are not what I expect.
Here is a toy example that illustrates my confusion.
Input fasta (just a single string of 18 nucleotides for illustration purposes):
>kmer1
CTGGGAGACACCGCATGG
Input fastq file:
@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAACCGCTAGGGGAGCTAGACGAATGGCAATACTGGGAGATACCGCATGGGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATACTGGGAGATACCGCATGGGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/
My BBduk command:
bbduk.sh in=test.fq ref=one_kmer.fa hdist=0 k=18 stats=test.stats out=test.clean.fq kmask=lc
In this case, I have set hdist=0
because I think this will require exact kmer matches, and I have set kmask=lc
to mask print the matching sequence in the output fastq file in lowercase so that it's easier to see what BBduk is considering a match.
Here is the output of the stats file confirming that BBduk found a match:
#File test.fq
#Total 1
#Matched 1 100.00000%
#Name Reads ReadsPct
kmer1 1 100.00000%
And here is the masked fastq output:
@NB551611:70:HGNJ5AFX5:1:11211:6538:6918 2:N:0:ACAGTG
AAAAGGATCTAGCATAGGGAAGATGTTCGAAGCAACCGCTAGGGGAGCTAGACGAATGGCAATActgggagataccgcatggGACTTCGGATCGATAGGGGAGCTAGACGAATGGCAATActgggagataccgcatggGACTTCGGAGATA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEAEEEEEEEE/<//////A//<AEEEA/<A/6<//A/////</AEA<E/E/<</A6AE/A66/A/E<EE/
There are two reported matches in this read, but it's apparent to see that neither is actually an exact match to my input kmer. My input "reference" kmer has a "C" at the 9th position, while the reported matching kmers both have a "T" at that position.
So I don't understand why there is a reported match here. Does hdist=0
actually allow mismatches?
Does anybody have any insights?
Thanks! Dave
Thanks a lot, GenoMax!
I had completely overlooked the
maskmiddle
default setting, and that was what was throwing me off. I appreciate the help!