Entering edit mode
20 months ago
Dave Carlson
★
2.1k
Hi All,
I'm using BBduk to filter out reads where there is a kmer match to a specific set of contaminant sequences. Here is an example command:
bbduk.sh \
in1=${FQ1} \
in2=${FQ2} \
ref=${contaminant_fasta} \
k=21 \
hdist=1 \
stats=${OUTPUT}/${SAMPLE}_stats.txt \
out1=${OUTPUT}/${SAMPLE}_clean_R1.fastq \
out2=${OUTPUT}/${SAMPLE}_clean_R2.fastq \
outm=${OUTPUT}/${SAMPLE}_contaminant_match.fastq
However, I noticed that there appears to be a small inconsistency between the number of reads that match the contaminant fasta file in the job log vs the number that match the contaminant fasta file in the stats file.
Job log:
Input: 15786272 reads 2234206006 bases.
Contaminants: 82980 reads (0.53%) 11953545 bases (0.54%)
Total Removed: 82980 reads (0.53%) 11953545 bases (0.54%)
Result: 15703292 reads (99.47%) 2222252461 bases (99.46%)
Stats file:
#Total 15786272
#Matched 70789 0.44842%
#Name Reads ReadsPct
Obviously, the difference is not large, but I'm wondering why the read counts and percentages don't match exactly. Has anyone ever experienced this before?
Do you have more than one sequence in contaminant fasta? Can you write out
refstats
file? How many reads end up in theoutm
file (is it 82980 or 70789)?There is only a single sequence in the contaminant fasta. Here is the refstats output:
The number of reads in the
outm
file is 82980--i.e., it matches the log, not the stats file.The default behavior of bbduk is to remove both the forward and reverse reads if either one of them has a kmer match.
So one possibility could be that the stats file is reporting the total number of reads that had a kmer match, while the log file is reporting the total number of reads that were removed (including reads that did not have a kmer match but were removed because its mate did).
That is logical. Stats file gives reads that had a match and the output log gives the total number of reads removed.