I've been using BBMerge
recently to address a very specific problem:
- I am sequencing pooled short DNA molecules (< 400bps) using paired end reads (average length ~ 230 bps post trimming)
- Each molecule can be assumed to be different (i.e. contains sequence differences - substitutions & indels - with respect to a reference sequence)
- I am interested in detecting these per-molecule differences, as well as the global (all reads) profile of sequence differences
We sequence these short molecules very deeply - around 500K reads per library (of pooled short DNA molecules). Because the molecules are short and the reads relatively long, we are able to recover the full sequence of the molecule, and reads generally overlap enough that we can merge them using BBMerge
.
Initially, to exclude any sequencing errors, I merged reads with the pfilter=1
options - to make sure that only perfect overlaps are allowed. This works well, but we end up throwing away a lot of reads.
As a second approach, I have started using the error correction (ecct
) module of BBMerge
:
bbmerge-auto.sh \
in1=$r1 \
in2=$r2 \
adapters=$params.adapters \
outc=${sample_id}_stats.txt \
showhiststats=t \
maxstrict=t \
out=${sample_id}_merged.fq.gz \
outu=${sample_id}_unmerged.fq.gz \
qtrim=t \
rem k=75 \
extend2=1 \
iterations=1 \
ecct
Here, I am only interested in correcting errors in overlapping portions of the reads, not maximising the % reads merged by extending, so I use extend2=1, iterations=1
Again, this seems to work well, in that I see many more merged reads. However, I notice a strikingly different "error rate" in the merged error corrected molecules than I do in either raw (unmerged) reads, or merged reads (limiting to perfect overlaps pfilter=1
).
Notably, I see almost no substitutions in the error corrected reads, which makes me wonder what exactly is going on. What we have observed before is that there are sequences differences distributed across the length of each molecule, and between different molecules, so this makes me thing that the error correction is somehow "correcting" errors with respect to the reference sequence, rather than the read pair. Although I can't see how this is possibly the case!
My understanding is that the error correction is simply removing sequencing errors by examining the overlapping region of a read pair, and taking the highest quality base as a consensus for each position. Is this the case?
Does anyone have any suggestions / insight into this?
Thanks!