Question

BBMerge / Tadpole error correction

0

Entering edit mode

3.3 years ago

bio23 • 0

I've been using BBMerge recently to address a very specific problem:

I am sequencing pooled short DNA molecules (< 400bps) using paired end reads (average length ~ 230 bps post trimming)
Each molecule can be assumed to be different (i.e. contains sequence differences - substitutions & indels - with respect to a reference sequence)
I am interested in detecting these per-molecule differences, as well as the global (all reads) profile of sequence differences

We sequence these short molecules very deeply - around 500K reads per library (of pooled short DNA molecules). Because the molecules are short and the reads relatively long, we are able to recover the full sequence of the molecule, and reads generally overlap enough that we can merge them using BBMerge.

Initially, to exclude any sequencing errors, I merged reads with the pfilter=1 options - to make sure that only perfect overlaps are allowed. This works well, but we end up throwing away a lot of reads.

As a second approach, I have started using the error correction (ecct) module of BBMerge:

bbmerge-auto.sh \
      in1=$r1 \
      in2=$r2 \
      adapters=$params.adapters \
      outc=${sample_id}_stats.txt \
      showhiststats=t \
      maxstrict=t \
      out=${sample_id}_merged.fq.gz \
      outu=${sample_id}_unmerged.fq.gz \
      qtrim=t \
      rem k=75 \
      extend2=1 \
      iterations=1 \
      ecct

Here, I am only interested in correcting errors in overlapping portions of the reads, not maximising the % reads merged by extending, so I use extend2=1, iterations=1

Again, this seems to work well, in that I see many more merged reads. However, I notice a strikingly different "error rate" in the merged error corrected molecules than I do in either raw (unmerged) reads, or merged reads (limiting to perfect overlaps pfilter=1).

Notably, I see almost no substitutions in the error corrected reads, which makes me wonder what exactly is going on. What we have observed before is that there are sequences differences distributed across the length of each molecule, and between different molecules, so this makes me thing that the error correction is somehow "correcting" errors with respect to the reference sequence, rather than the read pair. Although I can't see how this is possibly the case!

My understanding is that the error correction is simply removing sequencing errors by examining the overlapping region of a read pair, and taking the highest quality base as a consensus for each position. Is this the case?

Does anyone have any suggestions / insight into this?

Thanks!

reads merging MiSeq ngs • 1.1k views

ADD COMMENT • link updated 2.7 years ago by hopf ▴ 10 • written 3.3 years ago by bio23 • 0

score 0 · Answer 1 · 2022-04-18

I think you want to be using ecco (error correction with overlaps), rather than ecct. Have you tried that? ecct uses tadpole for error correction, which is based on k-mers rather than consensus. My understanding of how it works suggests it would produce the results you are seeing if you have a large number of similar variant sequences.