We are benchmarking long-read aligners, which requires simulated PacBio HiFi reads. We understand that PBSIM3 can simulate the raw multi-pass subreads, which PacBio's ccs software then turns into the final HiFi reads. We generated HiFi reads using this approach.
However, defining the "ground truth" for the output HiFi reads is a challenge. The multiple subreads contributing to a single HiFi read can have slightly different base-by-base alignments (caused by their different indels) to the chromosome from which they were extracted. ccs resolves these differences through consensus, but it doesn't report a base-by-base alignment for its output. Thus there isn't a single ground truth that allows an unambiguous, base-by-base comparison against the mapping result for that HiFi read.
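To make the ambiguity concrete, here is a toy sketch in Python. The CIGAR strings and coordinates are made up for illustration and are not actual PBSIM3 or ccs output, and `read_to_ref` is a hypothetical helper, not part of either tool.

```python
import re

def read_to_ref(cigar, ref_start):
    """Map each read base to a reference coordinate (None for inserted bases)."""
    pos, coords = ref_start, []
    for length, op in re.findall(r"(\d+)([MIDX=])", cigar):
        length = int(length)
        if op in "M=X":            # aligned bases consume read and reference
            coords.extend(range(pos, pos + length))
            pos += length
        elif op == "I":            # inserted bases consume read only
            coords.extend([None] * length)
        elif op == "D":            # deleted bases consume reference only
            pos += length
    return coords

# Two hypothetical subreads of the same template, both covering
# reference positions 100..106 but carrying different simulated indels.
subread_a = read_to_ref("4M1I3M", 100)  # one inserted base
subread_b = read_to_ref("3M1D3M", 100)  # one deleted base

print(subread_a)
print(subread_b)
```

Both subreads span the same reference interval, yet their fourth base maps to different reference positions. After consensus, it is unclear which of these per-base coordinates (if either) the corresponding HiFi base should inherit.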
We then did a literature review. Recent journal articles that benchmark long-read aligners either used the ccs mode of PBSIM1 (LRA) or ran PBSIM2 in its CLR (Continuous Long Read) mode (Winnowmap2 and BLEND) to simulate PacBio reads. We haven't found any article that used PBSIM2 or PBSIM3 to generate subreads, then ran ccs to generate HiFi reads, and still had the ground truth.
Our aim is to accurately simulate PacBio HiFi reads, which have distinct characteristics and error profiles compared to CLR reads. For this, we still hope to use the most realistic models available for subread generation, those in PBSIM3, and then create HiFi reads with ccs. Is there any way to obtain ground truth for these HiFi reads?