Hi all,
I've been testing the alignment of some PacBio reads using pbmm2
and Minimap2
and observed a phenomenon that I can't make sense of.
With both tools, I'm frequently seeing the same reads producing many supplementary alignments (as many as 93 in one case!) in which it appears that many of the same bases of the original read are being reused in the supplementary alignments. For example, here are the CIGAR strings from some pbmm2
alignments:
m84137_250219_010828_s1/255856922/ccs/rev 0 chrA 3219 60 1040S602=1I595=1I931=1X11=3162S
m84137_250219_010828_s1/255856922/ccs/rev 2064 chrA 3219 60 1041S185=1I417=1I1517=3182S
And here are some from a Minimap2
alignment:
m84137_250219_010828_s1/261165604/ccs/fwd 0 chr14 77485520 60 2047S748M4I976M4D459M
m84137_250219_010828_s1/261165604/ccs/fwd 2064 chr14 77485651 60 2178H617M4I976M4D459M
In both cases, it looks like a significant portion of the original read is being re-used in each alignment. In my application, this makes it difficult to accurately quantify the unique read (and base) counts that align to different parts of the genome. (And no, I can't just ignore the supplementary alignments, unfortunately).
As I was trying to figure this out, I came across a section of the pbmm2
GitHub documentation on "repeated matches trimming" that pbmm2
is supposed to perform by default. The documentation states "we see pbmm2 as a blasr replacement and require that a single base may never be aligned twice". The documentation goes on to describe the pbmm2
repeated matches trimming approach as follows: "[w]e align the read, find overlapping query intervals, and trim non-primary alignments in order as provided by minimap2; trimming here means that pbmm2 rewrites the cigar and the reference coordinates on-the-fly."
To me, this seems at odds with what I am observing.
I then created this issue on the Minimap2
page to try to make sense of things. Heng Li told me that in my example, "the overlap length is tiny on the query", that the alignments "are on different strands", and that this "behavior is expected and desired".
To me, the overlap length seems substantial on the query, and this behavior is neither desired nor expected. I assume I'm misunderstanding something but in my thinking, if the same part of a query sequence aligns to multiple places in the reference genome (on either strand), the best-scoring alignment should be called the primary alignment and all others should be labeled secondary alignments.
Can someone help me understand what I'm missing here?