I would like to visit the issue of novel SV junction finding and associated MAPQ calculation,
extending discussions here:
https://www.biostars.org/p/9551217/
https://www.biostars.org/p/9581263/
As I understand its operation, I think vg giraffe
output could be improved if it
could take non-overlapping read alignments into account, i.e., to make
a distinction between secondary and supplemental alignments (split reads) in a similar
fashion as bwa, minimap2, and GraphAligner. Can it?
Secondary Alignment #1
desired default behavior: suppress lower quality path2 alignment, adjust MAPQ as needed
current default behavior: same
path1 --------------------
||||||||||||||||||||
read --------------------
||| ||| || |||| |
path2 --------------------
Secondary Alignment #2
desired default behavior: pick one equal alignment, set MAPQ to 3 (phred 0.5)
current default behavior: same
path1 --------------------
||||||||||||||||||||
read --------------------
||||||||||||||||||||
path2 --------------------
Supplemental Alignment, a novel SV junction not found in graph
desired default behavior: report both alignments with high MAPQ
current default behavior: only one alignment reported with low MAPQ
path1 --------------------
||||||||||
read --------------------
||||||||||
path2 --------------------
Supplemental Alignment, as above
desired multimaps behavior: same as above
current multimaps behavior: two alignments typically reported, with low MAPQ
path1 --------------------
||||||||||
read --------------------
||||||||||
path2 --------------------
For the split/supplemental alignments, some tolerance for junction microhomology should be tolerated before rejecting ~equal scoring alignments as secondary and reducing MAPQ.
My rationale:
graph alignment is still superior to linear reference alignment even when a goal is finding novel junctions (not genotyping known SVs).
people transitioning from linear to graph aligners are used to the distinction between secondary and supplemental alignments for good reason - they are distinct concepts based on overlap assessment.
as I drew them, there is high confidence that the mapping assignment is correct in all cases except Secondary #2; the Supplemental alignments are unambiguous and should retain high MAPQ.
"short" reads aren't always so short any more, my work is using >300bp reads where many novel junctions can be detected (and
vg girafe
has some long read support, although I have not used it yet).I can probably recover most alignment information using
--max-multimaps
, but must write my own parser to distinguish between secondary and supplemental and to adjust MAPQ, which could be handled during alignment.the additional processing should be minimal since a simple check of most primary alignments will quickly reveal them to be end-to-end, so a next-best alignment must be secondary; only a small fraction of reads would need extended overlap parsing.
I'm happy to be taught if there are ways to handle split alignment I am not understanding!
The exact behavior of MAPQ on the primary reported alignment vs. a next-best supplemental alignment can be variable - sometimes it stays high when the junction is not in the center of the read, but it can be low when the junction is central, and the supplemental MAPQ is always set to 0, making MAPQ unreliable.
I agree that this behavior would be beneficial, but
vg giraffe
currently does not support it. I think it's technically feasible, but it would take some focused algorithm development.