14.2 years ago
Jeremy Leipzig
Which aligners (long and short read) behave differently when parts of a reference/target sequence are "soft-masked", i.e. have portions in lowercase to designate repeat regions?
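For anyone who wants to check how much of their reference is soft-masked before worrying about aligner behaviour, here is a minimal sketch. It assumes the sequence is already loaded as a plain string; the function names `soft_masked_intervals` and `soft_masked_fraction` are made up for this illustration and do not come from any aligner or library.

```python
import re

def soft_masked_intervals(seq):
    """Return 0-based, half-open (start, end) intervals of lowercase runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]

def soft_masked_fraction(seq):
    """Fraction of the sequence written in lowercase (0.0 for an empty string)."""
    if not seq:
        return 0.0
    masked = sum(end - start for start, end in soft_masked_intervals(seq))
    return masked / len(seq)

# Toy sequence: two lowercase (soft-masked) runs inside uppercase sequence.
example = "ACGTacgtacGGCCnnnnTTAA"
print(soft_masked_intervals(example))
print(soft_masked_fraction(example))
```

Note that a lowercase `n` counts as soft-masked here simply because it is lowercase; whether that is what you want depends on how your reference was prepared.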
Masking has never been perfect and probably never will be. It leads to wrongly mapped reads, spurious SNP/indel calls and all sorts of other problems. I cannot think of a single use case where masking leads to better outcomes. Trust me. Do not mask.
Yup. Do not mask. You get the most accurate alignment when you align to what is actually there. What you do not want are reads that really belong to repetitive regions being forced to align to the wrong place because you didn't provide the correct sequence for the read to align to.
bwa does not care about lowercase nucleotides.
What difference would it make, except for paired-end or spliced mappings?
So I assume BWA does not care about lowercase nucleotides?
BWA always uses all bases in alignment. Again, do not mask, unless you want to invite trouble.
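If you are using a tool other than BWA and are unsure whether it treats lowercase bases differently, one way to sidestep the question is to uppercase the reference before indexing. A rough sketch, assuming a standard FASTA layout where header lines start with `>`; the helper name `unmask_fasta_lines` is hypothetical, and this only removes soft-masking (it cannot restore bases that were hard-masked to `N`).

```python
def unmask_fasta_lines(lines):
    """Yield FASTA lines with sequence uppercased and header lines untouched."""
    for line in lines:
        if line.startswith(">"):
            yield line          # keep the header exactly as written
        else:
            yield line.upper()  # lowercase (soft-masked) bases become uppercase

# Toy record: the header keeps its case, the sequence does not.
record = [">chr1 repeat-rich region\n", "ACgtNNac\n"]
print("".join(unmask_fasta_lines(record)), end="")
```

The generator works on any iterable of lines, so it can be fed an open file handle and written straight back out without loading the whole genome into memory.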