Hello,
The default minimum seed length for an exact match is 19 (the bwa-mem2 -k parameter). Now, let's say I have 250 bp paired-end Illumina reads and my genome has 5% heterozygosity. That would give about 12.5 mismatches per read (0.05 x 250), or one mismatch every 20 bases on average, which is about the same as the minimum seed length. Should I also touch the -B parameter (mismatch penalty)?
Would it then be desirable to reduce the -k parameter, to 10 for example, or will bwa somehow notice the heterozygosity and handle it properly? With default parameters I get a mapping rate of ~85% (but it is possible that a significant portion of the reads in my fastq do not belong to my reference genome, because it's a non-model organism and we have no way to effectively eliminate bacteria from our samples). If you ask me "but dude, just try it out": I don't have access to infinite computing resources, and even if the mapping rate increases, it doesn't help me if the increase comes from completely random and meaningless alignments.
What do you think? It's difficult for me to understand how bwa-mem2 handles such a situation internally.
I will appreciate any feedback.
Thanks everyone
BBMap allows shorter seed kmers; default is 13 and you can set it lower (I don't really recommend going below 10 though). It has no trouble with 95% identity.
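To get a rough feel for how seed length interacts with 5% divergence, here is a minimal sketch that estimates the chance a read contains at least one exact-match stretch of a given length. It assumes mismatches are independent and uniformly spread along the read, which real heterozygosity is not, and it is not how bwa-mem2 or BBMap actually seed; the function name prob_exact_seed is made up for illustration.

```python
def prob_exact_seed(read_len, mismatch_rate, seed_len):
    """Probability that a read contains at least one run of >= seed_len
    consecutive matching bases, assuming independent mismatches at a
    uniform per-base rate (a simplifying assumption)."""
    match_rate = 1.0 - mismatch_rate
    # state[r] = probability that the current run of matches has length r
    # and no run of seed_len matches has been seen yet
    state = [0.0] * seed_len
    state[0] = 1.0
    hit = 0.0  # probability that a full-length seed has already occurred
    for _ in range(read_len):
        new_state = [0.0] * seed_len
        for run, prob in enumerate(state):
            if prob == 0.0:
                continue
            # a mismatch resets the current run to zero
            new_state[0] += prob * mismatch_rate
            # a match extends the run; reaching seed_len counts as a hit
            if run + 1 == seed_len:
                hit += prob * match_rate
            else:
                new_state[run + 1] += prob * match_rate
        state = new_state
    return hit


if __name__ == "__main__":
    # Values from the question: 250 bp reads, 5% divergence,
    # seed lengths 19 (bwa-mem2 default), 13 (BBMap default) and 10.
    for k in (19, 13, 10):
        p = prob_exact_seed(250, 0.05, k)
        print(f"seed length {k}: P(at least one exact seed) = {p:.4f}")
```

Under this simplified model the large majority of 250 bp reads still contain at least one 19 bp exact stretch, because mismatches are not evenly spaced and some gaps between them are longer than average. It says nothing about whether the resulting alignments score well enough to be reported, so treat it as a sanity check and not as a prediction of mapping rate.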