I've come across an peculiarity when aligning a 19bp sequence (AGCATGGGGAGCTCCCGGG
) to the human reference genome (hg38
) using bwa aln
.
When aligning with 6 mismatches allowed (-n 6
), I obtained a smaller number of hits compared to an alignment with 5 mismatches allowed (-n 5
).
I used the following configurations for the alignments:
bwa aln -n 6 -k 0 -l 100 -o 0 -N hg38Index toyExample.fastq > out_n5.sai
bwa aln -n 6 -k 0 -l 100 -o 0 -N hg38Index toyExample.fastq > out_n6.sai
where toyExample.fastq
is a dummy fastq file containing my sequence:
@AGCATGGGGAGCTCCCGGG
AGCATGGGGAGCTCCCGGG
+AGCATGGGGAGCTCCCGGG
~~~~~~~~~~~~~~~~~~~
I set the seed length to a large number to ignore seed-specific constraints.
Anyone has any idea what's going on? I would expect all the hits stored in out_n5.sai
to be a subset of the hits stored in out_n6.sai
, but that's not the case. This phenomenon does not happen for smaller number of mismatches (that is all hits found for -n 4
mismatches are also found with -n 5
mismatches, for instance).
Wildly guessing, but could 6 mismatches result in more multimappers, rendering the sequence as unmapped. Please define how you determined that a sequence was mapped (or not), or rather what a "hit" is?