My first post on here.
I am using BWA-MEM in a rather unusual situation: I am aligning short sequences of around 150 bp to small reference sequences of variable length. When I provide a reference file containing multiple similar reference sequences, my overall alignment rates increase considerably.
ref1.fasta:
>seq1
CAGGCTCTGCTCTTCATAATCATACCTTTGTGACTCAGGATGCTGT
>seq2
CAGGCTCTGCTCTTAATATCTGGCCGTCGTATTCCACCTCTGCGACTCATGATGCTGT (100,000 aligned)
>seq3
CAGGCTCTGCTCTTCATAATTTCTATCTTGCCCACCCTACTCGACACAGAGCAAAAATCCAACACTCCCAATATTGCCGTGGCTTCGACCTCTTGCTCAGATTTTCTTGTTACCTTTGTGACTCAGGATGCTGT
>seq4
CAGGCTCTGCTCTTCATAACCCTCCCTGCGAGTCCTTAAGTCTGACTCGGATCCTTAAACAACCTTTTCTTACCTTTGTGACTCAGGATGCTGT
ref2.fasta:
>seq2
CAGGCTCTGCTCTTAATATCTGGCCGTCGTATTCCACCTCTGCGACTCATGATGCTGT (25,000 aligned)
More reads from my FASTQ files align to ref1.fasta than to ref2.fasta, but the alignments against ref1.fasta tolerate a far greater number of deletions and mismatches.
I realize this is not what BWA-MEM was really designed to do, but I would be really grateful if you could help explain this behavior. Could it have something to do with the initial seeding of the alignment?
Many thanks, Steve W.
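One way to put numbers on the difference described above is to tally, per reference sequence, how many reads align and how many deleted, inserted, and soft-clipped bases their CIGAR strings contain. The following is a minimal sketch, assuming uncompressed SAM text as input and that the aligner writes the optional `NM:i:` edit-distance tag; the function name `tally_sam` is made up for illustration.

```python
import re
from collections import defaultdict

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def tally_sam(lines):
    """Per-reference totals of aligned reads, deleted/inserted/soft-clipped
    bases, and summed NM edit distance, from an iterable of SAM lines."""
    stats = defaultdict(
        lambda: {"reads": 0, "del": 0, "ins": 0, "clip": 0, "nm": 0}
    )
    for line in lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, cigar = int(fields[1]), fields[2], fields[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x904 or cigar == "*":
            continue
        s = stats[rname]
        s["reads"] += 1
        for length, op in CIGAR_RE.findall(cigar):
            if op == "D":
                s["del"] += int(length)
            elif op == "I":
                s["ins"] += int(length)
            elif op == "S":
                s["clip"] += int(length)
        # NM counts mismatches plus inserted and deleted bases, so
        # mismatches can be estimated as nm - ins - del.
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                s["nm"] += int(tag[5:])
    return stats
```

Running it on the SAM output of each run (for example `tally_sam(open("ref1.sam"))`, where the file name is a placeholder) and comparing the `seq2` entries would show whether the extra alignments differ mainly in deletions, mismatches, or clipping.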
Are those 75k reads longer than seq2?
Likely since OP says
Yes, my reads are around double the length of seq2, with a maximum of 150 bp.
May be of interest: https://jeremy9959.net/Blog/TheMEMinBWAMem-fixed/
Thank you. I think I came across this one when I was searching for an answer. I'm fairly sure the seeding has something to do with this strange behavior, but I have yet to pinpoint the exact cause.
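To test the seeding idea, a rough check is whether a given read shares an exact match of at least the minimum seed length with each reference. The sketch below is only a crude proxy: BWA-MEM finds supermaximal exact matches through an FM-index and also re-seeds, none of which is reproduced here. It assumes the default minimum seed length of 19 (`bwa mem -k`), looks at the forward strand only, and the helper names are made up for illustration.

```python
MIN_SEED_LEN = 19  # default of `bwa mem -k`; change if you ran with another -k

def longest_exact_match(read, ref):
    """Length of the longest substring shared by read and ref
    (forward strand only, simple dynamic programming)."""
    best = 0
    prev = [0] * (len(ref) + 1)
    for a in read:
        cur = [0] * (len(ref) + 1)
        for j, b in enumerate(ref, 1):
            if a == b:
                # Extend the exact match that ended one base earlier in both.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best = cur[j]
        prev = cur
    return best

def seedable(read, refs, k=MIN_SEED_LEN):
    """Map each reference name to True if the read shares an exact
    match of at least k bases with that reference."""
    return {
        name: longest_exact_match(read, seq) >= k
        for name, seq in refs.items()
    }
```

Applying `seedable` to a handful of reads that align against ref1.fasta but not against ref2.fasta would show which references could supply a seed for them, which may help narrow down whether the extra alignments start from seeds in the shared flanking sequence.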