I have a very specific use-case for short read alignment that I was hoping the community would be able to help me out with.
I am working on a project to detect errors in short (< 1 kb) synthesised DNA. We are using a variety of approaches to sequence these molecules, including PacBio HiFi and MiSeq, and we are sequencing them very deeply (~ 20k - 200k X). I have played around with bwa mem and minimap2, and while both perform very well in general, they produce results that differ from those I obtain by manually aligning individual reads to the reference sequence (presumably a trade-off for their blistering speed).
For example, in comparing the alignment of the same read (using the below parameters):
And then comparing with a manual alignment using NCBI BLAST:
bwa mem produces the result that I would expect.
For aligning PacBio HiFi reads I use the following parameters:
bwa mem
bwa mem -x pacbio <ref.fa> \
| samtools sort - \
| samtools view -q 30 -Sb - > s1_bwa.bam
minimap2
minimap2 -ax asm20 --MD --cs --eqx <ref.fa> \
| samtools sort - \
| samtools view -q 30 -Sb - > s1_mm2.bam
Thus, I would like to know which read aligner the community would recommend as the most accurate, where speed is not at all a priority.
Is there a particular tool or algorithm that would be most appropriate?
The alignments that you show above are different only because your scoring matrices are different.
An alignment shows you an arrangement that maximizes the alignment score.
Both of your alignments have two bases missing. The question is simply how the penalties add up: do two consecutive gaps plus a mismatch produce a bigger score than two nonconsecutive gaps and no mismatch?
If the cost of opening a new gap is larger than the cost of extending an already open gap plus a mismatch, you will get the alignment shown as mm2. If not, you will get the alignment labeled as mem (or blast).
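To make the arithmetic concrete, here is a minimal bash sketch that scores both arrangements under two scoring schemes. Everything in it is an assumption for illustration: the `aln_score` helper, the 98-base read against a 100-base reference, the two parameter sets (not claimed to be any tool's defaults), and the convention that a gap of length k costs gap_open + k * gap_extend. Under that convention the lost match also counts against the single-gap arrangement.

```shell
#!/usr/bin/env bash
# Score an alignment under affine gap penalties (illustrative helper).
# Assumed convention: a gap of length k costs gap_open + k * gap_extend.
# Usage: aln_score MATCHES MISMATCHES "GAP_LENGTHS" MATCH MISMATCH GAP_OPEN GAP_EXTEND
aln_score() {
    local matches=$1 mismatches=$2 gaps=$3
    local match=$4 mismatch=$5 gap_open=$6 gap_extend=$7
    local score=$(( matches * match - mismatches * mismatch ))
    local k
    for k in $gaps; do
        score=$(( score - gap_open - k * gap_extend ))
    done
    echo "$score"
}

# Hypothetical 98-base read against a 100-base reference (two bases missing).
# Arrangement 1: one 2-base gap plus one mismatch -> 97 matches, 1 mismatch.
# Arrangement 2: two separate 1-base gaps         -> 98 matches, 0 mismatches.
# Each scheme is "match mismatch gap_open gap_extend" (made-up values).
for scheme in "1 4 6 1" "1 2 1 1"; do
    one_gap=$(aln_score 97 1 "2" $scheme)
    two_gaps=$(aln_score 98 0 "1 1" $scheme)
    echo "scheme [$scheme]: one gap + mismatch = $one_gap, two gaps = $two_gaps"
done
```

With the first scheme the single gap plus mismatch scores higher (85 vs 84); with the second the two separate gaps score higher (92 vs 94). Same read, same reference, different optimal alignment. Both bwa mem and minimap2 take -A, -B, -O and -E options for the match score, mismatch penalty, gap open and gap extension, and a preset chosen with -x may change them, so setting them explicitly to matching values should make the two tools easier to compare.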
But as I mentioned before, and I will reiterate: it is not aligner "accuracy" that causes this, it is the scoring matrix (the scoring parameters that you run the aligner with). You may not have set any explicitly, in which case the aligner chose some defaults.
The differences you see are because the default scoring matrices were different, not because one aligner is "more accurate". All your alignments are optimal and correct within the parameters that were set. It is just that the parameters were different in each case.
Please do not delete answered questions; it is considered rude. I would not answer a question that is then later deleted. I am answering to help other people as well.