Question

What is the most accurate short (ish) read aligner?

2

Entering edit mode

4.0 years ago

nanana ▴ 120

I have a very specific use-case for short read alignment that I was hoping the community would be able to help me out with.

I am working on a project to detect errors in short (< 1Kb) synthesised DNA. We are using a variety of approaches to sequence these molecules including PacBio Hifi and MiSeq, and we are sequencing them very deep (~ 20k - 200k X). I have played around with bwa mem and minimap2, and while these both perform very well generally, they generate different results to those obtained by manual alignment of individual reads to the reference sequence (presumably this comes at the cost of blistering speed).

For example, in comparing the alignment of the same read (using the below parameters):

enter image description here

And then comparing with a manual alignment using NCBI BLAST:

enter image description here

bwa mem produces the result that I would expect.

For aligning PacBio HiFi reads I use the following parameters:

bwa mem

bwa mem -x pacbio <ref.fa> \
  | samtools sort - \
  | samtools view -q 30 -Sb - > s1_bwa.bam

minimap2

minimap2 -ax asm20 --MD --cs --eqx <ref.fa> \
  | samtools sort - \
  | samtools view -q 30 -Sb - > s1_mm2.bam

Thus, I would like to know what the community would recommend in terms of most accurate read aligner, where speed is not at all a priority.

Is there a particular tool or algorithm that would be most appropriate?

aligner sequencing bwa minimap2 • 2.6k views

ADD COMMENT • link updated 4.0 years ago by Istvan Albert 102k • written 4.0 years ago by nanana ▴ 120

0

Entering edit mode

Rhe alignments that you indicate above are different only because your scoring matrices are different.

An alignment shows you an arrangement that maximizes the alignment score.

Both of your alignments have two bases missing, The question is simply how do the penalties add up. Do two consecutive gaps plus a mismatch produce a bigger score than two nonconsecutive gaps and no mismatch.

If the cost of opening a new gap is larger than that of extending an already opened gap + mismatch you will get the alignment shown as mm2. If not you will get the alignment labeled as mem (or blast).

But as I mentioned before and I will reiterate: it is not the aligner "accuracy" that causes this, it is the scoring matrix (scoring parameters that you are running the aligner with). You may have not explicitly set any, in which case the aligner chose some defaults.

The differences you see are because the default scoring matrices were different not because one aligner is "more accurate". All your alignments are optimal and correct, within the parameters that were set. It is just that the parameters were different in each case.

ADD REPLY • link 4.0 years ago by Istvan Albert 102k

0

Entering edit mode

please do not delete answered questions, it is considered rude. I would not answer a question that is then later deleted. I am answering to help people as well.

ADD REPLY • link 4.0 years ago by Istvan Albert 102k

score 0 · Answer 1 · 2021-03-31

You need to be more specific when you say:

I have played around with bwa mem and minimap2, and while these both perform very well generally, they generate different results to those obtained by manual alignment of individual reads to the reference sequence

What does "different" mean? Different in what way? Are you using the identical alignment parameters when aligning manually? The defaults across tools are usually wildly different.

No short-read aligner should produce a different alignment than what is optimal for that particular scoring matrix. A short read aligner will stop looking sooner, or shall we say searches in less thorough manner than an optimal one. Hence the speedup.

The primary difference between short-read aligners is how "hard" they try and when they give up, how many alternative alignments they report etc.

But the actual alignments, when found, should be identical (as long as you use the same alignment scoring).

score 0 · Answer 2 · 2021-03-31

In some ways this is going to depend on class of data you are working with. For long reads minimap2 is going to be champ. If you use it for reads shorter then 100 bp it may not fare so well since it was not designed for that type of data. For shorter Illumina reads most common aligners would likely to a reasonable job (on flip side these aligners may not be able to handle your PacBio reads). It may come down to specific features (e.g. ability to display all 100+ alignments for a read that may be multi-mapping equally well) that may only be possible with certain aligners.