Here are the alignment counts for five 31bp reads as reported by 5 aligners:
Sequence                         bwa -R10000/GEM -s1   MrFast -e2        bowtie -l31        bowtie2
----------------------------------------------------------------------------------------------------------
AGGGCCAGCTAAATTTTGTATTTTTAGAAGA  27 2-mismatch         21 2m             27 2m              1 2m; 1 3m
GTCAGTATGGCCAATTTCACGATAATGATTC  50 2-mismatch         30 2m             50 2m              unmapped
TTTGATTCTTCTGTCTTTTCTTCTTTAATAG  103 1m; 4119 2m       60 1m; 2053 2m    103 1m; 4098 2m    103 1m; 24 2m
GCCCCCTGAGCAGCTGGGATTACAGGCGAGC  10 1m; 121 2m         10 1m; 115 2m     10 1m; 73 2m       10 1m; 7 2m; 17 3m
ATCTTTCTCTTCTTATTTTTGTATGAATATA  10 1-del              10 1d             unmapped           10 1d
Only BWA and GEM report all the hits up to 2 mismatches/indels. Bowtie1 finds the best hits, but not gapped or suboptimal hits. Bowtie2 and MrFast may miss 2-mismatch hits. On chr1, segemehl does not output suboptimal hits; it finds all the 1-difference hits but misses two 2-mismatch hits. I have not tried segemehl on the whole genome as it is inconvenient to run.
I have also tried bwa, bowtie and gem at a larger scale. For ~4200 31bp reads with 2 mismatches, bwa and gem report exactly the same number of hits for every read. Bowtie, on the other hand, misses hits for 25% of the reads: bowtie does not guarantee to find all hits. If I use bwa and gem to find 2-difference hits, bwa finds more, suggesting that gem does not guarantee to find all gapped alignments. Note that I am not saying that bwa can find all 2-difference hits. In theory it should, but in practice, optimizations to the original algorithm may lead to bugs in corner cases, which I have never checked.
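(For spot-checking single reads against one chromosome, this kind of count can be verified by brute force. Below is a minimal, unoptimized sketch that counts all placements of a read within a given Hamming distance on both strands - mismatches only, no gaps. The file name chr1.txt is just a placeholder for a plain-text chromosome sequence; nothing here is any mapper's actual code.)

```python
# Brute-force sketch: count all placements of a read within k mismatches of a
# reference sequence. Far too slow for whole-genome use, but fine for checking
# per-read hit counts like the ones in the table above.

def revcomp(s):
    """Reverse complement (A/C/G/T only)."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_k_mismatch_hits(read, ref, k):
    """Number of positions where read aligns to ref with <= k mismatches."""
    m, hits = len(read), 0
    for i in range(len(ref) - m + 1):
        mism = 0
        for a, b in zip(read, ref[i:i + m]):
            if a != b:
                mism += 1
                if mism > k:
                    break
        if mism <= k:
            hits += 1
    return hits

if __name__ == "__main__":
    # "chr1.txt" is a placeholder: one chromosome as plain text, no FASTA header.
    with open("chr1.txt") as f:
        chrom = f.read().replace("\n", "").upper()
    read = "AGGGCCAGCTAAATTTTGTATTTTTAGAAGA"
    total = count_k_mismatch_hits(read, chrom, 2) \
          + count_k_mismatch_hits(revcomp(read), chrom, 2)
    print(total)
```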
In summary, among the mappers tested here, bowtie, bowtie2, mrfast and segemehl do not guarantee to find all the 2-mismatch hits for ~30bp reads. Only bwa and gem achieve full sensitivity to mismatches. Bwa is more sensitive than gem for gapped alignments. Nonetheless, at full sensitivity gem still works with 100bp reads, while bwa has to resort to heuristics. Gem is a better mapper than bwa (IMHO gem is the first mapper better than the original bwa all around), as long as you use it the right way.
EDIT: for gapped alignment, bwa may produce duplicated hits (a bug I thought I had fixed...). That said, gem does not guarantee to find all gapped hits: I have found one example where the best match contains 1 gap, yet gem only reports 2-mismatch hits.
EDIT2: Conclusions (so far):
With the suggestions from nkrumm and David, we can say that bwa, gem, mrfast, segemehl and RazerS3 probably guarantee to find all hits within a certain Hamming distance threshold. These mappers all support gapped alignment, but whether they guarantee to find all gapped hits given an edit-distance threshold remains unsettled. Neither bowtie nor bowtie2 guarantees to find all hits.
For practical use, bwa's try-and-reject strategy is probably faster for <32bp sequences. MrFast, gem and razers3 all use the pigeonhole approach; they should be faster for longer reads (e.g. 36-100bp). BWA and gem index the genome: they have a fixed memory footprint, are easier to run (you do not need to split the read file when it is too large), and also run much faster if you only have a few to a couple of thousand reads at hand. I would recommend them in general.
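(To illustrate the pigeonhole idea: split a read into k+1 disjoint seeds; if the read has at most k mismatches, at least one seed must match the reference exactly, so exact seed hits give candidate positions which are then verified. The toy sketch below keeps the reference in a plain in-memory hash and uses made-up sequences; real mappers use compressed genome indexes and are far more sophisticated.)

```python
# Toy pigeonhole k-mismatch search: exact-match k+1 disjoint seeds, verify candidates.

def build_seed_index(ref, seed_len):
    """Map every seed_len-mer in the reference to its start positions."""
    index = {}
    for i in range(len(ref) - seed_len + 1):
        index.setdefault(ref[i:i + seed_len], []).append(i)
    return index

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def pigeonhole_search(read, ref, index, k, seed_len):
    """All positions where read matches ref with <= k mismatches.

    The read is split into k+1 non-overlapping seeds; with at most k mismatches
    in the read, at least one seed must be an exact match (pigeonhole principle).
    """
    hits = set()
    for s in range(k + 1):
        start = s * seed_len
        seed = read[start:start + seed_len]
        for pos in index.get(seed, []):
            cand = pos - start                      # implied start of the full read
            if 0 <= cand <= len(ref) - len(read):
                if hamming(read, ref[cand:cand + len(read)]) <= k:
                    hits.add(cand)
    return sorted(hits)

if __name__ == "__main__":
    ref = "ACGTACGTTAGGCTAACGTACGTTAGCCTAACGT"   # made-up mini "genome"
    read = "ACGTTAGGCTA"
    k = 2
    seed_len = len(read) // (k + 1)                # k+1 seeds of this length
    idx = build_seed_index(ref, seed_len)
    print(pigeonhole_search(read, ref, idx, k, seed_len))   # exact hit and a 1-mismatch hit
```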
All that being said, for 100bp reads, finding all hits given a certain edit-distance threshold is not sufficient. We would like to see alignments containing one reasonably long gap (say, a 7bp insertion) and to rank the hits by an affine gap penalty instead of by edit distance. In computer science, edit distance is popular because it is clearly defined and many efficient algorithms exist for finding matches. However, an affine gap penalty is more meaningful biologically and evolutionarily.
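(A toy illustration of the difference - the penalty values below are arbitrary assumptions, not any mapper's actual scoring: a hit with one 7bp gap and a hit with 7 scattered mismatches have the same edit distance, but an affine gap penalty ranks the single long gap as the better alignment.)

```python
# Why affine gap penalties rank hits differently from edit distance.
MISMATCH = 4          # assumed mismatch penalty
GAP_OPEN = 6          # assumed gap-open penalty
GAP_EXTEND = 1        # assumed gap-extension penalty

def edit_distance_cost(n_mismatches, gap_lengths):
    """Edit distance: every mismatch and every gapped base counts as 1."""
    return n_mismatches + sum(gap_lengths)

def affine_gap_cost(n_mismatches, gap_lengths):
    """Affine gaps: each gap costs GAP_OPEN + GAP_EXTEND * length."""
    return (n_mismatches * MISMATCH
            + sum(GAP_OPEN + GAP_EXTEND * g for g in gap_lengths))

# Candidate hit A: one 7bp insertion, no mismatches.
# Candidate hit B: 7 scattered mismatches, no gaps.
for name, mism, gaps in [("one 7bp gap", 0, [7]), ("7 mismatches", 7, [])]:
    print(name,
          "- edit distance:", edit_distance_cost(mism, gaps),
          "- affine cost:", affine_gap_cost(mism, gaps))
# Both candidates have edit distance 7, but the affine cost prefers the single gap.
```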
This article compares 9 mappers and describes their algorithms. It doesn't deal with indels or paired reads (for the moment), but it's worth reading. Their pipeline is available here.
This is a great read. Thanks for posting it.
unrelated, but are you going to add GEM to the mapper ROC page?
It is not that simple. Firstly, I need to simulate more reads; I have seen fluctuations in the case of gem. Secondly, I need to write a proper gem-to-sam converter for paired-end reads. I also need to figure out how to squeeze the best out of gem for PE reads. Thirdly, although gem uses a better algorithm, there are quite a few things to do to make it a drop-in replacement for bwa/bowtie/etc. I am a little concerned about including a mapper which has potential but is not yet usable as a typical mapper (EDIT: by typical I mean in the context of variant calling). All that said, I should emphasize that the gem algorithm is better for >100bp reads. In principle, I can take its single-end output and generate a more accurate bwa-flavored SAM with half the total CPU time of bwa. PS: most short-read mappers do not collect enough information to be converted to bwa-flavored alignments - they ignore suboptimal hits for speed. Gem is one of the few that makes this possible.
Just out of curiosity, what keeps GEM from being a "typical mapper" in the context of variant calling? Is it really just the lack of a good gem-2-sam converter, or is there something inherent in the algorithm?
Mostly the implementation. Judging from its paper, the developers might think mapping quality is not of much use, so gem does not compute mapping quality. For single-end reads, I can estimate a good mapq using the bwa method, but it is hard to do the same for PE reads as gem outputs less information there. Also, in my hands its PE mapping is less accurate than SE, which should not be the case given that gem is clearly better than bwa for SE (that is why I said I have not got the best out of gem for PE). In addition, gem does not perform affine-gap alignment and does not left-align gaps, so you need a separate step to refine the CIGAR. On the algorithm side, a potential concern is that the current implementation requires a full-length match; as reads get longer, local alignment like bowtie2 or bwa-sw is preferred. Another caveat is the hard edit-distance threshold on high-quality bases (it allows many low-quality mismatches). By default it is 4%: if there is an indel longer than 4bp, or 5 mismatches on a 100bp read, gem will not map it. Such reads constitute only a tiny fraction of all reads - the overall mapping sensitivity is barely affected - but you will consistently lose alignments in highly divergent regions (Gerton first made me think this is important). To get these divergent reads mapped, you need to increase -e/-m if you do not want to do a multi-phase alignment, but in that case gem will run slower. I could be wrong on the last point. It would be more appropriate to evaluate human variant calls, but the lack of correctly left-aligned SAM makes such an experiment hard, for now.
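(For what it is worth, the general idea behind estimating a single-end mapping quality from the best and second-best hits can be sketched as below. This is only a toy, not bwa's actual formula; the scaling constants and the 60 cap are arbitrary assumptions for illustration.)

```python
# Toy single-end mapq sketch: a unique best hit with no close runner-up gets a
# high mapq; equally good placements get 0; a close suboptimal hit lowers confidence.

def toy_mapq(best_score, second_best_score, n_best_hits):
    if n_best_hits > 1:
        return 0                      # repetitive placement: no confidence
    if second_best_score is None:
        return 60                     # no competing hit at all (capped)
    # Scale confidence by how far the runner-up lags behind the best hit.
    return max(0, min(60, 6 * (best_score - second_best_score)))

print(toy_mapq(50, 44, 1))   # unique best hit, distant runner-up -> 36
print(toy_mapq(50, 50, 3))   # three equally good placements -> 0
```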
Hi, recently I have run into a problem with GEM. When I try to build an index for the human genome with GEM, it reports an error.
The command I used is:
Could you help me solve this problem?
Thanks!
New ROCs are here (PDF).
thanks! BWA still looking good.
That is the new algorithm released today. The original BWA falls short for single-end reads. I am still worried that I have not used gem properly for paired-end.
Sorry for being dumb, but where is the mapper ROC page?
http://lh3lh3.users.sourceforge.net/alnROC.shtml
Thanks brentp!!
Hi lh3, can you add the parameters used for each tool and a description of your inputs (and how you generated them)? This would help a great deal with reproducing your tests.
One comment on segemehl: was the -H option set to 0? (-H, --hitstrategy: report only best scoring hits (=1) or all (=0); default: 1)
What about adding STAR?
Well, there are just so many mappers. I am sure that if you go through the 40-ish papers, you will find more mappers that claim to find all k-mismatch/k-difference hits. It is hard to cover them all; my initial goal was to focus on the most popular mappers. Gem is actually not very popular right now, but I think it has great potential.
You mention that segemehl has "several" practical concerns with most of its functionality covered by other mappers. Can you please go into some more detail?
EDIT: in short, segemehl: 1) requires too much memory; 2) is very inconvenient to run if we split the genome for a smaller memory footprint; 3) does not output the best alignment with a proper mapping quality, which is required by many variant callers.
The huge memory footprint of segemehl is indeed the main drawback (that's not new). But most collaborators I work with have machines with more than 50 GB of main memory, so I think main memory is no longer a problem nowadays, at least for big institutes. Nevertheless, you are completely right: the required memory is huge and segemehl is not a good choice to run on your personal computer. As for your practical concerns (splitting the genome into chromosomes), I'm absolutely with you. Don't use segemehl like that... you WILL go nuts.
Will 64GB be enough? I read in the segemehl manual that you need 6GB for chromosome 1, which is about 1/10 of the human genome. At that rate you would need about 60GB for the whole genome. With disk cache and a couple of other multi-GB jobs on the machine, 64GB may be pushing the limit.
But despite the memory problem, why would you not recommend segemehl for Illumina reads? That is actually a strong assertion, so please explain!
It does not produce SAM containing the best alignments with proper mapping quality, which is the typical input for most variant callers. For a mapper using heuristics, it is non-trivial to derive a good mapping quality. In addition, requiring 50GB of memory is by itself a big enough problem for me to discourage its use. Not every lab has access to such large-memory machines, and even the biggest institutes use huge clusters in which each computing node has limited shared memory. I would not recommend something that uses unnecessary memory for results achievable by many other mappers.
Did you test Mosaik?
This was a while ago, but if people are still interested in this question, try vmatch. It's a general-purpose aligner that can handle this situation for both Hamming and edit distance.