segemehl is a program for mapping sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to correctly map reads contaminated with primer or polyadenylation sequence. segemehl implements a matching strategy based on enhanced suffix arrays (ESA). segemehl now supports the SAM format, reads gzipped queries to save both disk and memory space, and allows bisulfite sequencing mapping and split-read mapping.
- adapter prediction and/or clipping
- mapping of single-end or paired-end data
- mapping with mismatches, insertions and deletions
- reporting of all mapping loci of a multi-mapping read (either only the best-scoring hits or all mappings within a set accuracy)
- multiple split read mapping (and downstream splice site detection)
- bisulfite mapping
- multithreading
For more information see: http://hoffmann.bioinf.uni-leipzig.de/LIFE/segemehl.html
Publication:
Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermueller J: "Fast mapping of short sequences with mismatches, insertions and deletions using index structures", PLoS Comput Biol (2009) vol. 5 (9) pp. e1000502
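Since segemehl writes SAM, the insertions and deletions reported for each read can be tallied from the CIGAR column of the output. Below is a minimal Python sketch of that idea. It relies only on the standard SAM column layout (flag in column 2, CIGAR in column 6); the function names and the example records are hypothetical and not part of segemehl.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_op_lengths(cigar):
    """Sum the lengths per CIGAR operation; '*' (no alignment) gives an empty tally."""
    totals = Counter()
    if cigar == "*":
        return totals
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    return totals


def indel_summary(sam_lines):
    """Count mapped records and their inserted/deleted bases (hypothetical helper)."""
    mapped = inserted = deleted = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        if int(fields[1]) & 0x4:          # flag bit 0x4: read is unmapped
            continue
        ops = cigar_op_lengths(fields[5])
        mapped += 1
        inserted += ops["I"]
        deleted += ops["D"]
    return {"mapped": mapped, "inserted_bases": inserted, "deleted_bases": deleted}


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        "@SQ\tSN:chr1\tLN:1000",
        "r1\t0\tchr1\t10\t60\t5M2I3M\t*\t0\t0\tACGTACGTAC\t*",
        "r2\t16\tchr1\t50\t60\t4M1D6M\t*\t0\t0\tACGTACGTAC\t*",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
    ]
    print(indel_summary(example))
```

For a real file, the same function can be fed an open file handle, for example `open(path)` for plain SAM or `gzip.open(path, "rt")` if the output was compressed afterwards.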
Interesting - I'm planning to do a "shootout" with aligners as well - I'll add this one to the list
I would be very interested in the outcome! Let me know once you have results!
What is the license of SEGEMEHL? GPL?
As far as I know, there is no license for segemehl yet. They just write: "...free software for non-commercial use..."
I've benchmarked Segemehl against BWA-MEM, Bowtie2 and MOSAIK and found that, for datasets with a lot of variation, it maps more reads and does so with significantly greater accuracy. However, this is with default parameters, and I found that BWA responds better than Segemehl to tuning for mapping sensitivity on reads with high variation. In fact, BWA generally outperforms Segemehl under looser definitions of accuracy, while running faster and using less memory. Yet such parameter optimisation is a pain, and people generally run these tools with defaults. Segemehl is better out of the box at calling indels exactly, although it is quite a lot slower. See figure below (using the CuReSimEval strict mapping definition).
Did you also try to optimize segemehl for your dataset, or just BWA? ;)
Hi David, I've only just seen your reply - apologies! It was a while ago, but I did struggle with optimising Segemehl for our data through parameter sweeps. It just didn't seem to improve our results. It would be arrogant of me to suggest that our benchmark criteria were definitely not to blame for this result.
What parameters might you suggest for sensitively mapping high diversity (indel and mismatch) sequences?