Question

Book For String Matching Algos

6

Entering edit mode

13.9 years ago

Abhi ★ 1.6k

Hi All

I am routinely coming across challenges in string matching which makes me belv its time I need to refresh my string comparison algos specifically with DNA type strings.

Any good resource/book you can recommend ? I am looking for something which talks about the advance string matching stuff using compressed algos like BWT etc.

Best, -Abhi

next-gen sequencing • 4.9k views

ADD COMMENT • link updated 13.9 years ago by Martijn Van Iersel ▴ 570 • written 13.9 years ago by Abhi ★ 1.6k

Ram · Answer 1 · 2011-09-13

6

Entering edit mode

13.9 years ago

Pierre Lindenbaum 166k

alt text

Algorithms on strings M. Crochemore, C. Hancart and T. Lecroq Cambridge University Press

See also: http://www-igm.univ-mlv.fr/~lecroq/lec_en.html

alt text

Algorithms on strings, trees, and sequences: computer science and computational biology http://books.google.com/books/about/Algorithms_on_strings_trees_and_sequence.html?id=STGlsyqtjYMC

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.9 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

Gusfield Algorithms on strings, trees and sequences is by far the best one I have read on the subject. Thanks for the other recommendations. Will check'em out.

ADD REPLY • link 13.9 years ago by Arun 2.4k

score 4 · Answer 2 · 2011-09-13

The most classical book in this area is Dan Gusfield's Algorithms on strings trees and sequence. It is particularly famous for the comprehensive introduction to suffix trees and its various applications in biological sequence analysis. Another noteworthy data structure is Enhanced Suffix Array (ESA). It mimics most of the suffix tree operations in a memory efficient way. ESA is implemented in vmatch, seqan and a few others. ESA and suffix tree are closely related. Reading Gusfield's book helps to understand the ESA paper.

A less related field is compressed full-text indexing where the most outstanding result in my view is FM-index published in 2000. FM-index has a weak link to suffix trie/tree in that we can simulate a top-down traversal but not bottom-up traversal with FM-index. The key advantage of FM-index is that it typically has a much smaller memory footprint than ESA and suffix tree. This is why FM-index is more widely used for human sequence alignment although suffix tree and ESA are more versatile data structures in general.

However, I have not read any good books on FM-index. Most of books talking about FM-index, including the one suggested by GWW, are a little light. These books seldom cover more than the first FM-index paper (a more recent version is easier to read, though you may want to ignore the compression part). From the books, you may know how to do exact match with FM-index, but may not know how to write your own practical implementations. Furthermore, there have been quite a few advances in the past few years which help the development of the BWT-based short-read mappers/assemblers. For example, bwa and bwa-sw are based on a 2008 paper, SOAP2 and the published SGA based on a paper published in 2009 and bowtie uses results in a 2007 paper. Without these recent works in computer science, we may not have these BWT-based mappers/assemblers or may just have crippled tools. No books so far introduce these latest findings. You have to go through papers if you really want to implement something practical.

You actually do not need to have much knowledge on suffix tree and ESA to understand FM-index. You can start to study FM-index right away (I did that), though you may want to read Gusfield's masterpiece ultimately. Also, for a quick start, it is recommended to reuse others' code to construct FM-index/BWT (I did that, too). Constructing BWT is non-trivial, but of less research interest to biologists.

score 3 · Answer 3 · 2011-09-13

For the burrows-wheeler transform the best book I could find is this one here, it covers suffix trees, suffix arrays, and the burrows-wheeler transform; since they are all intimately related. It also covers advanced topics such as the FM-index.

The Burrows-Wheeler Transform :: Data Compression, Suffix Arrays, and Pattern Matching

by Donald Adjeroh, Timothy Bell, and Amar Mukherjee

alt text

Ram · Answer 4 · 2011-09-13

2

Entering edit mode

13.9 years ago

Eric Fournier ★ 1.4k

Recommend Your Favorite Bioinformatics Books has information about Bioinformatics books. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology is one of the books that is recommended. I remember reading parts of it and finding it enlightening.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.9 years ago by Eric Fournier ★ 1.4k

Ram · Answer 5 · 2011-09-13

2

Entering edit mode

13.9 years ago

Martijn Van Iersel ▴ 570

This website is probably a good place to start: http://www-igm.univ-mlv.fr/~lecroq/string/. It lists dozens of String-searching algorithms.

The book Algorithms by Robert Sedgewick is good, it contains a whole chapter on substring searching. There is also a website with excerpts and code.

alt text

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.9 years ago by Martijn Van Iersel ▴ 570