Book For String Matching Algos
5
6
Entering edit mode
13.2 years ago
Abhi ★ 1.6k

Hi All

I am routinely coming across challenges in string matching which makes me belv its time I need to refresh my string comparison algos specifically with DNA type strings.

Any good resource/book you can recommend ? I am looking for something which talks about the advance string matching stuff using compressed algos like BWT etc.

Best, -Abhi

next-gen sequencing • 4.3k views
ADD COMMENT
6
Entering edit mode
13.2 years ago

alt text

Algorithms on strings M. Crochemore, C. Hancart and T. Lecroq Cambridge University Press

See also: http://www-igm.univ-mlv.fr/~lecroq/lec_en.html

alt text

Algorithms on strings, trees, and sequences: computer science and computational biology http://books.google.com/books/about/Algorithms_on_strings_trees_and_sequence.html?id=STGlsyqtjYMC

ADD COMMENT
1
Entering edit mode

Gusfield Algorithms on strings, trees and sequences is by far the best one I have read on the subject. Thanks for the other recommendations. Will check'em out.

ADD REPLY
4
Entering edit mode
13.2 years ago
lh3 33k

The most classical book in this area is Dan Gusfield's Algorithms on strings trees and sequence. It is particularly famous for the comprehensive introduction to suffix trees and its various applications in biological sequence analysis. Another noteworthy data structure is Enhanced Suffix Array (ESA). It mimics most of the suffix tree operations in a memory efficient way. ESA is implemented in vmatch, seqan and a few others. ESA and suffix tree are closely related. Reading Gusfield's book helps to understand the ESA paper.

A less related field is compressed full-text indexing where the most outstanding result in my view is FM-index published in 2000. FM-index has a weak link to suffix trie/tree in that we can simulate a top-down traversal but not bottom-up traversal with FM-index. The key advantage of FM-index is that it typically has a much smaller memory footprint than ESA and suffix tree. This is why FM-index is more widely used for human sequence alignment although suffix tree and ESA are more versatile data structures in general.

However, I have not read any good books on FM-index. Most of books talking about FM-index, including the one suggested by GWW, are a little light. These books seldom cover more than the first FM-index paper (a more recent version is easier to read, though you may want to ignore the compression part). From the books, you may know how to do exact match with FM-index, but may not know how to write your own practical implementations. Furthermore, there have been quite a few advances in the past few years which help the development of the BWT-based short-read mappers/assemblers. For example, bwa and bwa-sw are based on a 2008 paper, SOAP2 and the published SGA based on a paper published in 2009 and bowtie uses results in a 2007 paper. Without these recent works in computer science, we may not have these BWT-based mappers/assemblers or may just have crippled tools. No books so far introduce these latest findings. You have to go through papers if you really want to implement something practical.

You actually do not need to have much knowledge on suffix tree and ESA to understand FM-index. You can start to study FM-index right away (I did that), though you may want to read Gusfield's masterpiece ultimately. Also, for a quick start, it is recommended to reuse others' code to construct FM-index/BWT (I did that, too). Constructing BWT is non-trivial, but of less research interest to biologists.

ADD COMMENT
0
Entering edit mode

@lh3 : Thanks for a very well written advice. Could you please tell where I could get the code snippet for the FM-index/BWT algo. I am asking without any google search coz I belv you would know it precisely.

ADD REPLY
3
Entering edit mode
13.2 years ago
Gww ★ 2.7k

For the burrows-wheeler transform the best book I could find is this one here, it covers suffix trees, suffix arrays, and the burrows-wheeler transform; since they are all intimately related. It also covers advanced topics such as the FM-index.

The Burrows-Wheeler Transform :: Data Compression, Suffix Arrays, and Pattern Matching

by Donald Adjeroh, Timothy Bell, and Amar Mukherjee

alt text

ADD COMMENT
2
Entering edit mode
13.2 years ago
Eric Fournier ★ 1.4k

Recommend Your Favorite Bioinformatics Books has information about Bioinformatics books. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology is one of the books that is recommended. I remember reading parts of it and finding it enlightening.

ADD COMMENT
2
Entering edit mode
13.2 years ago

This website is probably a good place to start: http://www-igm.univ-mlv.fr/~lecroq/string/. It lists dozens of String-searching algorithms.

The book Algorithms by Robert Sedgewick is good, it contains a whole chapter on substring searching. There is also a website with excerpts and code.

alt text

ADD COMMENT

Login before adding your answer.

Traffic: 2584 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6