Question

Hash Table Vs Suffix Array Alignment Error Tolerance

1

Entering edit mode

10.8 years ago

stolarek.ir ▴ 700

Can anyone in few short words (or links to some resources) tell me something on why hash tables are usually better than FM-index aligners in terms of error tolerance (however now this changed, and bwa-sw is masively robust for errors).

I know how it looks like in the results, but, from the algorithmic point of view, what determines, that hash table is better/worse than suffix array, suffix tree ?

Thanks for any help

bwa • 3.3k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 10.8 years ago by stolarek.ir ▴ 700

Ram · Answer 1 · 2015-07-24

There's likely little-to-no-difference based on the indexing structure. Rather, differences you see are more likely based on the search algorithms used over these structures. For example, a BWT / Suffix Array can be used as an exact k-mer hash (for arbitrary k). That is, these full-text indices allow finding all occurrences of an exact pattern a an arbitrary length (if you set that arbitrary length = k, you have effectively the same info as a k-mer hash). However, the tools based on these indices can work in massively different ways. For example, the early BWT-based tools (e.g. Bowtie) attempted to search for "anchors" for a read that were within a specified hamming distance of the first x bases of the read (I think 22 by default). You could imagine situations in which the first 22 bases contained too many errors, but a k-mer based approach that sought to find a seed anywhere in the read would be successful. However, these differences in strategy are not mandated by the underlying data structures, and you could certainly try and find seeds anywhere in the read using a BWT / Suffix array / FMD index. In fact, this is what BWA-mem does --- it looks for potential mapping locations for a read by finding maximal exact matches (or super-maximal exact matches) between the read and the reference. These MEMs (SMEMs) can occur anywhere in the read.

There are other tradeoffs between full text indices and k-mer indices. K-mer indices tend to be larger, but potentially faster (better caching behavior in many cases) --- however recent tools such as HISAT show how we can make full text indices more cache efficient to improve their practical performance. However, k-mer hash tables are, in a sense, more brittle. If you build the table with e.g. k = 31, this may be inappropriate for mapping very short reads, and the only recourse is to re-index with a smaller k (while the full-text index need not be rebuilt). Further, they approaches are not disjoint, and great speed gains can be achieved in e.g. suffix array search, by augmenting the suffix array with a k-mer hash table for a small, fixed k.