Hash Table Vs Suffix Array Alignment Error Tolerance
1
1
Entering edit mode
10.8 years ago
stolarek.ir ▴ 700

Can anyone in few short words (or links to some resources) tell me something on why hash tables are usually better than FM-index aligners in terms of error tolerance (however now this changed, and bwa-sw is masively robust for errors).

I know how it looks like in the results, but, from the algorithmic point of view, what determines, that hash table is better/worse than suffix array, suffix tree ?

Thanks for any help

bwa • 3.3k views
ADD COMMENT
4
Entering edit mode
9.4 years ago
Rob 6.9k

There's likely little-to-no-difference based on the indexing structure. Rather, differences you see are more likely based on the search algorithms used over these structures. For example, a BWT / Suffix Array can be used as an exact k-mer hash (for arbitrary k). That is, these full-text indices allow finding all occurrences of an exact pattern a an arbitrary length (if you set that arbitrary length = k, you have effectively the same info as a k-mer hash). However, the tools based on these indices can work in massively different ways. For example, the early BWT-based tools (e.g. Bowtie) attempted to search for "anchors" for a read that were within a specified hamming distance of the first x bases of the read (I think 22 by default). You could imagine situations in which the first 22 bases contained too many errors, but a k-mer based approach that sought to find a seed anywhere in the read would be successful. However, these differences in strategy are not mandated by the underlying data structures, and you could certainly try and find seeds anywhere in the read using a BWT / Suffix array / FMD index. In fact, this is what BWA-mem does --- it looks for potential mapping locations for a read by finding maximal exact matches (or super-maximal exact matches) between the read and the reference. These MEMs (SMEMs) can occur anywhere in the read.

There are other tradeoffs between full text indices and k-mer indices. K-mer indices tend to be larger, but potentially faster (better caching behavior in many cases) --- however recent tools such as HISAT show how we can make full text indices more cache efficient to improve their practical performance. However, k-mer hash tables are, in a sense, more brittle. If you build the table with e.g. k = 31, this may be inappropriate for mapping very short reads, and the only recourse is to re-index with a smaller k (while the full-text index need not be rebuilt). Further, they approaches are not disjoint, and great speed gains can be achieved in e.g. suffix array search, by augmenting the suffix array with a k-mer hash table for a small, fixed k.

ADD COMMENT

Login before adding your answer.

Traffic: 1945 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6