Can FM Index be built in linear time?
2
3
Entering edit mode
8.3 years ago
Chen Sun ★ 1.1k

In short, my question is: 1. what is the running time of building FM-Index, is it linear in the sense of reference genome?

I was told that FM-Index can be built in linear time, i.e. O(n), where n is the length of reference genome. But I actually can not find the paper describing it.

The paper cited mostly "Indexing Compressed Text" by Ferragina et al. 2004 focus on the analysis of memory footprint and query time analysis, but not the running time of building FM Index.

FM-Index BWA Bowtie2 bowtie alignment • 3.4k views
ADD COMMENT
1
Entering edit mode

It's mentioned in the intro of Simpson and Durbin 2010. I haven't looked in detail, but presumably they link to something with a bit more analysis on the time complexity.

ADD REPLY
8
Entering edit mode
8.3 years ago
lh3 33k

In 2003, two (or maybe three) conference papers first showed that suffix array can be constructed in linear time. As generating BWT from suffix array takes linear time, they also proved that FM-index can be constructed in linear time. One of the most influential linear-time algorithms, SA-IS, was invented by Nong et al (2008). They provided a ~100-line implementation in C. It is much simpler and faster than the previous works. Yuta Mori optimized Nong et al's implementation into the sais library. A few years ago (when I was still following the literature), that library was fastest linear-time implementation in practical benchmarks. However, at least at that time, libdivsufsort, an O(nlogn) algorithm developed also by Yuta Mori, was widely believed to be the fastest open-source library to construct suffix arrays, faster than sais. Linear-time algorithms are not necessarily faster. If a linear algorithm is associated with a large constant and lots of cache misses, an O(nlogn) algorithm can be faster in practice, just as Rob said.

ADD COMMENT
0
Entering edit mode

I suppose BWA is also based on O(nlgn) algorithm, as analyzed in Li et al., 2014

ADD REPLY
0
Entering edit mode

Yes, read the bwa paper if you want to now more.

ADD REPLY
0
Entering edit mode

the running time analysis actually matters if we want to iteratively build FM-index (e.g. build FM-index 100 times by integrating genetic variants in each iteration) or dynamically build FM index

ADD REPLY
2
Entering edit mode
8.3 years ago
Rob 6.9k

Yes, there are a few linear construction algorithms. See, e.g. http://www.sciencedirect.com/science/article/pii/S030439750700477X and references therein. The algo is often written with a log Sigma factor, which goes away for constant sized alphabets like DNA. One practical note, the worst-case linear time algorithms are asymptotically optimal, but not always the fastest in practice on real data.

ADD COMMENT

Login before adding your answer.

Traffic: 2341 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6