I was wondering if anybody can explain how the FM-index was created for multiple entries of a sequence file (say, fasta format for a reference genome containing multiple chromosomes). In other words, how does the FM-index distinguish different sequences in a single fasta file?
While I was studying the burrows wheeler transformation (BWT) and FM-index on sequence manipulation, I seem to understand the idea of the BWT and FM-index for single string/sequence.
1) All the codes from github I can find are using single string/sequence as examples including the original FM index paper.
2) It seems to me a common strategy is to concatenate the sequences to get a single one, which brings to my original question: how does the FM-index distinguish the original entries? Remember the offset when they are joined?
3) The two programs bwa and bowtie2 are two complicated for me to
understand the details.
I am trying to understand the implementation of FM-index in C/C++ to create the FM-index for sequences/strings. Thanks in advance.
In the past week I read some papers and notes. FM index contains 1) First column of the BWT transformed matrix (F column) which is lexicographically sorted, so only need A/C/T/G four characters; 2) The last column of BWT matrix (L column) which is the same length of the reference genome string(s). 3) The suffix array SA[] (or, only a sample of the SA[] to save space, from which the full SA[] can be calculated on-fly(?)); 4) Checkpoints table (occurrences/rank/tally table) of each character (A/C/G/T). The L column string can be compressed using wavelet tree or otherdata structures etc, but I am not sure. Here is a code fragment of the fm-index data structure from github I found:
Two questions that maybe more specific: 1) How the index is created for the genome (To simplify the process, only consider one string T at this moment)? 2) How the search is implemented against the FM-index from 1)? For example: given pattern P=ACGTCACA, how the index is searched for P.
Thanks a lot!
Regarding your first question --- there are many different algorithms for generating the BWT+FM-index. Different tools will make use of different algorithms. One approach is to build the suffix array (again, there are many algorithms for this like DC3, and other suffix sorting algorithms; this one is very popular). From the suffix array, one can immediately extract the L column of the BWT since there is a trivial correspondence between them (see equation 1 on the second page 2 of Ben's notes here). Regardless, once you have the SA and the BWT, computing the rank table is straight forward. In reality, there are different approaches that trade off construction speed for construction memory etc. Conceptually, however, building the suffix array and then deriving everything you need from it is straightforward.
Regarding your second question, you search for a pattern P in the BWT using "backward search". Ben Langmead has some brilliant slides illustrating the idea; look at slides 26 and on here. You basically start from the end of the string, and try to find increasing length suffixes of the pattern that appear in your text. The FM index lets you extend the length of the suffix you search in constant time per-character. If you can extend the suffix length to be all of P, you've immediately found that your pattern appears in the text, and the range of BWT rows in which it appears (when paired with your suffix array) let you enumerate those actual occurrences.
Hi Rob! Thank you so much for your time answering my tedious questions. I seem to understand the algorithms described in Ben Langmead's notes you mentioned, and I should be more specific with my questions which were actually on the details of implementation. I am studying the C code at this moment trying to understand each step in detail.
Using above fm_index struct as example, my questions can be:
Thank you very much again for your help!