Entering edit mode
11.7 years ago
Manuel
▴
410
The SeqAn team is happy to announce the 1.4 Release of the SeqAn apps and libraries. This update includes all improvements and changes over the last two years since the 1.3 release. You can download the release packages at http://packages.seqan.de.
Some highlights are:
- New read mapping programs Masai (backtracks two string tries in parallel) and RazerS 3 (uses q-gram index filters and OpenMP).
- An efficient implementation of an FM index that combines good performance with low memory usage and can be iterated as a prefix trie via the string tree iterator interface.
- A new command line parser that facilitates the integration of programs into workflow engines such as KNIME or Galaxy.
- An extended and more robust I/O system, namely in the modules seq_io, gff_io, bam_io.
You can find a list of many more updates in our Trac wiki in the Changelog.
If you are upgrading from a previous release, you can learn about the new library content in the updated tutorial. In case of questions or support, please send an email to our mailing list.
Have fun!
The SeqAn Team
Great! SeqAn is undoubtedly the most comprehensive library for sequence analyses. A few questions though: what data structure are you using to keep FM-index? Wavelet tree? Is it entropy bounded? What is the more precise space complexity? Does it work over the ASCII alphabet or just the DNA alphabet? I am thinking about these questions, but I haven't found satisfactory solutions so far. Thanks in advance.
Hi there,
you are right, the FM index is based on a wavelet tree implementation. In contrast to the "Alphabet-Friendly FM-index" from Ferragina, P., Manzini, G., Mäkinen, V., & Navarro, G. (2004). An Alphabet-Friendly FM-index. (A. Apostolico & M. Melucci, Eds.) String Processing and Information Retrieval, 3246, 150–160. we only use one wavelet tree to store the whole BWT.
The wavelet tree is similar to a Huffman tree and balanced by the total occurrences of each character and requires overall H_0(T)(n + o(n)) bits in total.
The wavelet tree is a binary tree in which each node holds a rank support bit string as well as a pivot character. The rank support bit string consists of 3 strings, namely a bitString, a blockString and a superBlockString. The bitString consists of 0 and 1, the blockString tells how many 1 are present from the last superBlock till the current position and the superBlock tells how many 1 are there from the beginning until the current superBlock. In doing so we can make use of the 4 Russian Trick and compute the rank of any position in the rank support bit string in constant time.
Per default the block of the rank support bit string consists of 64 bit such that the overall memory consumption for a rank support bit string is:
bitString: n bits blockString: n / 64 * 16 bits superBlockString: n / (64 * 64) * 32 bits
In the worst case scenario all characters are equally distributed and the consumption is as follows:
For an alphabet size of 4 you have exactly 3 rank support bit strings in the wavelet tree with an overall length of 2n which consume (2 + 1/2 + 1/64)n bits, i.e. 2.515625 bits per text character Yes, you can use it with ASCII as well. Then it consumes 10.0625 bits per text character
By the way, there is also another simpler implementation which does not use the wavelet tree, but a rank support bit string for each character to represent the BWT. This version obviously consumes more space, but it is also much faster.
In order to retrieve the occurrences of some given pattern you also need a suffix array. Our suffix array is sparse and consist of a rank support bit string and a string of ints that stores suffix array entries of the "real" suffix array. In that case the rank support bit string indicates which of the original values are stored. Missing elements can be recomputed using the BWT. Depending on the compression rate this data structure requires more or less space.
For more details I can post a link to my Master's thesis and of course feel free to ask questions.