I'm confused by how an FM-index is built from a genome with multiple chromosomes (or more generically, any multi-sequence file). I understand the principles of the BWT, but do aligners such as BWA and Bowtie compute a separate BWT for each sequence or do they concatenate all sequences then compute a single BWT?
I'm asking partly out of curiosity, but also because I need to include the mitochondrial and chloroplast genomes in an index. BWA has one indexing method (IS) that can't handle a 'database' larger than 2 Gbp, while the other method (BWTSW) can't handle databases smaller than 10 Mbp (the organelle genomes are smaller than this).
I just don't know if 'database' in the documentation means the sum of all sequences or whether each sequence is considered a separate database. If the sequences are all concatenated then BWTSW should work fine, but otherwise it seems neither single indexing method works for both the large chromosomes I have to deal with and the tiny organelle genomes.
Thanks for your time!
Since a single index is built for one multi-FASTA reference, all of the sequences in it should count towards the size of the 'database'. Someone with the right programming chops will need to confirm whether that interpretation is technically correct.
Thanks for your answer. OK, brilliant, that is the crux of the matter. If a single Burrows-Wheeler transform is computed on the whole concatenated sequence, then yes, the BWA BWTSW indexer should work.
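For anyone who wants to see the idea concretely, here is a toy Python sketch of a single BWT computed over concatenated sequences, assuming the interpretation above is right. It is not BWA's or Bowtie's actual code: the function names (`concatenate`, `bwt`), the `$` terminator, and the example sequences are all made up for illustration, and the sorted-rotations approach only works for tiny inputs. The point is that the transform sees one text whose length is the sum of all sequences, while a separate offset table remembers where each sequence starts.

```python
def concatenate(seqs):
    """Join all sequences into one text and record each one's start offset."""
    offsets = {}
    parts = []
    pos = 0
    for name, seq in seqs.items():
        offsets[name] = pos      # where this sequence begins in the joined text
        parts.append(seq)
        pos += len(seq)
    # '$' is a hypothetical end-of-text sentinel that sorts before A, C, G, T
    return "".join(parts) + "$", offsets


def bwt(text):
    """Naive BWT: sort all rotations and take the last column (toy inputs only)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


# Hypothetical example: one "large" chromosome plus one small organelle genome
seqs = {"chr1": "ACGTAC", "chrM": "GTA"}
text, offsets = concatenate(seqs)
transformed = bwt(text)

print(len(text) - 1)   # total 'database' size seen by the transform
print(offsets)         # start position of each sequence in the joined text
print(transformed)
```

In this sketch the size that matters is the combined length, so a small organelle genome indexed alongside large chromosomes would not fall under a per-sequence minimum.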