Following up on my other post, I was wondering how the short-read mappers (bwa and bowtie2) handle the "-" strand of the reference when building the index. Short reads are always given in the 5'-->3' direction regardless of which strand they come from, yet the SAM output has reads mapped to both strands.
1) I assumed the "-" strand of the reference is concatenated to the "+" strand, so that the final length/index is simply doubled;
2) Or, each read has to be reverse-complemented at the mapping step, so that mapping is done twice.
The STAR mapper uses the first approach, as the author told me, but I could not figure this out from the source code of either bwa or bowtie2. Can somebody confirm this for me? Thanks.
I think bwa and bowtie index both the forward and reverse reference, and map once.
BBMap reverse-complements the read and does mapping twice.
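To make the two strategies concrete, here is a toy Python sketch. It uses a plain substring search in place of a real index, and the function names (`map_with_doubled_index`, `map_read_twice`) are made up for illustration. It is not how any of the mappers above are actually implemented; it only shows that both approaches can report the same strand and leftmost forward-strand position.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def find_all(text, pattern):
    """Return every start position of pattern in text, overlaps included."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def map_with_doubled_index(ref, read):
    """Strategy 1: search the read once against forward + reverse-complement reference."""
    n = len(ref)
    # "$" is a separator so that no hit can span the junction between the two halves.
    doubled = ref + "$" + reverse_complement(ref)
    results = []
    for pos in find_all(doubled, read):
        if pos + len(read) <= n:
            results.append(("+", pos))
        elif pos > n:
            # Convert a position in the reverse-complement half back to the
            # leftmost coordinate on the forward strand.
            rc_pos = pos - (n + 1)
            results.append(("-", n - rc_pos - len(read)))
    return sorted(results)


def map_read_twice(ref, read):
    """Strategy 2: search the forward reference with the read and with its reverse complement."""
    results = [("+", p) for p in find_all(ref, read)]
    results += [("-", p) for p in find_all(ref, reverse_complement(read))]
    return sorted(results)


ref = "AAACCCGGTA"
for read in ("CCCGG", "ACCG"):
    print(read, map_with_doubled_index(ref, read), map_read_twice(ref, read))
```

The trade-off in this toy version is the same one discussed in the thread: the first strategy doubles the searched text but looks each read up once, while the second keeps the reference at its original size and does two lookups per read.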
Thanks Brian!
Could you elaborate on the indexing procedure for the forward and reverse strands, technically? My guess is that the "-" strand is concatenated to the "+" strand to form a single string. But then, how are the offsets for each chromosome distinguished from each other (2 strands and multiple chromosomes)? I looked at the outputs of the bwa index step for the mouse genome, and there are
Are those files with "r" (*.rbwt, *.rpac, *.rsa) for the reverse strand? However, when I index a small genome such as lambda_virus, I do not see this pattern.
Does this mean that if the genome is big (mouse), there will be a separate index for the "-" strand of the genome?
From bowtie2-build, I have:
Are the *.rev.1.bt2 and *.rev.2.bt2 files for the reverse strand? But how come there is no *.rev.3.bt2 or *.rev.4.bt2?
Thanks a lot!
I have not looked at the code, so my assumption of how they operate was indeed based on the file extensions generated when indexing :) I'm not sure why the "reverse" ones are sometimes not present.
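On the question of how offsets could be kept apart with 2 strands and multiple chromosomes, here is a minimal Python sketch of one possible bookkeeping scheme. It assumes all chromosomes are concatenated into one forward string and the reverse complement of that whole string is appended after it. I have not confirmed that bwa or bowtie2 lay things out this way, and the names `build_concatenated_reference` and `resolve_hit` are hypothetical.

```python
from bisect import bisect_right


def build_concatenated_reference(chroms):
    """Concatenate (name, sequence) pairs; return the joined string, names, and start offsets."""
    names, starts, parts = [], [], []
    offset = 0
    for name, seq in chroms:
        names.append(name)
        starts.append(offset)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), names, starts


def resolve_hit(pos, length, names, starts, total_len):
    """Translate a hit in forward + reverse-complement text into (chrom, strand, 0-based pos).

    Returns None for hits that straddle the strand junction or a chromosome boundary.
    """
    if pos < total_len < pos + length:
        return None  # straddles the junction between the "+" and "-" halves
    if pos >= total_len:
        strand = "-"
        # Mirror the position back to the leftmost forward-strand coordinate.
        pos = 2 * total_len - pos - length
    else:
        strand = "+"
    # Binary search for the chromosome whose start offset is the last one <= pos.
    idx = bisect_right(starts, pos) - 1
    chrom_end = starts[idx + 1] if idx + 1 < len(starts) else total_len
    if pos + length > chrom_end:
        return None  # straddles two chromosomes
    return names[idx], strand, pos - starts[idx]


forward, names, starts = build_concatenated_reference(
    [("chr1", "ACGTAC"), ("chr2", "GGGTTTAA")]
)
total_len = len(forward)
print(resolve_hit(8, 3, names, starts, total_len))
print(resolve_hit(14, 3, names, starts, total_len))
```

With a table of start offsets like this, a single position in the doubled text is enough to recover chromosome, strand, and coordinate, so no separate per-strand or per-chromosome index would be needed under this assumed layout.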