Question

Strand handling of the reference sequence when building the index

0

7.8 years ago

yifangt86 ▴ 60

Following my other post, I was wondering how the short-read mappers (bwa and bowtie2) handle the "-" strand of the reference when building the index. Short reads are always given in the 5'-->3' direction, whichever strand they come from, yet the resulting SAM file contains reads mapped to both strands. I can think of two ways this could work:

1) The "-" strand of the reference is concatenated to the "+" strand, so that the final length of the indexed text is simply doubled;

2) Or, each read is reverse-complemented at the mapping step, so that mapping is done twice against a forward-only index.
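
To make the two options concrete, here is a minimal Python sketch. It uses plain substring search in place of a real FM-index, and the function names are made up for illustration; nothing here is taken from the bwa, bowtie2 or STAR source. It only shows that, for exact matches, both strategies report the same forward-strand coordinates.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def find_all(text, pattern):
    """Return every 0-based start position of pattern in text (overlaps included)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def map_with_doubled_reference(reference, read):
    """Option 1: search the read once against forward + reverse-complement text."""
    n = len(reference)
    # The "$" separator keeps a match from spanning the junction of the two strands.
    doubled = reference + "$" + reverse_complement(reference)
    results = []
    for pos in find_all(doubled, read):
        if pos < n:
            results.append((pos, "+"))
        else:
            # Convert a hit on the reverse-complement copy back to the
            # forward-strand coordinate of the leftmost aligned base.
            rc_pos = pos - (n + 1)
            results.append((n - rc_pos - len(read), "-"))
    return sorted(results)


def map_with_two_searches(reference, read):
    """Option 2: search the read and its reverse complement against forward text."""
    results = [(pos, "+") for pos in find_all(reference, read)]
    results += [(pos, "-") for pos in find_all(reference, reverse_complement(read))]
    return sorted(results)
```

With the toy reference `GATTACAGGC`, the read `TGTAA` (the reverse complement of `TTACA`) is reported at forward position 2 on the "-" strand by both functions. The trade-off is memory versus work per read: option 1 doubles the indexed text, option 2 doubles the number of searches.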

The STAR mapper uses the first approach, as its author told me, but I could not work out from the source code which one bwa or bowtie2 uses. Can somebody confirm this for me? Thanks.

strand handling reference index short read mapping • 1.6k views

ADD COMMENT • link 7.8 years ago by yifangt86 ▴ 60

1

I think bwa and bowtie index both the forward and reverse reference, and map once.

BBMap reverse-complements the read and does mapping twice.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

0

Thanks Brian!

Could you elaborate, technically, on how both the forward and reverse strands are indexed? My guess is that the "-" strand is concatenated to the "+" strand to form a single string. If so, how are the offsets of each chromosome distinguished from one another (two strands and multiple chromosomes)? Looking at the output of the bwa index step for the mouse genome, I see these files:

Mouse_genome.fa.amb
Mouse_genome.fa.ann
Mouse_genome.fa.bwt
Mouse_genome.fa.pac
Mouse_genome.fa.rbwt
Mouse_genome.fa.rpac
Mouse_genome.fa.rsa
Mouse_genome.fa.sa

Are the files whose extension starts with "r" (*.rbwt, *.rpac, *.rsa) for the reverse strand? However, when I index a small genome such as lambda_virus, I do not see this pattern:

lambda_virus_bwa.amb
lambda_virus_bwa.ann
lambda_virus_bwa.bwt
lambda_virus_bwa.pac
lambda_virus_bwa.sa

Does this mean that for a big genome (mouse) there is a separate index for the "-" strand?
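
On the offsets question, one way it could work in principle is sketched below in Python. This is a guess at a generic scheme, not the actual bwa or bowtie2 layout: it assumes all chromosomes are laid end to end into one forward text, that the reverse complement of that whole text is appended with no separator, and that a table of cumulative start offsets is kept alongside. The names `build_offsets` and `locate` are hypothetical.

```python
import bisect


def build_offsets(chrom_lengths):
    """Return names, cumulative start offsets and total length for chromosomes laid end to end."""
    names, starts, total = [], [], 0
    for name, length in chrom_lengths:
        names.append(name)
        starts.append(total)
        total += length
    return names, starts, total


def locate(pos, names, starts, total):
    """Translate a 0-based position in the doubled text to (chromosome, offset, strand)."""
    strand = "+"
    if pos >= total:
        # Position lies in the reverse-complement half: mirror it back
        # onto the forward text.
        pos = 2 * total - 1 - pos
        strand = "-"
    # Binary search for the last chromosome whose start is <= pos.
    i = bisect.bisect_right(starts, pos) - 1
    return names[i], pos - starts[i], strand
```

For example, with `chr1` of length 100 and `chr2` of length 50, position 120 in the doubled text translates to `("chr2", 20, "+")`, and position 150 (the first base of the reverse-complement half) translates to `("chr2", 49, "-")`.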

From bowtie2-build, I have:

lambda_virus.1.bt2
lambda_virus.2.bt2
lambda_virus.3.bt2
lambda_virus.4.bt2
lambda_virus.rev.1.bt2
lambda_virus.rev.2.bt2

Are *.rev.1.bt2 and *.rev.2.bt2 for the reverse strand? If so, why is there no *.rev.3.bt2 or *.rev.4.bt2?

Thanks a lot!

ADD REPLY • link 7.8 years ago by yifangt86 ▴ 60

0

I have not looked at the code, so my assumption about how they operate was indeed based on the file extensions generated when indexing :) I'm not sure why the "reverse" files are sometimes not present.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k