I'm trying to reproduce a figure of a paper were they sequence RNA and they map it to a virus and a host. They did this using single-end RNA and bowtie and present a plot with the ratio between virus RNA and host+virus mapped RNA for both positive and negative strand.
In my experiment I have paired-end RNA sequencing data, I map it using bowtie as this is what they used but I am not finding the right way to retrieve the positive and negative mapped reads.
The command used (via Rbowtie) is:
bowtie \
~/data/genome/<org>/ref/ref \
-1 data/S1_R1.fastq.gz \
-2 data/S1_R2.fastq.gz \
-v 2 \
--best \
--sam \
--no-unal \
-p 20 \
-fr \
-rf \
out/S1.sam
Later I create the bam file and use samtools to count them:
samtools view -b -F 4 -@ 5 out/S1.sam > out/S1.bam
samtools view -c -F 4 -f 16 -@ 5 out/S1.bam # Reverse reads
samtools view -c -F 4 -@ 5 out/S1.bam #All reads
So I expected that all reads - reverse reads == forward reads
, but always (no matter the options I use in bowtie) I have always the same forward and reverse (mapped) reads, which I think is due to using paired end sequencing. I doubt it is the case, because the virus is introduced as - stranded and when replicating it produces the + strand, so I would expect that both numbers would be different.
I am considering using a different tool if bowtie doesn't work but I would appreciate if someone could help me obtaining the forward and reverse strand.
Thanks
Many thanks for answering. I have some follow-up questions. You recommend at the beginning to use the mapped reads, so at the end I would need to add up the Forward of the R1 and R2 and the Reverse of R1 and R2. However, in your code you do a kmer search and flip the virus, why? It looks that what you propose would double check if the alignment is correct by pairing the F and R of R1 and R2. Isn't this equivalent to sam flag 2 for "the read mapped in proper pair"? The Virus is SARS-CoV-2, which I think is relatively short but I haven't checked for repeated 50-mers.
For myself and future readers, to split a .bam file to get R1 and R2 reads there is this question. Although, it is not clear to me that the "the first segment in the template" and "the last segment in the template" means forward and reverse.
Covid is short and doesn't have any inverted repeats like that. But the reason for my code is that I like to keep paired reads together, which it guarantees, and is harder to do once you have aligned files that you filter for flag bits. You can still accomplish everything using the flag bits, of course, but then some pairs may end up split into different files which is annoying.
Anyway, if you want to use the flag bits, remember that R1 and R2 will map to opposite strands. So you combine forward-mapped R1 with reverse-mapped R2, not forward-mapped R1 with forward-mapped R2.
Many thanks! I just need to count it so I don't mind much about the specific reads.