Question

Viral positive and negative strand with paired sequencing and bowtie

0

Entering edit mode

19 months ago

Lluís R. ★ 1.2k

I'm trying to reproduce a figure of a paper were they sequence RNA and they map it to a virus and a host. They did this using single-end RNA and bowtie and present a plot with the ratio between virus RNA and host+virus mapped RNA for both positive and negative strand.

In my experiment I have paired-end RNA sequencing data, I map it using bowtie as this is what they used but I am not finding the right way to retrieve the positive and negative mapped reads.

The command used (via Rbowtie) is:

bowtie \
       ~/data/genome/<org>/ref/ref \
       -1 data/S1_R1.fastq.gz \
       -2 data/S1_R2.fastq.gz \
       -v 2 \
       --best \
       --sam  \
       --no-unal \
       -p 20 \
       -fr \
       -rf \
       out/S1.sam

Later I create the bam file and use samtools to count them:

samtools view  -b -F 4 -@ 5 out/S1.sam > out/S1.bam
samtools view -c -F 4 -f 16 -@ 5 out/S1.bam # Reverse reads
samtools view -c -F 4 -@ 5 out/S1.bam #All reads

So I expected that all reads - reverse reads == forward reads, but always (no matter the options I use in bowtie) I have always the same forward and reverse (mapped) reads, which I think is due to using paired end sequencing. I doubt it is the case, because the virus is introduced as - stranded and when replicating it produces the + strand, so I would expect that both numbers would be different.

I am considering using a different tool if bowtie doesn't work but I would appreciate if someone could help me obtaining the forward and reverse strand.

Thanks

bowtie • 1.2k views

ADD COMMENT • link 18 months ago by Lluís R. ★ 1.2k

score 0 · Answer 1 · 2023-10-16

0

Entering edit mode

18 months ago

Brian Bushnell 20k

You should expect the same number of forward and reverse strand reads because read 1 and read 2 are on opposite strands. What would be more interesting here is to first split the mapped file into r1 and r2, then split THOSE files into forward and reverse, then combine R1 forward with R2 reverse and R1 reverse with R2 forward to result in two files.

You could alternately use BBTools and split the fastq before alignment:

Flip the virus:

reformat.sh in=virus.fa out=reverse.fa rcomp

Flip the mate:

reformat.sh in=S1_R#.fastq.gz out=flipped.fq.gz rcompmate

Kmer-match to the forward strand of the virus to split the reads into piles:

seal.sh in=flipped.fq.gz pattern=%.fq.gz ref=virus.fa,reverse.fa k=31 mm=f rcomp=f

Flip the mates again to restore ordinary pairing:

reformat.sh in=virus.fq.gz out=forwardPairs.fq.gz rcompmate
reformat.sh in=reverse.fq.gz out=reversePairs.fq.gz rcompmate

In fact you can do this all with BBDuk in just one line since it has a "skipr2" flag to ignore read 2:

bbduk.sh in=S1_R#.fastq.gz outm=forwardPairs.fq.gz outu=nonforwardPairs.fq.gz k=50 mm=f ref=virus.fa skipr2

...which is the easiest solution, BUT could misplace a few reads if the virus has a copy of a 50-mer on both strands. That's unlikely for small viruses though.

ADD COMMENT • link 18 months ago by Brian Bushnell 20k

0

Entering edit mode

Many thanks for answering. I have some follow-up questions. You recommend at the beginning to use the mapped reads, so at the end I would need to add up the Forward of the R1 and R2 and the Reverse of R1 and R2. However, in your code you do a kmer search and flip the virus, why? It looks that what you propose would double check if the alignment is correct by pairing the F and R of R1 and R2. Isn't this equivalent to sam flag 2 for "the read mapped in proper pair"? The Virus is SARS-CoV-2, which I think is relatively short but I haven't checked for repeated 50-mers.
For myself and future readers, to split a .bam file to get R1 and R2 reads there is this question. Although, it is not clear to me that the "the first segment in the template" and "the last segment in the template" means forward and reverse.

ADD REPLY • link 18 months ago by Lluís R. ★ 1.2k

1

Entering edit mode

Although, it is not clear to me that the "the first segment in the template" and "the last segment in the template" means forward and reverse. That's not what they mean. In fact, there's another bit in the flag that signifies "this read was reverse-complemented". Rather, on the Illumina platform, normal libraries have a DNA fragment that is sequenced from both ends, generating read 1 and read 2, which are read from opposite sides of the fragment. So read 1 and read 2 map to opposite strands.

Covid is short and doesn't have any inverted repeats like that. But the reason for my code is that I like to keep paired reads together, which it guarantees, and is harder to do once you have aligned files that you filter for flag bits. You can still accomplish everything using the flag bits, of course, but then some pairs may end up split into different files which is annoying.

Anyway, if you want to use the flag bits, remember that R1 and R2 will map to opposite strands. So you combine forward-mapped R1 with reverse-mapped R2, not forward-mapped R1 with forward-mapped R2.

ADD REPLY • link 18 months ago by Brian Bushnell 20k

0

Entering edit mode

Many thanks! I just need to count it so I don't mind much about the specific reads.

ADD REPLY • link 18 months ago by Lluís R. ★ 1.2k