Question

Tutorial:How To Separate Illumina Based Strand Specific Rna-Seq Alignments By Strand

73

Entering edit mode

10.8 years ago

Istvan Albert 102k

Today we have run into the task of having to split strand specific RNA-Seq data by strand and we had to make an effort to get it right (hopefully it is right). Maybe there is even an easier way to do it.

The Illumina strand specific protocol is such that only the fragment of the original orientation is sequenced. But as it happens the reverse complement of it will actually be sequenced and is placed into file R1. The mate pair that goes into file R2 will be the reverse complement of the fragment starting in R1 and thus its orientation will correspond to the original orientation. This makes the protocol produce a R2-R1 read orientation.

Don't forget to perform a sanity check. The total coverage should be equal to the sum of coverages for the split data.

Here is our script that splits a bam file into two fwd.bam and rev.bam each containing the alignments that are in the transcriptional direction. Run int as:

sh split.sh input.bam

Shell script using samtools:

set -ue

# Get the bam file from the command line
DATA=$1

# Forward strand.
#
# 1. alignments of the second in pair if they map to the forward strand
# 2. alignments of the first in pair if they map to the reverse  strand
#
samtools view -b -f 128 -F 16 $DATA > fwd1.bam
samtools index fwd1.bam

samtools view -b -f 80 $DATA > fwd2.bam
samtools index fwd2.bam

#
# Combine alignments that originate on the forward strand.
#
samtools merge -f fwd.bam fwd1.bam fwd2.bam
samtools index fwd.bam

# Reverse strand
#
# 1. alignments of the second in pair if they map to the reverse strand
# 2. alignments of the first in pair if they map to the forward strand
#
samtools view -b -f 144 $DATA > rev1.bam
samtools index rev1.bam

samtools view -b -f 64 -F 16 $DATA > rev2.bam
samtools index rev2.bam

#
# Combine alignments that originate on the reverse strand.
#
samtools merge -f rev.bam rev1.bam rev2.bam
samtools index rev.bam

If keeping full pairs is essential then the filtering steps need to be performed to operate on mates like so:

#
# 1. alignments of the second in pair if they map to the forward strand
# 2. alignments of the first in pair if their mate maps to the forward strand
#
samtools view -b -f 128 -F 16 $DATA > fwd1.bam
samtools index fwd1.bam

samtools view -b -f 64 -F 32 $DATA > fwd2.bam
samtools index fwd2.bam

and similarly:

# 1. alignments of the second in pair if it maps to the reverse strand
# 2. alignments of the first in pair if their mates map to the reverse strand
#
samtools view -b -f 144 $DATA > rev1.bam
samtools index rev1.bam

samtools view -b -f 96 $DATA > rev2.bam
samtools index rev2.bam

RNA-seq • 29k views

ADD COMMENT • link updated 12 months ago by snardeli • 0 • written 10.8 years ago by Istvan Albert 102k

2

Entering edit mode

Hi Istvan Albert! Just to tell u a big THANK YOU for the amazing script. It works wonders :) Looked around everywhere for this..until I read yours. The antisense reads obtained now also correlate to the RSeQC infer_experiment.py script that determines strandedness. so thats a good proof check too. Thanks!

ADD REPLY • link 8.2 years ago by Nalini ▴ 20

0

Entering edit mode

glad to hear it helped - it is one of those things that should not be this complicated.

ADD REPLY • link 8.2 years ago by Istvan Albert 102k

0

Entering edit mode

Hi Albert, I wanted to check with you if this script for seperating forward and reverse will also work on RNA sequencing done using dUTP method for strand specificity? Or is it only when we use different adaptors as means for strand specificity (like the one described in the ScriptSeq protocol)? Thanks!

ADD REPLY • link 8.2 years ago by Nalini ▴ 20

0

Entering edit mode

I am not sufficiently familiar with the alternative library prep for various protocols. The separation in the script above applies only when the the second in pair (read 2) is oriented in the original transcript direction.

When the 1 st pair is oriented in the original transcript direction the flags need to be changed. The logic is the same only you need to select the exact opposite.

ADD REPLY • link 8.2 years ago by Istvan Albert 102k

0

Entering edit mode

Hey..ur script worked for my data where the protocol I had used for RNAseq was called scriptseq - which uses adaptors. Is that what you mean by "when the the second in pair (read 2) is oriented in the original transcript direction."? Sorry, am a little confused now!

ADD REPLY • link 8.2 years ago by Nalini ▴ 20

0

Entering edit mode

Hello! As mentioned above, the strand specific script worked really well and I have used that for my work. I am happy to inform you that we are writing a paper on the data analysis soon, so wanted to check with you how should I cite you? As in do you have this script published any where or could I mark this URL in references? Thanks :)

ADD REPLY • link 7.3 years ago by Nalini ▴ 20

1

Entering edit mode

That won't be necessary - think of it as a book that you've learned certain basic skills from rather than a source of a scientific insight.

ADD REPLY • link 7.3 years ago by Istvan Albert 102k

0

Entering edit mode

Hi Istavan, was looking at your script and wondering if it will account for the second read and change its orientation to map on the same strand as the first read so as to get the complete directional fidelity of the illumina strand-specific RNA-seq data with pair-end reads? Maybe some background to my question is that, i extracted stranded reads from my bam files, but it appeared that i was getting only picking 2 of my pair-end reads as single reads on each strand thus ended up with almost the same coverage plot for both strands..any suggestions on how to best get a true representation of each strand for my illumina pair-end reads?

Thanx

ADD REPLY • link updated 10.6 years ago by Istvan Albert 102k • written 10.6 years ago by stan ▴ 80

1

Entering edit mode

what gets sequenced is a DNA that comes from a template so you will necessarily get (almost) equal amounts of reads for both strands. What the strand specific protocol tells you is what was the original orientation of the template. The read that comes from file 2 tells the story of what was the original orientation and the data in file 1 is the reverse complement of the other end of the template.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Istvan Albert 102k

0

Entering edit mode

Thanx Istvan, that really clarified a lot of things for me

ADD REPLY • link 10.6 years ago by stan ▴ 80

0

Entering edit mode

Hey Istavan

I was working on this script. I divided my reads inot fwd and reverse strand reads using your code. When I visualized it using IGV, the reads of the reverse strand (i.e from file rev.bam) had an orientation F1R2 and the reads from the forward strand (fwd.bam) had an orientation F2R1. But some literature say that F2R1 is created from the antisense strand and F1R2 is created from sense strand. Can you please clear this confusion

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.1 years ago by AriyurD • 0

0

Entering edit mode

it depends on the library preparation step. Over the course of the library prep there are typically many steps of duplicating and degrading the single stranded molecules. Depending on how many steps there are the sequenced product may be the original sequence or the reverse complement of it and thus the actual 5' end will be placed into file1 or file2. The typical Illumina protocol is the one described above. Here are more details:

Comprehensive comparative analysis of strand-specific RNA sequencing methods, Nature Methods, 2010

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.1 years ago by Istvan Albert 102k

0

Entering edit mode

Nice, this is what I want. by the way, it is fit for dUTP strand-specific RNA-seq.

Here is another problem when I create Bedgraph files using these BAM files:

(1) what if the length of reads (read1 + read2) shorter than the insert seq? what is your choice for the gap that not sequenced that will not shown in the bedgraph?

(2) what if the length of reads (read1 + read2) longer that the insert seq? the overlap region of the paired reads will double counted, is that OK?

ADD REPLY • link 9.5 years ago by wm ▴ 570

1

Entering edit mode

in general it is not advisable to ask a new question in a comment because most people won't notice that question. When you have paired end reads and you count them you'd be always be double counting regardless of the insert size -> one fragment leads to two counts. If the coverage is reasonable uniform the size of the insert should not matter as it should uniformly cover all regions.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Istvan Albert 102k

0

Entering edit mode

Hi just found this script and wanted to say thanks. It works fine and saved me a lot of trouble :)

ADD REPLY • link 9.3 years ago by kenietz78 • 0

0

Entering edit mode

Can it be used for alignment file generated from stranded single-end read data (ion proton)?

ADD REPLY • link 8.7 years ago by fizer ▴ 30

0

Entering edit mode

I am not entirely sure how the protocol works for ion-proton. It would only be applicable if the sequenced read comes from the reverse complement of the transcript.

ADD REPLY • link 8.7 years ago by Istvan Albert 102k

0

Entering edit mode

Wondering if this or a similar approach can be taken for lectures coming from a stranded (dUTP) library..

ADD REPLY • link 7.0 years ago by Antonio R. Franco ★ 5.2k

0

Entering edit mode

Hi Istvan. I am facing a task of separating sense from antisense transcripts and this post looks very helpful. Could you please clarify for me whether the forward and reverse strands are relative to a chromosome or they actually means sense and antisense strands? I am getting confused..Thank you very much and I look forward to your reply!

ADD REPLY • link 6.9 years ago by tangj • 0

0

Entering edit mode

Hi Istvan Albert ! I have the same question. Any inputs on this? Thanks!

ADD REPLY • link 12 months ago by snardeli • 0

0

Entering edit mode

Hi, I'm new to bioinformatics so I'm trying to understand how the SAM flags work in this script. My original SAM file has flags: 99, 147, 163, and 83. But this script uses flags: 128, 80, 144, and 64. Can someone help me understand the difference between these flags? Would this script still separate my BAM file into forward and reverse strands with the flags it lists or should I edit the script to reflect my flags (99, 147, 163, 83)? Thank you for any information and help you can provide about these questions so I can understand this better!

ADD REPLY • link 3.9 years ago by GA • 0

0

Entering edit mode

Any odd flag has the byte 1 on because paired end reads were used in the mapping

Obviously any even flag number means you were mapping single reads

ADD REPLY • link 3.9 years ago by Antonio R. Franco ★ 5.2k

0

Entering edit mode

Thanks much! It helped!

ADD REPLY • link 2.3 years ago by upasana ▴ 20