Question

Antisense transcription from strand specific RNA-seq

3

Entering edit mode

6.5 years ago

sorrymouse ▴ 120

I have a number of libraries prepared with the Illumina Tru-seq strand specific library prep kit. I want to detect sense and antisense transcription of each gene.

I mapped the reads using hisat2 with the --rna-strandedness RF option.

The recommendations I can find for separating the bam files by strand, so that I can essentially have a sense .bam and an antisense .bam, are to use samtools -F 0x10 and -f 0x10, or similar flags using 16 and 32.

Whatever method I take, I get the same general output, but I can't shake this nagging feeling that I am not actually separating them based on sense and antisense transcription but rather some aspect of the library that is sense + something and antisense+ something.

I know that down the line I will need to check into the directionality of each gene, and that just separating by sense and antisense will include genes in the opposite orientation in each file.

But to begin with, how would one go about examining sense and antisense transcription by separating the .bam files?

I know this is not the first post on this, but many posts do not seem to elaborate whether they are interested only in separating the files by strand or sense and antisense transcription, and I don't think they are entirely the same thing.

RNA-Seq • 8.2k views

ADD COMMENT • link 6.5 years ago by sorrymouse ▴ 120

1

Entering edit mode

Hi,

According to the Hisat manual, the option just add the XS-tag to each alignment. To check the sence/antisense mappings, you might want to look at RSeQC infer experiment. There the read-orientation is analysed in regard of the transcripts' orientation.

ADD REPLY • link 6.5 years ago by michael.ante ★ 4.0k

0

Entering edit mode

I agree that there is probably a simpler way to do this rather than splitting the files, but it was reassuring to make sure at each step that what should be happening was happening, by checking in IGV.

I will say that running stringtie separately on the negative and positive reads increases the number of predicted features by A LOT and I don't have any sense of the 'rightness' of this.

ADD REPLY • link 6.5 years ago by sorrymouse ▴ 120

h.mon · Answer 1 · 2018-11-08

What I ended up doing is somewhat redundant I think, but its as follows:

map the reads with hisat2

./hisat2 --phred33 --rna-strandness RF -x reference.fa -1 reads1.fq -2 reads2.fq -S out.sam

convert and sort with samtools

samtools view -Su out.sam | samtools sort - out.sorted

first read in pair maps to reverse strand, read is mapped in proper pair, read is paired

samtools view -f 83 -b out.sorted.bam > out.sorted.83.bam
samtools view -f 163 -b out.bam > out.sorted.163.bam

second read in pair maps to reverse strand, read is paired, read mapped to proper pair

samtools view -f 99 -b out.sorted.bam > out.sorted.99.bam
samtools view -f 147 -b out.bam > out.sorted.147.bam

merge files

samtools merge out.sorted.83.163.bam out.sorted.83.bam out.sorted.163.bam
samtools merge out.sorted.99.147.bam out.sorted.99.bam out.sorted.147.bam

stringtie to create a gtf reference - this is a non model organism

stringtie out.sorted.99.147.bam -o out.99.147.stringtie.gtf -p 3 --rf -m 50
stringtie out.sorted.83.163.bam -o out.83.163.stringtie.gtf -p 3 --rf -m 50
stringtie merge out.83.163.stringtie.gtf out.99.147.stringtie.gtf -o stringtie_merged.gtf

extract features that are on the + or - strand from stringtie

cat stringtie.merged.gtf | awk '$7 == "+" { print $0 }' > forward_cds.gff
cat stringtie.merged.gtf | awk '$7 == "-" { print $0 }' > reverse_cds.gff

separate out reads mapping to positive and negative features from the stranded read files to get sense and antisense transcription

bedtools intersect -b forward_cds.gff -abam out.sorted.83.163.bam > out.sorted.83.163.sense.bam
bedtools intersect -b reverse_cds.gff -abam out.sorted.83.163.bam > out.sorted.83.163.antisense.bam
bedtools intersect -b forward_cds.gff -abam out.sorted.99.147.bam > out.sorted.99.147.antisense.bam
bedtools intersect -b reverse_cds.gff -abam out.sorted.99.147.bam > out.sorted.99.147.sense.bam

merge the two sense and antisense files

samtools merge out_flags_sense.bam out.sorted.83.163.sense.bam out.sorted.99.147.sense.bam
samtools merge out_flags_antisense.bam out.sorted.83.163.antisense.bam out.99.147.antisense.bam

those bam files can be used in bedtools coverage, or I put them back into stringtie so that they can input into Deseq2

stringtie -eB -G stringtie_merged.gtf out_hisat_flags_sense.bam -o out_flags_sense.gtf
stringtie -eB -G stringtie_merged.gtf out_hisat_flags_antisense.bam -o out_flags_antisense.gtf

If I'm doing this wrong, I sure would love to know!! If anyone sees a flaw with this code, please point it out to me.

score 2 · Answer 2 · 2018-11-08

samtools flags for strand / reverse strand refers to the positive / negative strand of the reference genome / transcriptome used to build the index. So if you mapped to a reference genome, these flags says nothing about sense / antisense transcription. However, if you used a "correctly stranded" transcriptome as reference, then you can use sam flags to filter sense / antisense reads.

As michael.ante said, the hisat2 --rna-strandedness RF option adds XS attribute to each read, then if you want to filter sense / antisense mapped reads, you can use this XS attribute. You could also use bedtools intersect with -s or -S.

score 2 · Answer 3 · 2018-11-09

2

Entering edit mode

6.5 years ago

Carlo Yague 9.0k

I want to detect sense and antisense transcription of each gene.

To do that, you don't actually have to split your bam into fwd and reverse and I think that what you are trying to do in your answer is overly complicated. An approach I used with success is to simply count the reads mapping in forward and reverse of the annotated genes with featurecounts on the original bam file. You simply have to change the "-s" option the either count the reads that map sense or anti-sense of your genes. Note that featurecounts can handle paire-end data smartly and that the count data can be directly inputed into DESeq2.

ADD COMMENT • link 6.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

This is a good point - but this is more of a final solution that one aspires to once they get a firm understanding of the processes that take place. I will make this into a recipe as well.

I will say that in general one should start out by splitting the files, visualizing and understanding what each data contains - so that they are confident about what tools do. It is way too easy to slip up and misspell -s or -S or even a sam flag, and ending up with neither sense nor antisense but just "nonsense" (pun intended :-) counts that later are more difficult to identify.

ADD REPLY • link 6.5 years ago by Istvan Albert 102k

score 1 · Answer 4 · 2018-11-08

1

Entering edit mode

6.5 years ago

Istvan Albert 102k

I am creating a recipe called "Anti Sense Transcripts" as a teaching aid:

https://www.bioinformatics.recipes/recipe/view/antisense/

As of today (November 8, 2018) I am still exploring the best way to make the point but I think it is already useful.

Look at the code for the guidance of what each step does. Also the results will allow you to visualize the outputs that have already been created.

ADD COMMENT • link 6.5 years ago by Istvan Albert 102k

0

Entering edit mode

Your recipe for anti sense transcripts really helped me lot. But I still have some confusions. When you use bedtools to splite the antisense read, you get antisense reads exactly antisense to the sense transcripts. But it might have missed some reads that are in the gap of two sense transcripts(no need to be in the same direction). These reads are meaningful since not all antisense transcripts are exactly antisense to the sense transcripts. so do we need to obtain these gap reads in our antisense identifying pipeline?

ADD REPLY • link 6.1 years ago by keleyep • 0

0

Entering edit mode

that is correct, this recipe only deals with reads that do intersect with the CDS features. But you can generalize it by taking the intersect not with CDS but say gene coordinates (using a different feature file). In that case, all reads would be included.

ADD REPLY • link 6.1 years ago by Istvan Albert 102k

0

Entering edit mode

I really appreciate your reply for these antisense problem. So if I want to find the real antisense transctripts. I would include these reads. I tried bedtools intersect - v parameter to get reads that not overlap with forward CDS feature or reverse CDS feature. Is it the the right reads that I should include? It seems that it has some redundant with antisense reads. I don't know it for sure.

Another question is that I am dealing with some paired end read so there are some difference for SAM FLAGS. here are my codes for paired ends reads.

samtools view -b -F 4 -F 99 -F 147 ${LCA}/forward_overlap.bam > ${LCA}/forward_antisense.bam samtools view -b -F 4 -f 83 -f 163 ${LCA}/reverse_overlap.bam > ${LCA}/reverse_antisense.bam

is it correct for antisense?

ADD REPLY • link 6.1 years ago by keleyep • 0