Question

Taking into account antinsense reads when analyzing stranded Illumina data with DESeq2

0

Entering edit mode

8.3 years ago

snamjoshi87 ▴ 40

I have just finished running count commands using HTSeq-Count and am about to begin DGE analysis with DESeq2 using the vignette as a guide. However, I found this blog post online which indicates that there may be issues if you do not take into account the fact that Illumina data produces antisense reads.

Having read the blog post a few times, I'm still a little confused. It seems that there is now an argument to the summarizeOverlap() function preprocess.reads=invertStrand that allows you to take strandedness into account. Is this all I need to alter for my data? Or do I need to invert the strand using GenomicRanges? Outside of this, I did not take the strandedness into account with my alignment or counts so I am concerned that my raw count data may also be incorrect.

Here is what I ran for Tophat2 alignment:

tophat2 -G <GTF file> -p 4 -o <output folder> <index> <FASTQ file>

And here is what I ran for HTSeq-Count:

python -m HTSeq.scripts.count -s yes -a 10 <name sorted SAM> <GTF file> > <output>

As you can see, I did not take into account the library type for my tophat2 run. For HTSeq-Count I specified it was stranded but not to reverse it. So I really am not sure where this leaves me...

According the facility that did the sequencing, they used the TruSeq Stranded Total RNA kit. This experiment is single-read from mouse neural tissue (RIPSeq). Let me know if any more information is needed.

In summary, my questions:

Is my count data okay as is?
What do I need to do in DESeq2 to correct for this antisense issue?

Versions: Tophat 2.1.1 with Bowtie 2.2.9 and Samtools 0.1.19; HTSeq 0.6.1

RNA-Seq DESeq2 • 3.2k views

ADD COMMENT • link updated 8.3 years ago by Devon Ryan 104k • written 8.3 years ago by snamjoshi87 ▴ 40

score 3 · Accepted Answer · 2016-07-25

3

Entering edit mode

8.3 years ago

Devon Ryan 104k

The blog post you linked to seems to have been written by someone without much experience in RNAseq (while the DESeq2 tutorial demonstrates how to do counting in R, I don't know of many people who would actually do that (if they did, they'd use subRead anyway)). I can assure you that the DESeq2 authors have plenty of experience with both stranded and unstranded datasets and you don't need to do anything in DESeq2 for dealing with strand-specific datasets (you're just giving it counts, it doesn't matter what the underlying library type is at that point). Note that with anything recent, you'll want -s reverse with htseq-count rather than -s yes. This is for historical reasons (early on the common stranded protocols worked different).

Regarding tophat2, you might get slightly higher alignment rates if you tell it your data is stranded. You're unlikely to get completely different results by not telling the aligner this, you'll just have a bit higher false negative rate.

ADD COMMENT • link 8.3 years ago by Devon Ryan 104k

0

Entering edit mode

By using the command -s reverse I got a lot more counts than what I originally had. The strandedness issue is very confusing to me but I think I figured it out after digging through old forum posts. Thanks for your help.

ADD REPLY • link 8.3 years ago by snamjoshi87 ▴ 40

0

Entering edit mode

Hi

I am new on this, and reading over the forums, I have understood that the -s option in Htseq depends on the protocol. So, if I have files defined as strand-specific in the protocol, for me the logic selection is: -s="yes". So, I am wondering why you insist on try -s="reverse" with htseq-count rather than -s="yes" for recent sequencing data... (and what means recent?..5,5, yrs)... So, I am in blank... I do not have clear why to use -s reverse... & I do not understand what do you mean with ... "...for historical reasons.."... Please any help will be appreciate!

Thanks in advance Cynthia SC

ADD REPLY • link 5.4 years ago by cyntsc • 0