Question

Low Number Of Replicates Deseq

10.7 years ago

federico.gaiti ▴ 70

Hi all,

I am using DESeq for DGE analysis.

I have STRANDED RNA-Seq data for 4 developmental stages with no replicates. Since a reliable DGE analysis needs replicates, I obtained (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.

Before running the DGE analysis, I wanted to test the correlation between these samples, just to show that similar samples “cluster” together. If they do, I can then use the unstranded data in my DGE analysis to have more replicates per stage.

I mapped the raw reads to the genome using TopHat, sorted the BAM files by name and used htseq-count to get the raw read counts for both datasets. For the stranded data I used the option -s yes and for the unstranded data I used -s no.

I used DESeq to include the metadata and to normalize the counts, and I removed the genes that have a count of 0 in every sample. I then calculated the correlation, which was really low.
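For readers who want to see what this filtering and normalization step amounts to, here is a minimal pure-Python sketch of dropping all-zero genes and estimating size factors by the median-of-ratios idea that DESeq uses. It is an illustration only, not the package's code; the function names and the toy counts are made up for the example.

```python
import math
from statistics import median

def drop_all_zero(counts):
    """Keep only genes with at least one non-zero count across samples."""
    return {g: row for g, row in counts.items() if any(c > 0 for c in row)}

def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    n_samples = len(next(iter(counts.values())))
    # The geometric mean is only usable for genes with no zero counts.
    log_geo_means = {
        g: sum(math.log(c) for c in row) / n_samples
        for g, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean, on the log scale.
        log_ratios = [math.log(counts[g][j]) - m for g, m in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}

# Toy data (hypothetical): three samples that differ only in sequencing depth.
counts = {
    "geneA": [10, 20, 40],
    "geneB": [100, 200, 400],
    "geneC": [5, 10, 20],
    "geneD": [0, 0, 0],
}
filtered = drop_all_zero(counts)
factors = size_factors(filtered)
normalized = normalize(filtered, factors)
```

In this toy case the samples differ only by depth, so after normalization every gene has the same value in all three samples.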

I then tried to use htseq-count with the option -s reverse for the stranded data and still got really low correlation.

So I reran htseq-count on the stranded data with the option -s no, and this way I got a very similar number of total counts for the unstranded and stranded data (in both earlier attempts the stranded totals were roughly double). I then included the metadata, estimated the new size factors, normalized and calculated the new correlation. Both Pearson and Spearman correlations were high, which was confirmed by both a PCA and a correlogram.
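As a rough illustration of the sample-to-sample comparison described here, the sketch below computes Pearson and Spearman correlations between two vectors of normalized counts using only the standard library. The log2(x + 1) transform before Pearson is my assumption (a common choice for count data), and the two toy vectors are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def log2p1(values):
    """log2(x + 1), so that zero counts stay defined."""
    return [math.log2(v + 1) for v in values]

# Toy normalized counts for the same five genes in two samples (hypothetical).
stranded = [0, 12, 150, 980, 33]
unstranded = [1, 15, 140, 1100, 29]

r_pearson = pearson(log2p1(stranded), log2p1(unstranded))
r_spearman = spearman(stranded, unstranded)
```

Spearman only looks at the ordering of genes, so it is less sensitive to a handful of very highly expressed genes than Pearson on raw counts.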

However, I'd still like to figure out a way to use the stranded counts. I am not sure whether I lose some information by running htseq-count with -s no on the stranded data.

What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not.

I would like to hear from you if you have any thoughts about this.

Let me know if you need more information to better understand the issue.

Thanks a lot,
Federico

deseq r variation replicates • 3.8k views

ADD COMMENT • link updated 10.7 years ago by Michele Busby ★ 2.2k • written 10.7 years ago by federico.gaiti ▴ 70

score 1 · Answer 1 · 2014-02-24

10.7 years ago

Nicolas Rosewick 11k

In my view, the best approach is:

Count the stranded data with -s yes (or -s reverse, depending on your library type).
Count the unstranded data with -s no.
In DESeq, write an experimental design data frame like this:

Sample  Condition  LibType
A       condX      stranded
B       condY      stranded
C       condZ      stranded
D       condX      unstranded
E       condY      unstranded
F       condZ      unstranded
G       condX      unstranded

Then follow section 4 of the DESeq vignette (http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf), which covers multi-factor designs.
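To make concrete what a two-factor design with Condition and LibType encodes, here is a small pure-Python sketch that turns the sample table into a 0/1 model matrix with one reference level per factor. This is only an illustration of the idea, not DESeq's own code; the function name, the column naming scheme, and the alphabetical choice of reference levels are assumptions.

```python
# Sample table from the answer: (sample, condition, library type).
samples = [
    ("A", "condX", "stranded"),
    ("B", "condY", "stranded"),
    ("C", "condZ", "stranded"),
    ("D", "condX", "unstranded"),
    ("E", "condY", "unstranded"),
    ("F", "condZ", "unstranded"),
    ("G", "condX", "unstranded"),
]

def model_matrix(samples):
    """Build an intercept + LibType + Condition indicator matrix.

    The first level of each factor (alphabetically) is the reference
    and gets no column of its own.
    """
    conditions = sorted({cond for _, cond, _ in samples})
    lib_types = sorted({lib for _, _, lib in samples})
    columns = (
        ["Intercept"]
        + [f"LibType_{lib}" for lib in lib_types[1:]]
        + [f"Condition_{cond}" for cond in conditions[1:]]
    )
    rows = {}
    for name, cond, lib in samples:
        rows[name] = (
            [1]
            + [int(lib == level) for level in lib_types[1:]]
            + [int(cond == level) for level in conditions[1:]]
        )
    return columns, rows

columns, rows = model_matrix(samples)
for name, row in rows.items():
    print(name, dict(zip(columns, row)))
```

The LibType column lets the model absorb a systematic stranded versus unstranded shift, so that the Condition columns are estimated from differences within each library type.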


Thanks. I'll give it a try and I'll compare it with the approach that jurgnjin suggested as well.

ADD REPLY • link 10.7 years ago by federico.gaiti ▴ 70

score 0 · Answer 2 · 2014-02-23

There are heaps of small RNAs located on the opposite strand of protein-coding genes (in the literature: "antisense transcription"). Hence, the discrepancy between the stranded and unstranded expression estimates could be a real biological effect.

You can check this hypothesis by looking at the stranded alignments for a few individual genes with large discrepancies between the stranded and unstranded expression estimates, and checking whether the unstranded coverage conforms to the splicing structure of the protein-coding gene (it shouldn't).
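One possible way to pick candidate genes for that manual check is sketched below: given per-gene counts of the stranded library on the sense strand and on the antisense strand (for instance from the -s yes and -s reverse runs; which of the two is "sense" depends on the library protocol), flag genes where a large share of reads is antisense. The function names, the thresholds, and the toy counts are all assumptions for illustration.

```python
def antisense_fraction(sense, antisense):
    """Share of a gene's reads that map to the antisense strand."""
    total = sense + antisense
    return antisense / total if total else 0.0

def flag_genes(sense_counts, antisense_counts, min_fraction=0.3, min_reads=20):
    """Return genes with enough reads and a high antisense share.

    min_fraction and min_reads are arbitrary illustrative thresholds.
    """
    flagged = []
    for gene, s in sense_counts.items():
        a = antisense_counts.get(gene, 0)
        if s + a >= min_reads and antisense_fraction(s, a) >= min_fraction:
            flagged.append(gene)
    return sorted(flagged)

# Toy counts (hypothetical) for one stranded sample.
sense_counts = {"gene1": 900, "gene2": 50, "gene3": 3}
antisense_counts = {"gene1": 20, "gene2": 60, "gene3": 5}

candidates = flag_genes(sense_counts, antisense_counts)
print(candidates)
```

Genes flagged this way are only candidates; the coverage still has to be inspected in a genome browser to see whether the antisense signal follows the gene's exon structure.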

If this really is the (main) reason for the discrepancies, you could use an unstranded alignment of the stranded library in conjunction with the three unstranded libraries for DGE. The caveat with this approach is that the unstranded expression estimates reflect the total expression at the given locus, not only the expression of protein-coding genes. You should also still compare the DGE calls from all four libraries with the DGE calls from the three unstranded libraries as a rough sanity check. After all, they were prepared by different labs, using different protocols...
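The sanity check suggested above could be done along these lines: take the lists of genes called differentially expressed in the two analyses and look at their overlap. This is a minimal sketch; the function name and the gene identifiers are hypothetical.

```python
def compare_calls(calls_a, calls_b):
    """Summarize the overlap between two sets of DGE calls."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard index: size of the intersection over size of the union.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }

# Toy call sets (hypothetical gene IDs).
calls_all_four = ["gene1", "gene2", "gene3", "gene5"]
calls_unstranded_only = ["gene1", "gene2", "gene4"]

summary = compare_calls(calls_all_four, calls_unstranded_only)
print(summary)
```

A low overlap would suggest that the added stranded library is driving the calls, which is worth examining before trusting the combined analysis.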