Question

Differential expression for two very different samples

4

7.4 years ago

I0110 ▴ 160

Standard tools for differential expression analysis (e.g. edgeR and DESeq2) assume that most genes in the samples are equally expressed and that only a small fraction are differentially expressed. I was wondering how we can compare two very different RNA samples, for example one from muscle and the other from liver. I know some people just use a more stringent criterion (e.g. 4-fold difference and FDR < 0.001). Is there a more statistically sound way to do the analysis? Thanks!

RNA-Seq Statistics • 4.9k views

ADD COMMENT • link updated 7.4 years ago by James Ashmore ★ 3.5k • written 7.4 years ago by I0110 ▴ 160

0

Standard tools for differential expression analysis (e.g. edgeR and DESeq2) assume that most genes in the sample are equally expressed, and only a small fraction of genes are differentially expressed.

Are you sure about that?

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

2

For example: "Still, it is important to keep in mind that even these methods are based on an assumption that most genes are equivalently expressed in the samples." from https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91

ADD REPLY • link 7.4 years ago by I0110 ▴ 160

0

Thanks for the citation, I appreciate it! I won't actually believe it until I see it stated by the writers of the tools, but it doesn't seem unlikely.

That said - don't consider my opinion here to be authoritative, but people use those tools all the time for differential expression analysis between tissue types. RNA-seq always seems to be unpredictable and hard to reproduce, though, so I'm not really sure how you would validate that an approach is working correctly.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

It's indeed an assumption of DESeq2 and similar tools. Now I'm trying to find a reference for that too...

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Please correct me if I am wrong. I guess it is difficult to get "normalized counts" for very different samples. Indeed, most people just go ahead and use these tools with different organs or tissues, but I wonder if there is a better way. :-) Another way to think about this: maybe it is meaningless to analyze differentially expressed genes between tissues, since they are already too different.

ADD REPLY • link 7.4 years ago by I0110 ▴ 160

1

I guess you could normalize to an a priori selected set of housekeeping genes as a "stable background".
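To make the housekeeping idea concrete, here is a minimal Python sketch (not taken from any package) of median-of-ratios size factors computed only from a chosen control set. The gene names, the counts, and the function name `size_factors_from_controls` are all made up for illustration, and whether a given gene is really stable across muscle and liver is an assumption you would have to check. As far as I know, DESeq2's estimateSizeFactors() has a controlGenes argument that serves the same purpose.

```python
import math
from statistics import median


def size_factors_from_controls(counts, control_genes):
    """Median-of-ratios size factors using only the given control genes.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    control_genes: names assumed (not verified) to be stably expressed.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene in control_genes:
        row = counts[gene]
        if any(c <= 0 for c in row):
            continue  # geometric mean is undefined when a count is zero
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no usable control genes")
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy counts: sample 0 = muscle, sample 1 = liver,
# with the liver library sequenced twice as deeply.
counts = {
    "ACTB": [1000, 2000],
    "GAPDH": [500, 1000],
    "RPLP0": [800, 1600],
    "ALB": [10, 50000],   # liver-specific in this toy example
    "MYH1": [40000, 5],   # muscle-specific in this toy example
}
controls = ["ACTB", "GAPDH", "RPLP0"]  # assumed housekeeping set

factors = size_factors_from_controls(counts, controls)
normalized = {
    gene: [c / f for c, f in zip(row, factors)]
    for gene, row in counts.items()
}
```

Because only the control genes enter the calculation, the tissue-specific genes cannot drag the size factors around, but the result is only as good as the choice of controls.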

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

I know people use a selected set of housekeeping genes as controls for qPCR analysis. Could we do that in RNA-Seq analysis? Could you provide a reference for that? Thanks!

ADD REPLY • link 7.4 years ago by I0110 ▴ 160

score 4 · Answer 1 · 2017-07-14

Although the authors state that most genes should not be differentially expressed, I think (and I remember reading this from an author of one of those packages on some forum) the packages are robust to a sizeable proportion of truly differentially expressed genes, as long as there are also plenty of non-differentially expressed genes for parameter estimation.

For edgeR, you can adjust the proportion of tags used for parameter estimation, so as to alleviate problems arising from too many truly differentially expressed genes - see the discussion here. In short, use the parameter logratioTrim in the function calcNormFactors().
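To illustrate what trimming more log-ratios does, here is a deliberately simplified Python sketch of a TMM-style scaling factor. It is not edgeR's implementation: as I understand it, the real calcNormFactors() also trims on absolute expression (sumTrim) and uses precision weights, both of which are omitted here. The function name `trimmed_mean_m` and the toy counts are made up.

```python
import math


def trimmed_mean_m(sample, reference, logratio_trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `reference`.

    Computes per-gene log2 ratios of library-size-scaled counts (M-values),
    drops the fraction `logratio_trim` from each tail, and returns 2 raised
    to the unweighted mean of what is left.
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # log ratio is undefined for zero counts
    )
    k = int(len(m_values) * logratio_trim)
    kept = m_values[k:len(m_values) - k]
    if not kept:
        raise ValueError("logratio_trim removes every gene")
    return 2 ** (sum(kept) / len(kept))


# Hypothetical toy data: 6 stable genes, 3 strongly up, 1 down in `sample`.
reference = [100] * 6 + [100] * 3 + [100]
sample = [100] * 6 + [1600] * 3 + [25]

light_trim = trimmed_mean_m(sample, reference, logratio_trim=0.1)
heavy_trim = trimmed_mean_m(sample, reference, logratio_trim=0.3)

# Effective library size = raw library size * scaling factor.
effective_size = sum(sample) * heavy_trim
```

In this toy case the heavier trim leaves only the stable genes, so the effective library size of `sample` comes out equal to that of `reference`, whereas the lighter trim lets two of the up-regulated genes bias the factor. Note that trimming is symmetric, so it cannot rescue a comparison where the stable genes are themselves in a tail.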

P.S.: you could probably get an answer from the package authors at the Bioconductor support forum: https://support.bioconductor.org

score 2 · Answer 2 · 2017-07-15

2

7.4 years ago

Michele Busby ★ 2.2k

Since the main problem here would be drawing a line through the middle of the genes to normalize the two sets, if you are creating your own data you may want to spike in ERCCs. These are RNA sequences that you would put into each sample in the same quantity to assist with normalization later.
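As a rough Python sketch of how spike-ins could feed into normalization: if the same amount of spike-in RNA went into every sample, the spike-in read totals reflect technical differences only, so they can serve as size factors. The function name `spike_in_size_factors` and all counts below are invented for illustration, and real spike-in data is noisier than this.

```python
import math


def spike_in_size_factors(spike_counts):
    """Per-sample size factors from spike-in reads only.

    spike_counts: one list of spike-in counts per sample.
    Each factor is the sample's spike-in total divided by the geometric
    mean of all totals, so the factors multiply to 1.
    """
    totals = [sum(sample) for sample in spike_counts]
    if any(t <= 0 for t in totals):
        raise ValueError("every sample needs at least one spike-in read")
    log_mean = sum(math.log(t) for t in totals) / len(totals)
    return [t / math.exp(log_mean) for t in totals]


# Hypothetical spike-in counts for three spike-in species per sample.
muscle_spikes = [200, 400, 800]
liver_spikes = [100, 200, 400]

factors = spike_in_size_factors([muscle_spikes, liver_spikes])

# Apply the factors to endogenous gene counts (made-up numbers).
muscle_gene_count, liver_gene_count = 3000, 1500
muscle_norm = muscle_gene_count / factors[0]
liver_norm = liver_gene_count / factors[1]
```

Here the muscle library yields twice as many spike-in reads, so its counts are scaled down accordingly, and no assumption about endogenous genes being unchanged is needed.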

In truth, if most of the genes are expressed differently, then the more stringent cutoff probably has more to do with prioritizing genes than with normalization, i.e. with what you can plan on doing with a list of 10K genes.
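For what it's worth, prioritizing with the thresholds mentioned in the question (4-fold, i.e. |log2 fold change| >= 2, and FDR < 0.001) could look something like this in Python. The result records, field names, and the function name `prioritize` are hypothetical; in practice the values would come from the output table of your DE tool.

```python
def prioritize(results, min_abs_log2fc=2.0, max_fdr=0.001):
    """Keep genes passing both thresholds, most significant first.

    A 4-fold change corresponds to |log2 fold change| >= 2.
    Ties in FDR are broken by larger absolute fold change.
    """
    hits = [
        r for r in results
        if abs(r["log2fc"]) >= min_abs_log2fc and r["fdr"] < max_fdr
    ]
    return sorted(hits, key=lambda r: (r["fdr"], -abs(r["log2fc"])))


# Hypothetical differential expression results.
results = [
    {"gene": "GENE_A", "log2fc": 5.1, "fdr": 1e-10},
    {"gene": "GENE_B", "log2fc": 1.2, "fdr": 1e-8},   # fold change too small
    {"gene": "GENE_C", "log2fc": -3.0, "fdr": 1e-5},
    {"gene": "GENE_D", "log2fc": 2.5, "fdr": 0.01},   # FDR too high
]

top = prioritize(results)
top_genes = [r["gene"] for r in top]
```

This only shortens the list to something you can follow up on; it does not address whether the normalization behind the fold changes was appropriate.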

ADD COMMENT • link 7.4 years ago by Michele Busby ★ 2.2k

0

It is a very good idea to prioritize genes instead of normalizing genes. Thanks, Michele.

ADD REPLY • link 7.4 years ago by I0110 ▴ 160

score 2 · Answer 3 · 2017-07-15

2

7.4 years ago

James Ashmore ★ 3.5k

Have a look at the R package qsmooth and its associated manuscript. It gives an example very similar to the one in your question.

http://www.biorxiv.org/content/early/2016/11/03/085175

ADD COMMENT • link 7.4 years ago by James Ashmore ★ 3.5k

0

Thanks so much, James! For others interested in this method, a user's guide can be found at https://github.com/stephaniehicks/qsmooth/blob/master/vignettes/qsmooth-vignette.pdf

ADD REPLY • link 7.4 years ago by I0110 ▴ 160