Question

RNA-Seq DGE: Evaluate model quality

0

Entering edit mode

6.9 years ago

firestar ★ 1.6k

I am using DESeq2 to find differentially expressed genes between conditions control/treated. The PCA using vst counts shows family-based clustering.

enter image description here

I have conjured up a few different GLM models as below along with the number of differentially expressed genes detected:

Model Num of DEGs
~condition 80
~family+condition 623

(using principal components from pca above)
~pc1+pc2+pc3+condition 21

(sva was used to generate one surrogate variable)
~sva+condition 73
~sva+family+condition 568

A venn diagram of the DEGs from the 5 models looks like:

enter image description here

Is there some way to objectively evaluate (quantify) which model is better?

This question is related to this.

differential-gene-expression RNA-Seq deseq2 • 2.2k views

ADD COMMENT • link updated 8 months ago by Ram 44k • written 6.9 years ago by firestar ★ 1.6k

score 2 · Answer 1 · 2018-02-17

2

Entering edit mode

6.9 years ago

Devon Ryan 105k

I'm not sure why you bothered expending the effort to use SVA and add principal components as covariates. Unsurprisingly, you have a family effect that you need to account for. So ~family + condition is the logical first (and probably only) model to use given that.

ADD COMMENT • link 6.9 years ago by Devon Ryan 105k

0

Entering edit mode

I agree with Devon! Devon, I was hoping that you would respond. There was a discussion in the previous thread: RNA-Seq DGE using PCA components as covariate

The ~ family + condition model is also what I would choose, but definitely not do any adjustment based on PC covariates

ADD REPLY • link 6.9 years ago by Kevin Blighe 88k

0

Entering edit mode

@Kevin The general suggestion is that I use family+condition. I do get your explanation behind it. But, I would still like to know how one would evaluate these models. How do I know which model gives me the 'correct' set of DEGs. Is there some metric that I could use to decide model is better? For example something like AIC, but I am not sure how one would get to that using DESeq2.

ADD REPLY • link 6.9 years ago by firestar ★ 1.6k

1

Entering edit mode

The issue is mostly that each gene has its own AIC/BIC. The normal model comparison metrics aren't intended for cases where you're applying that model independently to 30,000 datasets.

In general, one uses the simplest possible model that reasonably describes the model (fun fact: if you add enough covariates to your model you can get almost everything to be DE...I accidentally did that once and was confused that I suddenly had thousands of DE genes associated with a small effect-size phenotype). Using PCs or SVA is reasonable only when you have some sort of confounding effect that you don't already understand. You know that families are a confounder already.

As an aside, principal components are basically an inferior form of what SVA is doing.

ADD REPLY • link 6.9 years ago by Devon Ryan 105k

0

Entering edit mode

@Devon Yep ~family + condition is my current choice. Sure, I could just do that and move on. But, I am just curious how the other models work. All I have at the moment to evaluate this is the number of DEGs found. I am trying to find out if there is a better metric that I could use to evaluate which model is best.

ADD REPLY • link 6.9 years ago by firestar ★ 1.6k

1

Entering edit mode

Devon's experience matches that of my own in that the simplest model should always be chosen. You do not want to over-adjust, because then, as Devon (and I) implies, 'everything' becomes statistically significant. I cannot see past the best model in your situation being ~ family + condition. It does not make sense to use PC covariates in this particular context. You even include PC3 but only show PC1 and PC2 in your bi-plot. PC3 may very well be meaningless.

Having a good study design generally makes the analyst's job a lot easier, too.

With regard to the problem of figuring out if the model is the best one or not, that will be played out as you perform downstream analyses. If the model is entirely inaccurate, then certain genes will have failed the independent filtering and Cook's distance tests, and have their P values set to NA - see Here for further details. Also, if you perform something like hierarchical clustering using your statistically significant differentially expressed genes, and they don't actually segregate your cases and controls, then something is wrong.

Generally, a RNA-seq experiment is an exploratory analysis, so, we should keep an open mind about which genes may be involved in our condition of interest. The ultimate test will then be in follow-up studies where we aim to build a predictive regression model using our genes of interest, which can usually be done with a targeted technology like NanoString, Fluidigm, or something else. In these situations, the AIC/BIC to which Devon refers become more important as we literally heavily scrutinise each and every gene in the final panel in order to gauge its predictive potential.

Apologies if my comments come cross as harsh but I also in the past struggled over which design model to choose.

ADD REPLY • link 6.9 years ago by Kevin Blighe 88k