Question

Adjusting for confounding variables in RNA-SEQ Differential Expression Analysis

4

Entering edit mode

6.3 years ago

ucakhnd ▴ 40

My project is working on a large dataset of RPKM values for patients with and without Schizophrenia.

After some preprocessing steps including dumping genes with lots of zero RPKM values and log2-transforming, I have applied Non-negative Matrix Factorization (NNMF) as a dimensionality reduction technique. I am looking for statistically significant correlations between the resulting groups of genes ('metagenes') and schizophrenia.

Until now, I have been using a simple t-test, with Bonferroni correction, to test the metagene expression values for correlation with Schizophrenia. I think that the normality condition is fine because there are about 150 cases and 170 controls - so CLT holds. Some of the results have such very low adjusted p-values that I am relatively certain I have found something interesting.

However, I need to be sure absolutely sure that this is not down to confounding factors. There are slight imbalances by demographic in the schizophrenia vs. non-schizophrenia groups - I need to correct for a few variables, both discrete and continuous - the full list I want to correct for is: Age, Sex, Race, Smoking or not, Postmortem interval, sample pH, and RNA integrity number.

Is there a statistical test, more advanced than the t-test, that can be applied that will ACCOUNT for the impact of these confounding covariates, and make sure that I really have found statistically significant correlations with Schizophrenia? If there is not, then can you recommend how I could change my procedure to best guard against the the confounding factors?

Want to make sure I'm reporting solid results! Thanks for any help you can give.

Differential-Gene-Expression RNA-Seq Statistics • 7.0k views

ADD COMMENT • link updated 7 months ago by Ram 44k • written 6.3 years ago by ucakhnd ▴ 40

score 5 · Answer 1 · 2018-08-16

5

Entering edit mode

6.3 years ago

Kevin Blighe 88k

Unfortunately, RPKM data is not ideal for the purposes of cross-sample differential expression analysis. Please read THIS. In their key points:

The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Logging (base 2) RPKM data may make it's distribution look more even, but the underlying issue still persists. I suggest that you go back to obtain the raw counts and re-normalise by TMM (edgeR) or using DESeq2's method.

Nevertheless, in order to 'adjust' for confounding factors / covariates, you should probably first check whether these factors are actually influential (statistically) on schizophrenia by testing each independently:

MyData
  Schizophrenia  Gene1   Age  Sex   ...
  Y              2.33    45   M     ...
  Y              3.21    43   M     ...
  N              1.21    26   F     ...
  N              2.11    35   T     ...
  ...

Check encoding:

MyData$Schizophrenia <- factor(MyData$Schizophrenia, levels = c('N','Y'))
MyData$Age <- as.numeric(MyData$Age)
et cetera

Check each in a logistic regression model:

summary(glm(Schizophrenia ~ Age, data = MyData, family = binomial(link = 'logit')))
summary(glm(Schizophrenia ~ Sex, data = MyData, family = binomial(link = 'logit')))
summary(glm(Schizophrenia ~ Race, data = MyData, family = binomial(link = 'logit')))
summary(glm(Schizophrenia ~ SmokingStatus, data = MyData, family = binomial(link = 'logit')))
summary(glm(Schizophrenia ~ PostmortemInterval, data = MyData, family = binomial(link = 'logit')))
summary(glm(Schizophrenia ~ SamplepH, data = MyData, family = binomial(link = 'logit')))
summary(glm(Schizophrenia ~ RIN, data = MyData, family = binomial(link = 'logit')))

If either of these are not statistically significant, you may consider leaving them out. You then test each gene independently and include the statistically significant factors in the model formula:

For example, if Age , Sex, SmokingStatus, and PostmortemInterval are statistically significantly associated with Schizophrenia from the above independent models, then, when testing each gene, we may adjust for the effects introduced by these covariates as follows:

model <- glm(Schizophrenia ~ Gene1 + Age + Sex + SmokingStatus + PostmortemInterval,
  data = MyData, family = binomial(link = 'logit'))
summary(model)

model <- glm(Schizophrenia ~ Gene2 + Age + Sex + SmokingStatus + PostmortemInterval,
  data = MyData, family = binomial(link = 'logit'))
summary(model)

*et cetera*

If you need to set this up as a loop over all genes:

genelist <- c("Gene1", "Gene2", ..., "GeneX")
for (i in 1:length(genelist)) {
   formula <- as.formula(paste("Schizophrenia ~ ", genelist[i], " + Age + Sex + SmokingStatus + PostmortemInterval", sep=""))
   model <- glm(formula, data = MyData)
   print(summary(model))
}

This code could easily be done via lapply, and/or parallelised via mclapply or foreach

Kevin

ADD COMMENT • link 3.6 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, Thanks for your answer. What if either of these is statistically significant happened in my sample information?

ADD REPLY • link 5.5 years ago by chuanjiao6 • 0

0

Entering edit mode

I do not understand what you mean

ADD REPLY • link 5.5 years ago by Kevin Blighe 88k

0

Entering edit mode

I am sorry. I mean if two varies are confounded together, how to deal with this stuff?

ADD REPLY • link 5.5 years ago by chuanjiao6 • 0

0

Entering edit mode

How would write the loop of the logistic regression model if instead of "Schizophrenia" (so the treatment condition) you want to see how each gene is affected by age and sex independently from the health status of the patient (and so basically and identify the residual)?

ADD REPLY • link 3.7 years ago by camillab. ▴ 160

1

Entering edit mode

Hi, I think that the same idea applies but that you just use a different formula? The above method is inefficient due to the fact that just a for loop is being used

You may get what you need via my other Bioconductor package, RegParallel? - https://bioconductor.org/packages/release/data/experiment/vignettes/RegParallel/inst/doc/RegParallel.html#perform-a-basic-linear-regression

ADD REPLY • link 3.7 years ago by Kevin Blighe 88k

0

Entering edit mode

Does it work only on normalised the raw counts (DESeq2) or can be used also on RPKMs?

ADD REPLY • link 3.7 years ago by camillab. ▴ 160

1

Entering edit mode

These models should preferably be run on, e.g., the regularised log or variance-stabilised expression levels, if using DESeq2. If you have RPKM, I would log2(RPKM + 0.1) these.

ADD REPLY • link 3.7 years ago by Kevin Blighe 88k

1

Entering edit mode

As I understand you do indeed just want a formula of form:

GENE ~ age + sex

..or:

GENE ~ age
GENE ~ sex

So, just a linear regression via lm(). To re-use the above code, that would be something like:

Quick example

mydata <- data.frame(
  Age = c(25,43,18,48,58,53),
  Sex = factor(c('M','M','M','F','F','F'), levels = c('F', 'M')),
  gene1 = c(1,2,3,4,5,6),
  gene2 = c(5,4,3,2,1,0))
genes <- c('gene1', 'gene2')

for (i in 1:length(genes)) {
  formula <- as.formula(paste0(genes[i], ' ~ Age + Sex'))
  model <- glm(formula, data = mydata)
  message(genes[i], '\n#################')
  print(summary(model)$coefficients)
  message('\n')
}

gene1
#################
                Estimate Std. Error     t value  Pr(>|t|)
(Intercept)  5.277003484 3.19455200  1.65187591 0.1971278
Age         -0.005226481 0.05895096 -0.08865812 0.9349405
SexM        -3.127177700 1.71589030 -1.82248113 0.1659092


gene2
#################
               Estimate Std. Error    t value  Pr(>|t|)
(Intercept) 0.722996516 3.19455200 0.22632172 0.8354948
Age         0.005226481 0.05895096 0.08865812 0.9349405
SexM        3.127177700 1.71589030 1.82248113 0.1659092

ADD REPLY • link 3.7 years ago by Kevin Blighe 88k

score 1 · Answer 2 · 2018-08-16

Q: However, I need to be sure absolutely sure that this is not down to confounding factors.

A: There are ways to minimize the possibilities of being confounded by technical covariates, but my understanding is that there is no way you can zero out the possibility of being confounded. That's why high impact journals always ask for an independent validation cohort.

Q: Is there a statistical test, more advanced than the t-test, that can be applied that will ACCOUNT for the impact of these confounding covariates, and make sure that I really have found statistically significant correlations with Schizophrenia? If there is not, then can you recommend how I could change my procedure to best guard against the confounding factors?

A: Before thinking about what kind of test, it's probably better to just write out the null hypothesis that you are testing. In general, the hardest part is deriving a null background that is not BS. In that case, you probably should show us the distribution of the data. The expression data usually have normal distribution after dropping the 0 and log2 transform. TL/DR: a simple regression model with indicator variables for the discrete variables should do the job for your case. At least the coefficients will be useful for telling you how each factor is contributing to the gene expression. Wald test will also show whether the factors are "significant" in the regression model or not.