Sure, that's possible, but I don't know to which distribution your data has been normalised. In this example below, I just produce binomially-distributed random data and model it as such. In DESeq2 for RNA-seq, data is modeled as a negative binomial family with adjustments for size factors. DESeq2 also includes a dispersion parameter for each gene for statistical inference, but I guess that you don't need to do that here. You have not stated exactly how your data was normalised.
Let's create some fake data:
fakedata <- matrix(rbinom(10*20, 100, .1), ncol=10)
rownames(fakedata) <- paste0("sample", c(1:nrow(fakedata)))
fakedata <- data.frame(c(rep("control", 10), rep("case", 10)), fakedata)
colnames(fakedata) <- c("CaseControl", paste0("gene", c(1:10)))
head(fakedata, 14)
CaseControl gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
sample1 control 10 12 6 10 13 11 12 10 8
sample2 control 6 17 8 10 10 14 10 8 8
sample3 control 10 15 7 10 2 11 7 14 11
sample4 control 13 7 7 16 17 8 10 9 6
sample5 control 13 10 11 9 14 9 12 8 11
sample6 control 13 7 9 11 11 10 8 6 8
sample7 control 5 14 9 7 5 11 4 13 13
sample8 control 19 11 15 11 10 9 11 13 12
sample9 control 13 8 15 8 13 13 5 12 9
sample10 control 10 13 5 11 9 7 8 10 7
sample11 case 11 13 10 8 13 13 9 9 7
sample12 case 10 10 12 17 13 11 13 6 9
sample13 case 8 13 12 10 10 11 15 13 12
sample14 case 10 12 15 11 11 7 7 11 13
-------------------------------------------------
Convert our outcome to categorical and ensure that 'control' is the reference level:
fakedata$CaseControl <- factor(fakedata$CaseControl, levels=c("control","case"))
-----------------------------------------------
Check the distribution
hist(data.matrix(fakedata[,2:ncol(fakedata)]))

Looks binomial to me; so, we will set the family as binomial during modelling.
----------------------------------------------
What we can then do is fit a logistic regression model for each gene and derive a p-value via the Wald test. Here, I am just doing it for gene1. DESeq2 does this for each gene via fitNbinomGLMs.
model <- glm(CaseControl ~ gene1, data=fakedata, family=binomial(link="logit"))
Take a look at the model coefficients:
coef(model)
(Intercept) gene1
2.3563115 -0.2374463
-----------------------------------------------
We then apply the Wald test on whichever model coefficient ('term') we want. Our gene is the second coefficient:
library(aod)
wald.test(b=coef(model), Sigma=vcov(model), Terms=2)
Wald test:
----------
Chi-squared test:
X2 = 1.9, df = 1, P(> X2) = 0.17
Not statistically significant for gene1 in this random example.
-------------------------------------------------
Note that the Wald test can actually be used to derive a single p-value for the coefficients for more than 1 gene combined:
model <- glm(CaseControl ~ gene1 + gene5 + gene7, data=fakedata, family=binomial(link="logit"))
coef(model)
(Intercept) gene1 gene5 gene7
-0.25601459 -0.22555748 -0.03603168 0.29738610
wald.test(b=coef(model), Sigma=vcov(model), Terms=2:4)
Wald test:
----------
Chi-squared test:
X2 = 4.3, df = 3, P(> X2) = 0.23
Trust that this assists. Note that if your data has indeed been normalised to a negative binomial distribution, then you can fit a model via the glm.nb function from the MASS package. DESeq2's model functions are quite specific for that particular program, from what I gather.
Kevin
Dear Kevin, Thanks a lot for your comprehensive answer. I really do appreciate it! I am going to try it right away. So far from what I have read, I just got one simple question, is there a way in R that I pass all my variables at once for my linear model instead of using ~ x1,+x2+.... ? or I should loop over it ?
Thanks a lot again !
I believe you should loop over your genes and test each independently via the Wald test. After you have processed all genes, then you could think about creating a 'merged' model with multiple statistically significant genes in the same formula. That final formula then can be considered a preliminary 'gene signature', that needs to be further put to the test in terms of its predictive ability to distinguish or categorical variables of interest. I have posted some code on that topic, too: A: Resources for gene signature creation
To help with the creation of a loop, take a look here: One-way ANOVA in R for many observations Essentially you create a formula via the
as.formula
function, and supply that toglm
. For the purposes of automation and ease of output, you can directly access the Wald test p-value with:For example, you can append that to an output file or simply add it to a vector or data frame with each loop.