If you have already derived a panel of genes that are statistically significantly differentially expressed, then there are various tools to use.
continuous outcome
If you are modelling a continuous trait (e.g. blood levels of inflammatory markers, metabolite levels, etc.), then build a linear model using lm()
:
For example:
lm(MyContinuousTrait ~ gene1 + gene2 + gene3 + ... + geneX, data=MyData)
You can then do stepwise regression to refine the model further based on the AIC or BIC using stepAIC()
from MASS package.
If you have many genes (e.g. > 30), then first test each gene independently, followed by building a combined model (like above) with the statistically significant genes:
lm(MyContinuousTrait ~ gene1, data=MyData) # p=0.8
lm(MyContinuousTrait ~ gene2, data=MyData) # p=0.03
lm(MyContinuousTrait ~ gene3, data=MyData) # p=0.67
lm(MyContinuousTrait ~ gene4, data=MyData) # p=0.02
MyModel <- lm(MyContinuousTrait ~ gene2 + gene4, data=MyData)
With this final model, you can still do stepwise regression to refine it even further.
For performing QC on this model::
library(car)
# check for outliers based on Cook's distances (2 different plots)
plot(cooks.distance(MyModel))
cutoff <- 4 / ((nrow(MyData) - length(MyModel$coefficients) - 1))
plot(MyModel, which=4, cook.levels=cutoff)
# Bonferonni p-value for most extreme observations
outlierTest(MyModel)
# check for non-normality via QQ plot for 'studentised' residuals
qqPlot(MyModel, main="QQ Plot")
# evaluate homoscedasticity
# non-constant error variance test
ncvTest(MyModel)
# multi-collinearity via variance inflation factors
vif(MyModel)
# a VIF > 2.5 indicates that the R^2 of a predictor to all other predictors is 0.6, i.e., collinear
sqrt(vif(MyModel)) > 2.5
# non-linearity via component + residual plot
crPlots(MyModel)
# test for Autocorrelated Errors
durbinWatsonTest(MyModel)
# global test of model assumptions
library(gvlma)
gvmodel <- gvlma(MyModel)
summary(gvmodel)
# k-fold cross-validation (here 3-fold)
library(DAAG)
cv.lm(MyData[!is.na(MyData$MyOutcomeTrait),], MyModel, m=3)
# R^2 shrinkage using 10-fold cross-validation
library(bootstrap)
theta.fit <- function(x, y) { lsfit(x,y) }
theta.predict <- function(fit, x) { cbind(1,x) %*% fit$coef }
x <- as.matrix(MyData[,which(colnames(MyData) %in% names(summary(MyModel)$coefficients[,1]))])
y <- as.matrix(MyData[,c("MyOutcomeTrait")])
results <- crossval(x, y, theta.fit, theta.predict, ngroup=10)
# Raw R^2
cor(y, MyModel$fitted.values)**2
# cross-validated R^2
cor(y, results$cv.fit)**2
Finally, if you have built a linear model, you can make a nice plot with ggplot2 using the following code:
MyData$MyContinuousTrait.pred = predict(MyModel)
ggplot(MyData, aes(x=MyContinuousTrait, y=MyContinuousTrait.pred)) +
geom_point() + xlim(0,5) + ylim(0,5) +
geom_smooth(method='lm',formula=y~x) +
stat_smooth(method="lm", fullrange=TRUE) +
ggtitle("")
- X-axis: real values for the outcome variable
- Y-axis: model predicted values
categorical outcome
If your outcome is categorical, i.e., a binary trait or has greater than 2 categories, then first build a binary or multinomial logistic regression model using glm(..., family="binomial")
with your predictors (genes). You can then choose the best predictors again based on the AIC or BIC using stepwise regression.
The checks for a glm()
differ from those for a lm()
:
# McFadden R^2 index used to assess the model fit
library(pscl)
pR2(MyModel)
# perform Chi^2 ANOVA
anova(MyModel, test="Chisq")
# perform cross validation
# the delta values should not greatly differ
# K=number of samples, i.e., leave-one-out CV.
library(boot)
cv.glm(MyData, MyModel, K=nrow(MyData))$delta
---------
For both lm()
and glm()
- based models, after you have chosen your best predictors and defined a final model, you can derive odds ratios (with confidence intervals) with:
# get ORs and confidence intervals
exp(cbind(OR=coef(MyModel), confint(MyModel)))
You can also perform ROC analysis using the pROC package.
Other options that can be used instead of lm()
/glm()
are lasso-penalized, elastic-net, and ridge regression (glmnet package).
Thanks for the suggestions, Kevin.
It's a categorical factor. We integrated mRNA and and miRNA data with SNFtool to generate three clusters that show DEG, pathway, and survival differences. However for validation purposes, it's difficult to find independent datasets in the wild which are of our specific tissue type and which have mRNA, miRNA, and survival data. So to work around this, they want to generate mRNA signatures for each and perform single sample GSVA on an independent dataset and determine if enrichment scores correlate with outcome.
For the glm, and excuse me if this is a silly question, are you suggesting something like a multivariate regression model (e.g., gene_1 + gene_n <- factor), or multiple regressions (e.g., gene_1 <- factor, gene_2 <- factor)? I assume this is already addressed within the stepAIC() documentation, maybe?
Hey, yes, I was suggesting a multivariable model such as:
With stepAIC you first define your full model, minimal model (which can be just
glm(factor ~ 1)
) and then run either a forward, backward, or both stepwise regression. This then returns a final model that it deems to be most predictive of your factor of interest, based on either the AIC or BIC criteria. It may return just a few of your genes - who knows. Based on the fact that your genes are already DEGs, I would hope that your final model is pretty good. A ROC curve using the pROC package should also show this with a high AUC.Also, stepwise regression is not great for very large numbers of variables. If you have around 20 or 30 DEGs, then that's optimal, I believe. Also, some statisticians will probably frown if you just let the stepwise function do everything for you. You should check the model summaries with the
summary()
function.Also, now that I know that it's a multinomial logistic regression model, the QC on the model differs. Here is some further code that may assist in assessing the model (NB - I have updated my initial answer, above, to include this):
Obviously your choice of factor reference level is key too.
Trust that this assists you.
Oh ok, I see. This makes a lot of sense.
Unfortunately the number of DEG is high (~2k unique across all three groups). They are tumor samples, so heterogeneity is expected. At the very least it makes sense to go back and double check the #DEG I have is accurate, and contrast the limma results with an alternate method like DESeq. I could also take the top 10 ranked DEG by logFC in each SNF group, or apply more conservative limits on logFC and/or adjPvalue.
Worse case scenario, assuming I end up with a large number of possible predictors, would it be helpful to do a random forest instead of the stepwise regression and address potential over-fitting by limiting the max number of randomly selected predictors (e.g., 20-30) included in each random decision tree? The issue I see with this is that the number of permutations I'd have to run to cover possible combinations of predictors would be astronomical. Maybe instead of doing this, I could take the DEG and do a cox regression for each and select top 30 with largest absolute hazard ratios? Ultimately what we are most interested in is whats driving the survival differences, so this seems to be a logical data driven approach to reduce the number of potential genes.
Again thanks for your thoughts on this, this is very helpful for me.
You seem to have a good grasp of the options available already. Stepwise regression is definitely not suitable with that many variables; however, lasso-penalized regression would work and may give you only a couple dozen variables in return. With the lasso regression (alpha=1.0), you can additionally adjust the alpha 'penalty' parameter in order to change the number of predictors that are thrown out (and thus the number that remain). Alpha of 0 is ridge regression, where no variables are thrown out of the model and each is assigned a beta value; whereas anything in between 1.0 and 0.0 is elastic regression.
In light of the fact that it's cancer data and that your interest is in survival, the cox proportional hazards regression idea is good and I assume you meant that you'd test each independently and then pick the top 30? That's certainly another acceptable way to do this type of thing. Edit 5th February 2019: I later wrote a tutorial for just this purpose: Survival analysis with gene expression
Random Forests, I cannot comment too much - it's one area that I'm yet to cover. I had an American colleague in Boston who did some great work on those but I did not get a chance to learn from his ideas/efforts.
Great suggestion. I think I'm going to try the lasso regression and see what I pull out.
Hi, I was wondering how you implemented this in the end? I am faced with a similar task at the moment and I am wondering how to do this properly. Thanks!
You mean lasso regression? I share some code/ideas here:
There are also tutorials online, if you search via a search engine.
Thanks so much for the help! I was wondering if you have any recommendations concerning sample size (RNA-seq) for this to properly work?
'The more, the merrier', basically! What sample size have you got, currently?
Regarding variables, you had mentioned stepwise regression in your other thread. From my experience, when the variables gets toward 40 or 50, one does then require lasso regression.
Well, we are designing the experiment at the moment - having said that I am restricted by money and can't sequence too many samples. Hence I am wondering what a minimum and good sample size would be for training and test set...
concerning stepwise regression - I was thinking about preselecting 40-50 genes by testing the DEGs independently with cox proportional hazards regression and then pick the top 40-50 genes. I would then feed them into stepwise regression - does this make sense? Something similar was suggested in this thread.
Thank so much for your input!
Yes, you could do that. I've done that in the past, i.e., differential expression analysis (or the Cox model in your case), followed by stepwise regression and model building of your top hits.
Sample sizes in RNA-seq / expression studies don't have to be as large as GWAS / genetic studies. Anecdotally, they say that you should have minimum 3 replicates per group, so, n=6 in a simple case-control study. Just do as many as you can but don't leave the groups unbalanced. That is, if you can afford 20 RNA-seq samples, then do 10 per group.
naive question what is "Mymodel" is it the input matrix ?
MyModel is neither a matrix nor data-frame. It is the regression model, created with either
lm()
orlm()
. Hope that this helps!okay..now it helped me to get a hang of it i was seeing the code but unable to figure it out let me try and let you know and this is the first error i encounter actually im using your code which you made regression
so the model is listModels not sure if im doing it correct or not
figured out the error the object which i should give is model not list model..some success..as of now
what is this
"cv.lm(MyData[!is.na(MyData$MyOutcomeTrait),], MyModel, m=3)"
theMyOutcomeTrait
You figured it out? - that's great, dear friend
not fully but im moving at a snail's pace....i updated my doubt
Okay.
MyOutcomeTrait
is the variable that you're modelling, i.e., they
variable.For example:
Hi Kevin,
I used the gene counts matrix with 7 rows - samples and 16749 columns - genes as datainput. Using two genes I build model
I followed the code, works fine, at the cross validation step getting error.
I given the default sample size of 7. May be the error is due to order of matrices are incorrect ? How do I rectify this error. Instead of taking whole genecounts should I have to choose the dataset only with the significant DGE's ?
That is for the cross-validated R2? What was the output of
DAAG::cv.lm
?You could avoid
crossval()
for now. How does the plot for Predicted versus Real values appear?Yes, the plot shows blank.
I am sorry - this is difficult for me to debug while remote. You will have to go back through your code. Keep in mind that this type of work is quite advanced.
That's ok,
Hi Kevin Blighe, allow me to add comment here but can I know if I have more than 1 categorical outcomes, i.e. whether my future samples will have diseaseA, diseaseB, and so on, how should I proceed?