Note July 22, 2021: I have answered for univariable and multivariable, assuming that you meant these instead of univariate and multivariate. For an interesting discussion, see here: https://stats.stackexchange.com/questions/447455/multivariable-vs-multivariate-regression
Hey,
I will try to be as brief as possible and give you general points. Firstly, you may find this previous answer an interesting read: What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?
Univariable
This obviously just involves testing each variable (gene) as an independent predictor of the outcome. You have Affymetrix microarrays. For processing these, you should default to the oligo package. affy is another package but it cannot work with the more modern 'ST' Affymetrix arrays.
Limma is still used to fit the regression model independently to each gene / probe-set. A simple workflow may be (you will have to adapt this to your own situation):
library("limma")
# 'oligo' is more suited for the Gene ST Affymetrix arrays
library("oligo")
# Read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets("Targets.txt", sep="\t")
CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)
# Raw intensity data
project <- read.celfiles(CELFiles)
# Background correct, normalize, and calculate gene expression
project.bgcorrect.norm.avg <- rma(project, target="probeset")
# Create the study design and comparison model
design <- paste(targetinfo$CellPopulation, sep="")
design <- factor(design)
design
comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)
comparisonmodel
# Extract the control probes from the chip
ControlProbes <- grep("AFFX",featureNames(project.bgcorrect.norm.avg))
ControlProbes
# Fit the linear model on the study's data
# lmfit uses generalized least squares (GLS), which is better for heteroscedastic data (i.e., data where the variances of sub-populations are different)
project.fitmodel <- lmFit(exprs(project.bgcorrect.norm.avg), comparisonmodel)
# Applying the empirical Bayes method to the fitted values acts as an extra normalization step and aims to bring the different probe-wise variances to common values
# This ultimately reduces the degrees of freedom of variances (i.e., less individual, different, variances).
project.fitmodel.eBayes <- eBayes(project.fitmodel)
names(project.fitmodel.eBayes)
# Make individual contrasts and fit the new model
Q1_contrastmodel <- makeContrasts(Q1="NM-NL", levels=comparisonmodel)
Q1_contrastmodel.fitmodel <- contrasts.fit(project.fitmodel.eBayes, Q1_contrastmodel)
Q1_contrastmodel.fitmodel.eBayes <- eBayes(Q1_contrastmodel.fitmodel)
topTable(Q1_contrastmodel.fitmodel.eBayes, adjust="fdr", coef="Q1", number=9999999, p.value=1)
multivariable
Once you have identified statistically significant independent predictors (genes), you may then want to build a multivariable model using these statistically significant genes. This is where we enter the realm of multivariable logistic regression. I have posted a few answers:
Essentially, you may build a single combined model using glm()
(MASS package) and then use some procedure to further reduce the number of predictors. Techniques used here are varied but include:
- stepwise regression
- Lasso-penalised regression
- other classifiers like RandomForestâ„¢
In these models, you can include other variables, such as age
and weight
, and anything else that you want. The best model may well consist of a mixture of genes and clinical parameters.
When you do this, you must ensure that your variables are encoded correctly, i.e., as continuous or categorical variables. If they are categorical, then the categories (factors) must be ordered as per your own decision. The reference level is always positioned first in the ordered list.
ROC analysis
Once you have settled on a final model(s), you can test the AUC. This can be done with pROC, with examples here:
Hey, why don't you go and impress your new colleagues by also calculating sensitivity, specificity, precision, and accuracy, too? - How to calculate the sensitivity and specificity of set of expression dataset used in a validation cohort
Kevin
Wow, this is just mesmerizing, Kevin! I immensely appreciate your help. In fact, this was more than that. It was a lecture. And that bonus in the end, for the sens. and spec., was just to top that all up. Thank you so much for taking the time to address my questions so thoroughly. I will go over it, together with the links you added, just as carefully. And I hope it is ok to come back if there's something I am stuck at. This community never ceases to amaze me. A thousand thanks!
Sure thing - no problem.
I bellieve am dealing with a complete or quasi-complete separation case. When trying to perform a multivariate analysis for each variable separately, by following the example, I am getting this:
I read some threads recommending bayesglm() or glmnet(). But first; I would like to know if there is anything I can do to optimize my data to use your proposed "Building a predictive model by a list of genes"?
PS1: the same issue appears when I run the combined model
PS2: the data being used is straight away from rma() normalization
PS3: I added the "control" argument to the glm() function because I was getting the following warning message:
Any help is much appreciated again.
How many variables are in the
glm()
? bayesglm and glmnet (lasso-penalised regression) can work with a large number of variables -glm()
cannot.I believe that was exactly my problem. I started with around 1,200 variables (genes), going down to 47 - cutting off by p-value. It worked on the last case. In that case, would you recommend to keep my list of selected variables as low as possible, or would it be worth trying the other methods - bayesglm and glmnet? I know it is briefly commented on this post. I am just afraid to lose some good predictors by excluding them to fit the glm() function. And many thanks again!
I think that it is worth trying various things, to be honest, and then comparing the AUC that you obtain for each. The key variables should be identified by each method.
When you get your 47 variables, by the way, that would be ideal for stepwise regression to further reduce that down. Stepwise regression will likely receive some frowns from Statisticians with whom you work, though. Just always check the models that it chooses and be acutely aware of the standard error.
If you also receive AUCs of 1 or they have very wide confidence intervals, be very suspicious. Same is true for confidence intervals from your models even before you generate ROC curves: if you see confidence intervals like 0.000004 or 12342.77, then that is almost definitely error.