Univariable

Question

Performing univariate and multivariate logistic regression in gene expression data

2

Entering edit mode

5.9 years ago

dodausp ▴ 190

Hi everyone, I have recently moved to clinical settings in a hospital, and as my first assignment I am working with microarray data (Affymetrix) from cancer patients and I am facing a big challenge: how to adapt my experience in basic research to clinical application?

As, I presume, a lot of people here, I am quite familiar with using limma for linear model in microarray and RNAseq data for finding differentially expressed genes (DEGs). The matrix design is quite straight forward. However, I have been introduced to the methodology on our clinical settings and this is the pipeline they use for discovering DEGs:

univariate logistic regression (P<0.05)
to find DEGs that are associated to the cancer type
less stringent to allow for more genes to be discovered
multivariate logistic regression (P<0.01 or less)
analysis performed only for those genes that passed the first analysis
more stringent analysis to sort of validate what was found on the first step
multivariate analysis (P<0.05 or less)
performed only on those genes that passed the second analysis
multivariate de facto: considering other variates such as levels of biomarkers (e.g. CA-125), age and weight
ROC/AUC curves
performed only for those genes that passed analysis 3

Quite frankly I am a bit lost with all those steps (i.e. why not start from step 2 directly?), and consequently in what packages and functions to use in R. I have tried this thread, and this one, but I believe they do not answer my questions to the fullest. So, any help in that regard (i.e. which tools to use, how to design the contrast matrix) will be extremely appreciated!

One last thing, for microarray data would one use gene expression as a response variable and quantitative traits (i.e. normal, cancer) as explanatory, or the other way around? Doesn't limma automatically set the former case?

Thanks a lot in advance!

gene expression logistic regression R Affymetrix • 7.3k views

ADD COMMENT • link updated 3.6 years ago by Kevin Blighe 89k • written 5.9 years ago by dodausp ▴ 190

score 12 · Accepted Answer · 2019-04-08

Note July 22, 2021: I have answered for univariable and multivariable, assuming that you meant these instead of univariate and multivariate. For an interesting discussion, see here: https://stats.stackexchange.com/questions/447455/multivariable-vs-multivariate-regression

Hey,

I will try to be as brief as possible and give you general points. Firstly, you may find this previous answer an interesting read: What is the best way to combine machine learning algorithms for feature selection such as Variable importance in Random Forest with differential expression analysis?

Univariable

This obviously just involves testing each variable (gene) as an independent predictor of the outcome. You have Affymetrix microarrays. For processing these, you should default to the oligo package. affy is another package but it cannot work with the more modern 'ST' Affymetrix arrays.

Limma is still used to fit the regression model independently to each gene / probe-set. A simple workflow may be (you will have to adapt this to your own situation):

library("limma")

# 'oligo' is more suited for the Gene ST Affymetrix arrays
library("oligo")

# Read in the data
# readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets("Targets.txt", sep="\t")
CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)

# Raw intensity data
project <- read.celfiles(CELFiles)

# Background correct, normalize, and calculate gene expression
project.bgcorrect.norm.avg <- rma(project, target="probeset")

# Create the study design and comparison model
design <- paste(targetinfo$CellPopulation, sep="")
design <- factor(design)
design

comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)
comparisonmodel

# Extract the control probes from the chip
ControlProbes <- grep("AFFX",featureNames(project.bgcorrect.norm.avg))
ControlProbes

# Fit the linear model on the study's data
# lmfit uses generalized least squares (GLS), which is better for heteroscedastic data (i.e., data where the variances of sub-populations are different)
project.fitmodel <- lmFit(exprs(project.bgcorrect.norm.avg), comparisonmodel)

# Applying the empirical Bayes method to the fitted values acts as an extra normalization step and aims to bring the different probe-wise variances to common values
# This ultimately reduces the degrees of freedom of variances (i.e., less individual, different, variances).
project.fitmodel.eBayes <- eBayes(project.fitmodel)
names(project.fitmodel.eBayes)

# Make individual contrasts and fit the new model
Q1_contrastmodel <- makeContrasts(Q1="NM-NL", levels=comparisonmodel)
Q1_contrastmodel.fitmodel <- contrasts.fit(project.fitmodel.eBayes, Q1_contrastmodel)
Q1_contrastmodel.fitmodel.eBayes <- eBayes(Q1_contrastmodel.fitmodel)

topTable(Q1_contrastmodel.fitmodel.eBayes, adjust="fdr", coef="Q1", number=9999999, p.value=1)

multivariable

Once you have identified statistically significant independent predictors (genes), you may then want to build a multivariable model using these statistically significant genes. This is where we enter the realm of multivariable logistic regression. I have posted a few answers:

Essentially, you may build a single combined model using glm() (MASS package) and then use some procedure to further reduce the number of predictors. Techniques used here are varied but include:

stepwise regression
Lasso-penalised regression
other classifiers like RandomForest™

In these models, you can include other variables, such as age and weight, and anything else that you want. The best model may well consist of a mixture of genes and clinical parameters.

When you do this, you must ensure that your variables are encoded correctly, i.e., as continuous or categorical variables. If they are categorical, then the categories (factors) must be ordered as per your own decision. The reference level is always positioned first in the ordered list.

ROC analysis

Once you have settled on a final model(s), you can test the AUC. This can be done with pROC, with examples here:

Hey, why don't you go and impress your new colleagues by also calculating sensitivity, specificity, precision, and accuracy, too? - How to calculate the sensitivity and specificity of set of expression dataset used in a validation cohort

Kevin