Question

Exploring association between genes by their expression

0

Entering edit mode

5.5 years ago

dodausp ▴ 190

Hi, everyone

I have a dataset on gene expression of cancer patients, which I performed Cox regression for each gene to find their association with overall survival. I came down to a shortlist of around 1,500 candidates. To my mind, if one could show which genes from that shortlist are associated with each other, and then test whether their combination are still associated with overall survival, that would give more meaning.

So, I thought about linear modeling, where all possible combinations of genes shoud be tested. However, I am a bit stalled with these issues:

linear models are limited to the amount of variables: which means this test cannot be performed on such a big amount of candidates (i.e. lm() or glm() function);
in this case, would linear model (or a variant of it) be the method of choice? Which other method would you recommend?;
given the experience that a lot of you have here, does my rationale on how to handle this dataset make sense at all?

Any help is much appreciated! Thanks.

gene expression correlation R linear model • 2.3k views

ADD COMMENT • link 5.5 years ago by dodausp ▴ 190

0

Entering edit mode

Hi again!

This time, going a bit further with the analysis. I have performed the Cox regression with lasso penalty (glmnet()) and cross-validated with cv.glmnet(), and came out with a total of 31 candidates. To my understanding, and by plotting the K-M curves, those candidates show the best association with overall survival (OS). Say, these were the candidates:

X15605_i_at, X2000_at, X028_s_at, X201324_at, X2672_s_at, X203309_s_at, X203450_at, X2021_at, X204584_at, X074_at, X5118_at, X206347_s_at, X2355932_at, X204687_s_at, X2534305_at, X2568091_at, X132094_x_at, X2116006_x_at, X213138_x_at, X213615_at, X214876_at, X224941_at, X221766_at, X267438_at, X2386763_at, X250080_at, X420501_at, X721821_s_at, X241542_at, X202200_s_at, X367560_at

However, by finding the association of those genes among themselves that would give more meaning to the results. So, my idea was to run a simple multivariate Cox multivariate analysis (coxph()) with those candidates by:

res.cox.OS <- coxph(Surv(OS, OS_codex) ~ X15605_i_at + X2000_at + X028_s_at + X201324_at + X2672_s_at + 
                    X203309_s_at + X203450_at + X2021_at + X204584_at + X074_at + X5118_at + 
                    X206347_s_at + X2355932_at + X204687_s_at + X2534305_at + X2568091_at + X132094_x_at + 
                    X2116006_x_at + X213138_x_at + X213615_at + X214876_at + X224941_at + X221766_at + 
                    X267438_at + X2386763_at + X250080_at + X420501_at + X721821_s_at + X241542_at + 
                    X202200_s_at + X367560_at, data=shortlist_cox)

and found that (for illustration, I will just add those candidates with significant p-values):

> summary(res.cox.OS)
Call:
coxph(formula = Surv(OS, OS_codex) ~ X15605_i_at + X2000_at + X028_s_at + X201324_at + X2672_s_at + 
                    X203309_s_at + X203450_at + X2021_at + X204584_at + X074_at + X5118_at + 
                    X206347_s_at + X2355932_at + X204687_s_at + X2534305_at + X2568091_at + X132094_x_at + 
                    X2116006_x_at + X213138_x_at + X213615_at + X214876_at + X224941_at + X221766_at + 
                    X267438_at + X2386763_at + X250080_at + X420501_at + X721821_s_at + X241542_at + 
                    X202200_s_at + X367560_at, data=shortlist_cox, data = shortlist_cox)

  n= 196, number of events= 133 

                  coef exp(coef)  se(coef)      z Pr(>|z|)   
X15605_i_at   -0.517539  0.595985  0.217436 -2.380  0.01730 * 
X2000_at  0.426465  1.531834  0.149006  2.862  0.00421 **
X028_s_at  0.701472  2.016718  0.234395  2.993  0.00277 **
...
X204687_s_at    0.179346  1.196435  0.068728  2.610  0.00907 **
X2534305_at    0.712923  2.039946  0.259985  2.742  0.00610 **
X2568091_at   -0.465266  0.627968  0.141618 -3.285  0.00102 **
X132094_x_at  0.437751  1.549219  0.156252  2.802  0.00509 **
X2116006_x_at     0.380728  1.463349  0.144458  2.636  0.00840 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results show that there are 8 candidates (p<0.05) whose expression is associated to each other. Ok, now we have the information about those candidates (1) that are associated with OS and (2) whose gene expression is associated with each other. Based on HR ("exp(coef)") values from that table, these results also show for example that low expression of "X15605_i_at" (HR= 0.59) (gene A) or high expression of "X2000_at" (HR= 1.53) (gene B) are associated with a worse OS. And so forth.

Now, if I combine gene A AND gene B and artificially classify patients with low expression of A and high expression of B as "high risk" (and the other way around as "low risk) I obtain a very clear K-M with the difference between those 2 groups - the "high risk" group indeed has a worse OS compared to the "low risk".

However if I combine all 7 candidates to make my 2 groups of patients, there are only 3 patients (out of 196) that fullfill the criteria.

Finally, my question is: is there a way to do some sort of permutation/combination analysis coupled with Cox regression to find the combination of targets that best associates with OS? Considering that this combination of factors is represented say in at least 30% of my samples.

Sorry for the long post. That was the best way I could find to explain it. And as always, any light shed here will be greatly appreciated.

Thanks!

ADD REPLY • link 5.5 years ago by dodausp ▴ 190

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLY • link 5.5 years ago by GenoMax 147k

0

Entering edit mode

Thank you, genomax. The reason why I posted here it was because this is a follow-up question from the original question in this post. But you are right. Thank you for the info.

ADD REPLY • link 5.5 years ago by dodausp ▴ 190

score 6 · Accepted Answer · 2019-05-10

6

Entering edit mode

5.5 years ago

Kevin Blighe 88k

If there is no other way to reduce the number of independent predictors (i.e. setting a lower p-value threshold), then you could try the Cox model with lasso or elastic-net penalty, both implemented in the Coxnet package. This should be able to take the single model of all predictors and determine which are the 'best'. After you take the best predictors from this, you could then build a simple, final, Cox model of the handful of variables that remain.

Kevin

ADD COMMENT • link 5.5 years ago by Kevin Blighe 88k

1

Entering edit mode

Hi, Kevin Again, your insights are spot on. Thanks a lot! In regards to cutting down the amount of variables, I would like to run the following analysis with those candidates because they are already on a threshold of p<0.005. At least, I would like to give it a go on this first. I have read a bit further on the Cox model with lasso and elastic-net penalty. This is exactly what I was expecting. I do have to read more, because I am still lost with the full meaning of its outcome terms (e.g. lambda, nzero). I believe my main hurdle now is how to write the formula with my data in. I have prepared my data based on the requirements to run a glm() function. This is how it looks:

> shortlist_cox[1:10,1:10]
          Age    PFS PFS_codex     OS OS_codex X1552261_at X1552288_at X1552289_a_at X1552299_at 
S1       63.9 117.23         0 115.69        0    3.964586    5.759115      4.279965    6.785513      
S2       64.6  69.27         0  72.30        1    7.059853    7.413032      5.483407    6.945578     
S3       69.6   1.23         1   1.08        1    7.034529    6.259935      4.827770    5.823854      
S4       72.4  15.27         1  69.63        1    6.933748    6.677647      5.087919    6.977539      
S5       43.5  61.43         1  78.41        1    7.335500    5.962986      3.759109    6.361736      
S6       88.3  14.73         1  42.90        1    7.516774    6.237404      4.403221    5.657177      
S7       82.6  14.87         1  31.00        1    6.375484    9.837775      8.215100    7.443165      
S8       49.1  35.00         1  51.12        1    6.700741    6.895096      5.258566    6.495558      
S9       57.9   6.60         1  11.01        1    7.018056    8.715041      6.709328    6.564994      
S10      64.4  16.57         1  27.68        1    6.225923    7.482588      5.312034    7.014927

And these are different lines that I tried running it with no success:

fit <- glmnet(survfit(Surv(OS, OS_codex), data = shortlist_cox), family="cox", maxit = 1000)

Error in Surv(OS, OS_codex) : object 'OS' not found

OR

fit <- glmnet(Surv(shortlist_cox$OS, shortlist_cox$OS_codex), family="cox", maxit = 1000)

Error in drop(y) : argument "y" is missing, with no default

I tried reading finding something similar, but I only came across this one, which is very nice, but still does not use an example for survival analysis. Although I do want to use the workflow described there for dowstream analysis.

Again, any help is much appreciated. And thanks a lot!

ADD REPLY • link 5.5 years ago by dodausp ▴ 190

1

Entering edit mode

Hey, yes, that other thread was my answer for glmnet. You should check the vignette for Cox models with lasso: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#cox

Here is how one can do it:

require(glmnet)
require(survival)

Then, following my example from here to get some data: Survival analysis via Cox Proportional Hazards regression

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])

# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]

# transform the expression data to Z scores
x <- t(scale(t(x)))

# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
  c('age:ch1', 'distant rfs:ch1', 'er:ch1',
    'ggi:ch1', 'grade:ch1', 'node:ch1',
    'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
  row.names = rownames(pData(gset[[1]])))

# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) anyis.na(x)))
metadata <- metadata[!discard,]

# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]

# check that sample names match exactly between pdata and Z-scores 
all((colnames(x) == rownames(metadata)) == TRUE)

# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))

# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
  'GGI', 'Grade', 'Node',
  'Size', 'Time.RFS')

# prepare certain phenotypes
coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))

Now perform the Cox survival analysis with lasso:

coxdata[1:10,1:15]
         Age Distant.RFS ER       GGI Grade Node Size Time.RFS  X1007_s_at
GSM65752  40           0  0   2.48005     3    0  1.2     2280  0.61091071
GSM65753  46           0  1 -0.633592     1    0  1.3     2675  0.66320599
GSM65754  67           1  1  -1.02972     1    0    6      426  0.39016262
GSM65755  41           1  1   1.04395     3    0  3.3      182 -0.01019652
GSM65756  38           1  1  0.683559     3    0  3.2       46  0.71052390
GSM65757  34           0  1   1.05919     2    0  1.6     3952  0.92185361
GSM65758  46           1  1  -1.23306     2    0  2.1     1824  0.01764556
GSM65760  57           1  1  0.679034     3    0  2.2      699  0.23028475
GSM65761  63           1  1   0.31585     2    0  2.8      730  1.19710288
GSM65762  54           1  1  -1.17846     2    0  1.7      882  0.70807685
            X1053_at   X117_at   X121_at  X1255_g_at    X1294_at  X1316_at
GSM65752  0.80321777 0.9576007 0.3236762  0.20730041 -0.57647252 0.5170007
GSM65753  0.12525380 0.4033209 0.1810776  0.08599057  0.43754667 0.5622666
GSM65754 -0.24124193 0.4739672 0.4422982  0.24498162  0.01942667 0.8180497
GSM65755  0.78790284 0.3364908 0.2314423  0.04583503 -0.62393026 0.3874972
GSM65756  0.28652687 0.4308167 0.3995143  0.36486763 -0.71256876 0.9834658
GSM65757  0.35280009 0.5448720 0.2622013 -0.03252980  0.08691080 0.5999085
GSM65758  0.11911061 0.6634477 0.2808262 -0.02828385  0.49225720 0.7275434
GSM65760 -0.07668636 0.3542536 0.4505841  0.33018455  0.63742514 0.6240488
GSM65761  0.31422159 0.3428862 0.3198296  0.17794074  1.01971818 0.7655406
GSM65762 -0.19605026 0.4011981 0.2035890  0.08022851  0.14603593 0.4257209

coxlasso <- glmnet(
  x = data.matrix(coxdata[,10:ncol(coxdata)]),
  y = Surv(coxdata$Time.RFS, coxdata$Distant.RFS),
  family = 'cox')

That should do it!

ADD REPLY • link 5.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Yes! We did quite on a very similar way. Thanks a lot! (:

I now have another question, that I am posting below as an "answer". That because it is downstream to what I want to do next.

ADD REPLY • link 5.5 years ago by dodausp ▴ 190

0

Entering edit mode

Maybe I figured that out. These were the steps:

#First, converting any "0" in my OS (time in months) column to a very small number
#For some reason the glmnet() function doesn't like it
shortlist_cox$OS <- ifelse(shortlist_cox$OS==0, 0.001, shortlist_cox$OS)

#Then creating each object to be used on the glmnet() formula:
b <- paste(colnames(shortlist_cox)[6:10], collapse=" + ")
b <- as.formula(paste0("~ ",b))

x <- shortlist_cox[,6:10]
x <- model.matrix(b, shortlist_cox)
y <- Surv(shortlist_cox$OS,shortlist_cox$OS_codex)

And now, the function:

lasso_OS <- glmnet(x, y, family="cox", standardize=T, alpha=1)

And the cross-validation:

cv.lasso_OS <- cv.glmnet(x, y, family="cox", standardize=T, alpha=1, nfolds=10, parallel=TRUE)

There might be something else I need to further understand as in interpreting the data, but that will posted later.