Tutorial:Survival analysis with gene expression
0
109
Entering edit mode
6.1 years ago

For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into Low | Mid | High based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.

1, read in and prepare data

library(Biobase)
library(GEOquery)

# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])

# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]

# transform the expression data to Z scores
x <- t(scale(t(x)))

# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
  c('age:ch1', 'distant rfs:ch1', 'er:ch1',
    'ggi:ch1', 'grade:ch1', 'node:ch1',
    'size:ch1', 'time rfs:ch1'))

metadata <- data.frame(pData(gset[[1]])[,idx],
  row.names = rownames(pData(gset[[1]])))

# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]

# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]

# check that sample names match exactly between pdata and Z-scores 
all((colnames(x) == rownames(metadata)) == TRUE)
## [1] TRUE

# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))

# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
  'GGI', 'Grade', 'Node',
  'Size', 'Time.RFS')

# prepare phenotypes
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))

2, test each gene independently via Cox regression

With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.

Here we will use RegParallel to fit the Cox model independently for each gene.

library(survival)
library(RegParallel)

res <- RegParallel(
  data = coxdata,
  formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
  FUN = function(formula, data)
    coxph(formula = formula,
      data = data,
      ties = 'breslow',
      singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(coxdata)[9:ncol(coxdata)],
  blocksize = 2000,
  cores = 2,
  nestedParallel = FALSE,
  conflevel = 95)
res <- res[!is.na(res$P),]
res

3, annotate top hits with biomaRt

Filter by Log Rank p<0.01

res <- res[order(res$LogRank, decreasing = FALSE),]
final <- subset(res, LogRank < 0.01)
probes <- gsub('^X', '', final$Variable)

library(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart = mart,
  attributes = c('affy_hg_u133a',
    'ensembl_gene_id',
    'gene_biotype',
    'external_gene_name'),
  filter = 'affy_hg_u133a',
  values = probes,
  uniqueRows = TRUE)

annotLookup

Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was associated with good progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of MMP10 was associated with poor prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.

4, encode statistically significant genes as Low | Mid | High and plot survival curves

# extract RFS and probe data for downstream analysis
survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
  'X203666_at', 'X205680_at')]

colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
  'CXCL12', 'MMP10')

# set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
  ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
  ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))

# relevel the factors to have mid as the ref level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
  levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
  levels = c('Mid', 'Low', 'High'))

library(survminer)

ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
  data = survplotdata),
  data = survplotdata,
  risk.table = TRUE,
  pval = TRUE,
  break.time.by = 500,
  ggtheme = theme_minimal(),
  risk.table.y.text.col = TRUE,
  risk.table.y.text = FALSE)

a

ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
   data = survplotdata),
   data = survplotdata,
   risk.table = TRUE,
   pval = TRUE,
   break.time.by = 500,
   ggtheme = theme_minimal(),
   risk.table.y.text.col = TRUE,
   risk.table.y.text = FALSE)

b

cox gene-expression survival-analysis geo • 41k views
ADD COMMENT
2
Entering edit mode

@Kevin Blighe:

Many thanks for your community contribution in Biostars, this thread is very informative and helpful to learn RNA-Seq analysis.

ADD REPLY
1
Entering edit mode

Kevin,

Unless there is a problem on my end, I think something may have gotten deprecated here. Running code as is only gives me mid and high curves for both genes.

Inspection of

survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS', 'X203666_at', 'X205680_at')]

shows that no samples meet the -1 zscore low expression cutoff (as far as I can see). I don't really have any questions about this. I just thought I would point it out just in case it is a repeatable error.

ADD REPLY
0
Entering edit mode

Hey, that is strange - thanks for the alert. I also just re-ran my own code and observe the same 'phenomenon'. Nothing surprises me anymore in bioinformatics, though. Bioinformatics is like the Wild Wild West.

Here are the new survival curves for this tutorial:

CXCL12

1

-------------------------

MMP10

2

ADD REPLY
0
Entering edit mode

Kevin,

I actually do have a quick question related to this now that I think about it (if you have time). I see you have your expression factor with three levels:

survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))

In theory this was supposed to produce three curves. Lets say I have a similar multi leveled expression factor that produces multiple curves and I want to do a test that makes a pairwise comparison of every single curve. ie low vs mid, mid vs high etc.

I should just be able to run this command at endpoint which as I understand gives a benjamini hochberg adjusted log-rank test p value for every possible comparison of the multiple curves.

pairwise_survdiff(Surv(Time.RFS, Distant.RFS) ~ CXCL12, data=survplotdata)

Is this reasonable?

Thank you

ADD REPLY
0
Entering edit mode

Yes, well, in the example above (my example), we could have done it better by dividing the expression range into tertiles to ensure that there would be at least 1 sample per group. I just chose a hard cut-off of Z=1, though. The tutorial is just to foment ideas, though.

I am not familiar with pairwise_survdiff() but it looks like a useful function. Thanks for mentioning it here.

ADD REPLY
0
Entering edit mode

Hi Kevin. Hope you good. I adapted your code to find the high, low and mid expressions of 14 genes. Can you please help me with a tutorial on how to conduct a pairwise survival plot possibly one that can pair say high level of TPL2 and VEGFA and low level of IGFBP3?

I already tried this but I didnt understand most of it

http://rstudio-pubs-static.s3.amazonaws.com/5896_8f0fed2ccbbd42489276e554a05af87e.html

Thanks.

ADD REPLY
0
Entering edit mode

I am unsure what you mean, but you can create a multivariable Cox model of the following form:

 ~ TPL2 + VEGFA + IGFBP3

...or, just create a new variable that contains every possible combinatino of high | low for these genes and then just use that in the Cox model.

ADD REPLY
0
Entering edit mode

Hi Kevin, I will like to perform a multivariate analysis with my genes and I am thinking of using of high expression as z> 0 and low expression as z<= 0 in order to omit the mid expression bit. Does this look sound? I have been considering using the median as the cut off point as most studies have done but does that mean I have to find the median for all the genes to generate the survival curves?

And could you please help me with a tutorial on how to perform a box plot analysis with my data? I will like to use that to help me understand the expression profile of genes (i.e which ones are highly or low expressed among patients).

Thank you.

ADD REPLY
0
Entering edit mode

Hey again. >0 and <=0 is, essentially, a binary classification. It is not ideal but may have to be used for some genes with. I mean, a value of 0.25 is just 0.25 standard deviations above the mean value, which is not high. Aiming for something like >1.96 and < -1.96 would be better, as |Z|=1.06 is equivalent of p=0.05.

For box-and-whiskers plots, I am not sure... how about this? - A: Boxplot in ggplot2

ADD REPLY
0
Entering edit mode

Ok thanks. What about using the median as the cut-off point? This is because with the previous cut off points 1.0 and -1.0, most of the patients fell into the mid expression group which left very few patients with the high and low expression of genes?

ADD REPLY
0
Entering edit mode

I understand now. Thanks

ADD REPLY
0
Entering edit mode

Median can be used, too, and is better to use the median for non-parametric variables. Keep in mind that, sometimes, scaling (like I do in this tutorial) is not the best approach, and that, in place of this, maintaining the variables on their original scale is better.

ADD REPLY
0
Entering edit mode

Ok so I tried executing a code like this:

survplotdata$VEGFA<-ifelse(survplotdata$VEGFA> median, 'High',ifelse(survplotdata$VEGFA<= median, 'Low', 'Mid'))
survplotdata$VEGFA<- factor(survplotdata$vegfa,levels = c( 'Low', 'High')) 
ggsurvplot(survfit(Surv(OverallSurvival_months,Death)~VEGFA,data=survplotdata),data=survplotdata,risk.table=FALSE,pval=TRUE,ggtheme=theme_pubr(), risk.table.y.text.col=TRUE,risk.table.y.text=FALSE,xlab='Time (months)')

I realised that the curves generated were in line with what I was expecting ie high VEGFA corresponded with low survival and also it split my sample size into two for high risk and low risk. However, due to the answer given by Tom L. I found on the page below, I didnot go through with this.

determining cutoff for Kaplan Meier

ADD REPLY
1
Entering edit mode

Everybody has an opinion on everything. Edit: Tom's opening paragraph makes no sense to me, as, by splitting the gene expression by the median, it's in no way implying that "50% of patients will survive in your analysis". By splitting the gene expression by the median, we are just aiming to determine how higher or lower gene expression relates to survival / relapse.

ADD REPLY
1
Entering edit mode

I totally agree with you on the everyone has an opinion on everything part. My head has been splitting on all the differing views I get. Thanks by the way. Really appreciate it.

ADD REPLY
0
Entering edit mode

Dear Dr. Kevin,

As in the K-M plot clear, after running ggsurvplot we plot Kaplan Meyer which we can see a p-value on it. Here for "MMP10", the p-value equals 0.00047 in your example. I ran the same as your code for my target gene and also ran the Cox Proportional-Hazards Model for that.

In my case, the p-value resulted from the Cox regression is 0.04 but the p-value resulted ggsurvplot for the K-M plot is about 0.1. based on Cox's p-value my study is significant but based on the K-M plot p-value isn't(greater than 0.05). is that results logically acceptable? If yes which p-value should be ignored and which one accepted?

I appreciate it if you share your comment with me.

ADD REPLY
0
Entering edit mode

Please show the exact code that you have used in order to clearly show from where you are deriving your p-values.

ADD REPLY
0
Entering edit mode

For cox regression I use below code:

     fit_SARC_turquoise <- coxph(Surv(coxSARCdata$OS.days, coxSARCdata$OS.status)~MEturquoise, data=coxSARCdata)
summary(fit_SARC_turquoise)

And by runnig that code I got below result:

    n= 71, number of events= 26 

                coef      exp(coef)     se(coef)      z        Pr(>|z|)  
     MEturquoise -0.4242    0.6543     0.2058        -2.061   0.0393 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

               exp(coef)  exp(-coef)  lower .95   upper .95
MEturquoise    0.6543      1.528       0.4371        0.9794

Concordance= 0.602  (se = 0.07 )
Likelihood ratio test= 4     on 1 df,   p=0.05
Wald test            = 4.25  on 1 df,   p=0.04
Score (logrank) test = 4.34  on 1 df,   p=0.04

As you see the P-Value(Pr(>|z|)) equal 0.0393. now in the following I performed K-M plot generating code:

#peparing subsetting MEturquoise from coxSARCdata
survplotSARCturquoisedata <- coxSARCdata[,c('OS.days', 'OS.status',
                                        'MEturquoise')]

survplotSARCturquoisedata$MEturquoise <- ifelse(survplotSARCturquoisedata$MEturquoise > highExpr, 'High','Low')

# plot based on Kaplan-Meier plot
ggsurvplot(survfit(Surv(OS.days, OS.status)~ MEturquoise, data=survplotSARCturquoisedata),
           data = survplotSARCturquoisedata,
           risk.table = TRUE,
           pval = TRUE,
           break.time.by = 500,
           ggtheme = theme_minimal(),
           risk.table.y.text.col = TRUE,
           risk.table.y.text = FALSE)

So, in the following link the result of K-M plot is accecible. and you can see P-value in the plot equals 0.25:

https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0

I appreciate it if you share your comment with me

ADD REPLY
0
Entering edit mode

These are different functions, so, you should not expect that they return the same p-values. Take a look here:

ADD REPLY
0
Entering edit mode

Dear Dr. Blighe Thanks for your comment. but as I wrote in the last line of summary(fit_SARC_turquoise) result you can find Score (log rank) test in which the p-value equals 0.04 by 1 df. but this log rank p-value is different from p-value in K-M plot in this link: https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0

I appreciate it if you share your comment with me

ADD REPLY
0
Entering edit mode

Is survplotSARCturquoisedata the exact same as coxSARCdata?

ADD REPLY
0
Entering edit mode

No, because coxSARCdata has a few columns and survplotSARCturquoisedata is a subset of coxSARCdata. you mean for that reason they don't have similar P-value. Ok, Dear Dr. Blighe, how can I interpret this unsimilarity of 2 log-rank P-value resulted from the Cox regression and K-M plot? Can I insert P-value resulted from Cox regression in the K-M plot picture instead K-M plot P-value?

ADD REPLY
0
Entering edit mode

Yes, you can add any p-value to the K-M plot - all that you need to do is:

...
pval = '0.04',
...

However, you need to be sure that this is the correct thing to do.

ADD REPLY
0
Entering edit mode

Dear Dr. Blighe, I have 2 more questions:

1- I need to show K-M plots for 7 genes in one picture. How can I do it?

2- I need to resize of Font of labels(Survival probability, time,..) in the K-M plot.

I appreciate it if you guide me that how can I do them via my code.

ADD REPLY
0
Entering edit mode

1- I need to show K-M plots for 7 genes in one picture. How can I do it?

2- I need to resize of Font of labels(Survival probability, time,..) in the K-M plot.

ADD REPLY
0
Entering edit mode

Does that help?

ADD REPLY
0
Entering edit mode

So I tried to perfom this analysis with my data:

#loading data from GEO gset <- getGEO('GSE17536', GSEMatrix =TRUE, getGPL=FALSE) x<-exprs(gset[[1]])

#removing Affymetrix control probes 
x<-x[-grep('^AFFX', rownames(x)),] 

#transforming expression data to Z scores 
x<-t(scale(t(x)))

# extract information of interest from the phenotype data (pdata)
idx<-which(colnames(pData(gset[[1]])) %in% c('age:ch1', 'ajcc_stage:ch1', 'dfs_event (disease free survival; cancer recurrence):ch1', 'dfs_time:ch1', 'gender:ch1', 'overall survival follow-up time:ch1', 'overall_event (death from any cause):ch1'))
metadata<-data.frame(pData(gset[[1]])[,idx],row.names=rownames(pData(gset[[1]])))

#removing samples with NA values
discard<-apply(metadata,1,function(x) anyis.na(x))) 
metadata<-metadata[!discard,] 

#filter the Z-scores expression data to match the samples in our data 
x<-x[,which(colnames(x) %in% rownames(metadata))] 

#checking that sample names match exactly bewteen pdata and Z-scores 
all((colnames(x)==rownames(metadata))==TRUE) 

#converting data to 0/1
metadata[1][metadata[1]=="recurrence"] <- 1
metadata[1][metadata[1]=="no recurrence"] <- 0

#creating merged data and Z-scores 
coxdata<-data.frame(metadata,t(x)) 

#tidying column names 
colnames(coxdata)[1:7]<-c('Age', 'Ajcc_stage', 'dfs_event', 'dfs_time', 'Gender', 'OverallSurvival_time', 'Overall_event')

#preparing phenotypes 
coxdata$dfs_event<-as.numeric(coxdata$dfs_event)
coxdata$dfs_time<-as.numeric(coxdata$dfs_time)
coxdata$OverallSurvival_time<-as.numeric(coxdata$OverallSurvival_time)
coxdata$Overall_event<-as.numeric(coxdata$Overall_event)

#testing each gene with the data (RFS first)
Res1<-RegParallel(data=coxdata,formula='Surv(dfs_time, dfs_event)~[*]',FUN=function(formula,data) coxph(formula=formula, data=data, ties='breslow',singular.ok=TRUE), FUNtype='coxph', variables=colnames(coxdata)[8:ncol(coxdata)],blocksize = 2000, cores=2,nestedParallel=FALSE, conflevel=95)

But I keep getting

index1: 54001; index2: 54613 Error in { : task 1 failed - "No (non-missing) observations" after the RegParallel command. Can you tell me why please? P. S: the dataset recorded dfs_event as 'recurrence' and 'no recurrence' and Overall_event as 'death' and 'no death'

ADD REPLY
0
Entering edit mode

Hey, I think that it means that you have a variable that has no values, i.e., a variable that has only NA or infinite values, Have you screened your input data to ensure that all variables are complete?

ADD REPLY
0
Entering edit mode

Hi I realised that whenever I executed the commands:

#preparing phenotypes 
coxdata$dfs_event<-as.numeric(coxdata$dfs_event)
coxdata$Overall_event<-as.numeric(coxdata$Overall_event),

the values for these columns would all change to NA. I did this a number of times and got the same result. Please do you know why this keeps happening?

ADD REPLY
0
Entering edit mode

Please ignore the comma at the end of the code.

ADD REPLY
0
Entering edit mode

That's a change introduced in R 4.0.0. I will have to modify the tutorial code.

You should try:

coxdata$dfs_event <- as.numeric(as.character(coxdata$dfs_event))
coxdata$Overall_event <- as.numeric(as.character(coxdata$Overall_event))
ADD REPLY
0
Entering edit mode

Hey I tried that as well after seeing on a platform like this but I got the same response. Tried again this morning and got the same NA problem.

ADD REPLY
0
Entering edit mode

I also restarted R and re-executed the codes but I keep getting the same response.

ADD REPLY
0
Entering edit mode

Hello again. What is the output of:

str(coxdata$dfs_event)
ADD REPLY
0
Entering edit mode
num [1:177] NA NA NA NA NA NA NA NA NA NA ...
ADD REPLY
0
Entering edit mode

I see, but this is not an issue with my tutorial. You need to properly encode your DFS variables. Here is the pData for your dataset:

df[,grep('dfs', colnames(df))]
          dfs_event (disease free survival; cancer recurrence):ch1 dfs_time:ch1
GSM437093                                            no recurrence       142.55
GSM437094                                            no recurrence       122.72
GSM437095                                            no recurrence        28.96
GSM437096                                            no recurrence       119.21
GSM437097                                            no recurrence        59.53
GSM437098                                               recurrence        27.02
GSM437099                                            no recurrence        82.29
GSM437100                                            no recurrence       109.21
ADD REPLY
0
Entering edit mode

Hello Kevin. Sorry am quite new to R. Please what do you mean when by properly encoding my DFS variables. I also tried to execute the code above and I got this instead:

> df[,grep('dfs',colnames(df))]
Error in df[, grep("dfs", colnames(df))] : 
  object of type 'closure' is not subsettable
ADD REPLY
0
Entering edit mode

I see.. trying to adapt this tutorial to your own data will prove difficult for people who are new to R.I recommend that you first go through the entire tutorial as I have presented (above) - in this way, you will be better equipped to later adapt the code to your own data.

ADD REPLY
0
Entering edit mode

Hi Kevin, I read the as.numeric(as.character(x)) converts my data from factor to character and then to numeric. So I tried this code:

dfs_event<-as.numeric(as.factor(metadata$dfs_event..disease.free.survival..cancer.recurrence..ch1))

hoping that the data will be converted from character to factor to numeric.

And I got this response: dfs_event

[1] 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 1 3
 [33] 2 2 3 2 2 3 2 2 3 2 2 2 2 2 3 2 2 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2
 [65] 2 2 2 2 2 2 2 2 1 2 2 2 3 2 2 2 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3
 [97] 3 2 2 2 2 2 2 2 3 2 3 3 3 2 2 1 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2
[129] 2 3 3 2 2 2 3 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 3 1 1 1 1 3 2 1 1 1
[161] 1 1 2 1 1 1 1 1 3 1 1 2 2 3 1 1 1

where 1: NA, 2: no recurrence, 3: recurrence. Am wondering if this will this affect my COX analysis?

ADD REPLY
0
Entering edit mode

Hello again. The Cox regression function that is used in this tutorial requires data to be:

0, no recurrence / no death
1, recurrence / death

As per my own code (above):

coxdata$Distant.RFS 
  [1] 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0
 [38] 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1
 [75] 1 0 0 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0
[112] 0 1 1 1 0 1 0 0 0 0 0 1 0 1

You will have to encode your variable as 0 and 1.

Take a look at the sub() and gsub() functions

ADD REPLY
0
Entering edit mode

Hi. So this is what I eventually and it seemed to work:

coxdata$dfs_event<- as.numeric(as.factor(coxdata$dfs_event))
coxdata$dfs_event<-sub(2,1,coxdata$dfs_event)
coxdata$dfs_event<-sub(3,0,coxdata$dfs_event)
coxdata$dfs_event<-sub(1,'NA',coxdata$dfs_event)
coxdata$dfs_event<-as.numeric(coxdata$dfs_event)
coxdata$dfs_time <-as.numeric(coxdata$dfs_time)
coxdata$OverallSurvival_time<-as.numeric(coxdata$OverallSurvival_time)
coxdata$Overall_event<- as.numeric(as.factor(coxdata$Overall_event))
coxdata$Overall_event<-sub(2,0,coxdata$Overall_event)
coxdata$Overall_event<-as.numeric(coxdata$Overall_event)
ADD REPLY
0
Entering edit mode

Sure, but, where you use as.numeric(as.factor()) together in this way, you need to be careful about how it converts the factors into numbers - the behaviour may not always be what you expect

ADD REPLY
0
Entering edit mode

Hi Kevin. This may seem odd but I will like to know how R interprets:

ggsurvplot(survfit(Surv(OverallSurvival_months, Death)~TPL2...

and

ggsurvplot(survfit(Surv(OverallSurvival_months)~TPL2...

This is because when I used the second to plot a that had a p value of 0.0024 making the relation significant (which was expected) but the first plot gave a p value of 0.32. I got the first code from a friend who was helping me out.

ADD REPLY
0
Entering edit mode

I would indeed expect different p-values here because the parameters that are passed to Surv() are interpreted differently based on how many are passed. Take a look at ?Surv, or here: https://www.rdocumentation.org/packages/survival/versions/3.2-3/topics/Surv

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thank for the wonderful tutorial and your major contribution to Biostar. I am wondering whether you could explain what you meant by two of the top hits for CXCL12 and MMP10. Are these selected based p-value?

I saw many other genes had the very smaller p-value compared with CXCCL12 and MMP10? I am just a little confused myself and hope you could show why you selected only CXCL12 and MMP10. Hope you can help. Thank in advance.

Kind Regards,

synat

ADD REPLY
0
Entering edit mode

Hello Kevin Blighe I would like to get your help in this

ADD REPLY
0
Entering edit mode

One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))

ADD REPLY
1
Entering edit mode

Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis, (), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.

ADD REPLY
2
Entering edit mode

Gud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?

BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used maxstat (Maximally selected rank statistics) for the cutpoint to classify samples into high and low.

ADD REPLY
1
Entering edit mode

That looks like a good tutorial (through the link that you posted).

The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.

ADD REPLY
0
Entering edit mode

Hi Kevin, do you think this method will work in this case as well. I was worried that it might not work since the gene expression levels have been standardized.

ADD REPLY
0
Entering edit mode

Dear Dr. Blighe

Thank you for this tutorial. I have a question about using Scale() for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?

I appreciate if you share your comment with me.

Mohammad

ADD REPLY
3
Entering edit mode

Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.

In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30, et cetera may mean nothing in the context of the expression range.

The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:

d

[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]

ADD REPLY
0
Entering edit mode

Dear Dr. Blighe

Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLY
2
Entering edit mode

1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?

You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use glm(), as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression via glm.nb() - this may be too advanced, though. Check the manual (via ?RegParallel) and vignette for RegParallel.

Remember that, in RNA-seq, the general process goes:

  1. raw counts -->
  2. normalised counts (statistical analyses performed on these) -->
  3. transformed, normalised counts (for downstream analyses, clustering, PCA, etc.)

2- honestly, I cant understand '~ [*]' in formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'

This is annotation specific to my package, RegParallel. Each gene will replace the [*] symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.

3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on RegParallel(), can I compute 'res' using my phenotype fields? if yes, how can I use these fields in RegParallel()?

Then you are likely aiming to do a survival analysis. In that case, you can use coxph(). Your commands would be:

OS <- RegParallel(
    data = coxdata,
    formula = 'Surv(OS.days, OS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

...or:

RFS <- RegParallel(
    data = coxdata,
    formula = 'Surv(RFS.days, RFS.status) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata),
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

Note, you will likely have to change the value to variables. Variables is a vector of gene names that you want to test. It can be any number. If using RegParallel, the idea is that you have hundreds or thousands or millions of genes to test.

ADD REPLY
0
Entering edit mode

"normalised counts (statistical analyses performed on these) -->" i have this doubt found that you mentioned about it , are you saying about this function "counts(dds, normalized=TRUE)" whose value can be used for any non parametric statistical lest?

As of now i used mostly rlog and vst value for clustering and pca etc . Not sure if i can use any test on the rlog or vst value , as im not sure if its mathematically correct thing to do with the rlog or vst value or not...

If you can clarify it would be really helpful

ADD REPLY
1
Entering edit mode

No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

  1. fit negative binomial regression model independently for each gene's normalised counts
  2. extract p-value from the model coefficient via the Wald test applied to the model

*there are variations to this

One can, of course, produce normalised, transformed counts, and perform their own analyses on these.

ADD REPLY
0
Entering edit mode

"No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:

fit negative binomial regression model independently for each gene's normalised counts extract p-value from the model coefficient via the Wald test applied to the model" yes this part im clear as i read the same in the paper

"of course, produce normalised, transformed counts, and perform their own analyses on these." yes for this one as i get certain genes and i want to make comparison between biological sample .So if i do that comparison running some non parametric test then its not a problem , I guess

ADD REPLY
0
Entering edit mode

I see. Which test are you doing?

ADD REPLY
0
Entering edit mode

I tried wilcoxon rank test as of now...

ADD REPLY
1
Entering edit mode

Seems okay to me. Ask 10 people and you'll get 10 different answers, though. Each answer is based on the respective experience of the individual.

ADD REPLY
0
Entering edit mode

one doubt is clear now

ADD REPLY
0
Entering edit mode

Dear Dr. Blighe

I use TPM(Transaction per million) method for normalizing my RNA-Seq data set. base on your perfect tutorial I ran RegParallel() for getting survival analysis. I need your comment for 2 below questions:

1- I use 'coxph' as FUNtype for the regression model. is it a suitable function for my problem. if no, which function is your suggestion?

2- As you know in literature, we have multivariate Cox regression and univariate Cox regression. based on RegParallel(), our Survival Analysis is multivariate or univariate?

It would be really helpful If you can clarify me.

ADD REPLY
0
Entering edit mode

Yes, coxph is the correct function. TPM is not too bad if you are testing each gene independently, i.e., univariable (in my tutorial, above, each gene is tested independently as part of a univariable Cox model);

ADD REPLY
0
Entering edit mode

Ok, Thanks for your comment. logically, doing multivariate Cox Regression for lots of genes(more than 150 genes) is true? if you agree, how can I run it? can you guide me by tutorial such as the above tutorial?

ADD REPLY
0
Entering edit mode

If you want to run a multivariable Cox model with that many variables, then consider using the Lasso Cox model: https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf

ADD REPLY
0
Entering edit mode

Dear Dr. Blighe

Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.

1- now, for using this data should I scale() for transformation to z-score?

2- based on my explanationabout TCGA data, which functions are better: glm() or glm.nb()?

3- why you didn't use coxph() for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?

I deeply appreciate if you share your comment with me.

Mohammad

ADD REPLY
0
Entering edit mode

Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.

ADD REPLY
0
Entering edit mode

Dear Dr.Blighe

Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?

I appreciate if you share your solution with me

ADD REPLY
0
Entering edit mode

What do you mean by 'n cluster'?

ADD REPLY
0
Entering edit mode

n is number of cluster. For example 3 cluster(n=3).

ADD REPLY
0
Entering edit mode

If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level

ADD REPLY
0
Entering edit mode

Hi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.

If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.

ADD REPLY
0
Entering edit mode

Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.

ADD REPLY
0
Entering edit mode

Thanks Kevin, I tried your suggestion and was able to identify prognostic CpG sites. I did the same using gene expression data and interestingly found some overlapping genes. But I am not very sure how to integrate these two results as methylation can regulate the expression of genes that are in trans. In order to address that, checking just the overlap would not work. I do not know how should I proceed. I will really appreciate if u can share your thoughts about it.

Thanks B

ADD REPLY
0
Entering edit mode

Hi Dr. Blighe, My survplotdata is as below:

             OS.days     OS.status     MEbrown
TCGA_K1_A3PO    1622         1       -2.33294095
TCGA_IE_A4EI     594         1        0.58743015
TCGA_DX_A6B7    1143         1       -0.54661312
TCGA_HB_A5W3     384         0       -0.54716366
TCGA_WK_A8XS    2579         1        0.78227254
TCGA_X6_A8C5    1547         1       -1.38013989
TCGA_QQ_A5VC    1092         1       -0.35551612
TCGA_DX_A48U    3310         1       -1.10047385
TCGA_DX_A3UA     325         0       -0.31312280
TCGA_DX_A3UC    2085         1        1.06574187
TCGA_DX_A48R     805         1        0.88281156
TCGA_PC_A5DO    2464         0        0.06537203
TCGA_IE_A4EK      42         1       -0.80047920
TCGA_DX_A6Z2    3079         1       -0.16302605
TCGA_K1_A42W    1891         1        0.52228371

I used 0 as cut-offs for high and low expression

 highExpr <- 0
 lowExpr <- 0

When I run this code:

survplotdata$MEbrown <- ifelse(survplotdata$MEbrown > highExpr, 'High',
                                  ifelse(survplotdata < lowExpr, 'Low'))

I get this Error:

Error in ifelse(survplotdata < lowExpr, "Low") : 
  argument "no" is missing, with no default

I appreciate if you guide me and share your comment for solving that Error with me.

ADD REPLY
1
Entering edit mode

Take a look at ?ifelse

ADD REPLY
0
Entering edit mode

Thanks, Dr. Blighe. I solved my problem but in the below code:

ggsurvplot(survfit(Surv(OS.days, OS.status) ~ MEbrown,
                   data = survplotdata),
           data = survplotdata,
           risk.table = TRUE,
           pval = TRUE,
           break.time.by = 500,
           ggtheme = theme_minimal(),
           risk.table.y.text.col = TRUE,
           risk.table.y.text = FALSE)

I get below Error:

Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class ‘"by"’ to a data.frame
In addition: Warning messages:
1: In .pvalue(fit, data = data, method = method, pval = pval, pval.coord = pval.coord,  :
  There are no survival curves to be compared. 
 This is a null model.

I appreciate if you guide me and share your comment for solving that Error with me.

ADD REPLY
0
Entering edit mode

Okay, please spend some more time to debug the error on your own. Check the encoding of your variables, and check what survfit() and ggsurvplot() expect.

ADD REPLY
0
Entering edit mode

Hi Kevin, thanks for creating this package. I'm learning survival analysis, and am finding your tutorial is very helpful.

I was wondering regarding your suggestion to arrange the tests by log rank p value. From my understanding, the log rank test is computed comparing survival time between groups. So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1? Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

Thanks, Victor

ADD REPLY
0
Entering edit mode

So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1?

No, the package just accepts whatever data that you use. It can be continuous or categorical. It is just in this tutorial that I dichotomise the gene expression values before using the RegfParallel package.

Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?

You can do whatever approach seems valid to you. There is no correct or incorrect approach.

ADD REPLY
0
Entering edit mode

without clinical information this is not possible to do so isn;t it? If i look at the microarray data of liquid tumor they dont give information as such as you have used here. Is there still a way to run survival analysis ?

ADD REPLY
0
Entering edit mode

Hey, what information do you have, exactly?

ADD REPLY
0
Entering edit mode

so far the microarray data for AML have checked are mostly array expression, they dont give the clinical information of the patients which in this case you have for the breast cancer data set.

ADD REPLY
0
Entering edit mode

Hey kelvin, this is a great tutorial. This is my first time for this kinda analysis, can you please tell how to use data obtained from TCGA both count and clinical data for this analysis.

Thanks

ADD REPLY
1
Entering edit mode

For quick and easy analysis, you can simply use a website like cBioPortal or oncolnc.org

If you want to do it yourself, here's a good tutorial: Survival analysis of TCGA patients integrating gene expression (RNASeq) data

ADD REPLY
0
Entering edit mode

Hi Kevin,

Thank you very much for this helpful tutorial. I spent some time to figure out how to do this analysis before coming across your post. I'd appreciate if you can comment on my approach and please let me know if you find it inaccurate.

I downloaded TCGA RNAseq and miRNAseq data and used voom transformation as follows:


dge_rna <- DGEList(counts = rna_assay, samples = rna_sample_meta)  
keep_rna <- filterByExpr(dge_rna)  
dge_rna <- dge_rna[keep,,keep.lib.sizes=FALSE]  
dge_rna <- calcNormFactors(dge_rna)  
v_rna <- voom(dge_rna, plot=F)
norm_rna <- v$E

dge_mir <- DGEList(counts = mirna_assay, samples = mirna_sample_meta)  
keep_mir <- filterByExpr(dge_mir)  
dge_mir <- dge_mir[keep,,keep.lib.sizes=FALSE]  
dge_mir <- calcNormFactors(dge_mir)  
v_mir <- voom(dge_mir, plot=F)
norm_mir <- v_mir$E

Then I combined these normalized data with clinical parameters such as vital_status and days_to_death to perform survival analysis. I'm recycling this code for 30 separate tumors as a general approach, thus I don't have a predetermined design. I am also trying to calculate correlations between protein-coding-gene vs miRNA pairs to find associations. For my purposes do you think voom normalization is appropriate?

Best, Atakan

ADD REPLY
0
Entering edit mode

Hi Atakan, yes, if I was using data deriving from EdgeR, then I would use the 'voom' expression levels. That is, the voom levels would represent the 'coxdata' object in my tutorial.

ADD REPLY
0
Entering edit mode

Dear Kevin, excellent and comprehensive tutorial as always !! I have three quick questions regarding the implementation of your tutorial: briefly, based on the TCGA-GDC RNA-Seq dataset of breast cancer, i have identified a very small number of genes (~5) with significant differences in overall survival, based on the stratification of cancer samples as high vs low. My next goal is to search additional datasets-even microarrays-to test the same hypothesis, as also if subtype available to correlate it also with survival. Thus, my quick questions are the following:

1) Regarding the pre-processing of microarray data-you scaled only the data, as you have downloaded an already normalized gene expression matrix correct ?

2) I saw you have performed cox regression on relapse-free survival- checked also from the supplementary material, that some of the patients have not received any type of therapy-thus, from my goal and perspective, I can still perform survival using RFS, even to test if these genes exhibit a correlation with survival associated with therapy, even if it is not overall survival ?

3) Even if i have specific gene targets, I can still perform cox regression to investigate if these genes illustrate a significant outcome associated with survival ?

Best,

Efstathios

ADD REPLY
1
Entering edit mode

1) Regarding the pre-processing of microarray data-you scaled only the data, as you have downloaded an already normalized gene expression matrix correct ?

Yes, that is correct, i.e., the data is already normalised (and log [base 2] transformed).

2) I saw you have performed cox regression on relapse-free survival- checked also from the supplementary material, that some of the patients have not received any type of therapy-thus, from my goal and perspective, I can still perform survival using RFS, even to test if these genes exhibit a correlation with survival associated with therapy, even if it is not overall survival ?

Yes, you can perform survival analysis using any metric. It can be 'days to relapse', 'days to death', 'days to first disease occurrence', etc. The term 'survival' was always somewhat misleading.

3) Even if i have specific gene targets, I can still perform cox regression to investigate if these genes illustrate a significant outcome associated with survival ?

Yes, and you can include all genes in the same model, or test each gene independently, i.e., in separate models

ADD REPLY
0
Entering edit mode

Dear Kevin,

thank you very much for your answer !! do you think that based on the experimental design of this dataset-that is the majority of the patients have undergone initial therapy-RFS would be a more "robust" estimate of survival,as essentially if measuring overall survival, is more related to patients without any therapy ? and then I can assume if a statistically significant RFS survival appears, that any gene related is implicated in survival mechanisms related to therapy ? as a measure of resistance ?

ADD REPLY
1
Entering edit mode

I cannot confidently answer these follow up questions

ADD REPLY
0
Entering edit mode

Hi Kevin,

Great tutorial, thanks so much for taking the time to write and share it. I would like to ask a question just to clarify my understanding. Apologies if this is very simple/obvious, I am coming from a pure biology background with not much statistical training.

Am I correct in thinking your code is performing a univariate analysis on each gene? I can see the model is looping to test each variable separately, and that the variables are defined as each gene in the below line:

variables = colnames(coxdata)[9:ncol(coxdata)]

However I am struggling the understand, whether/where the phenotype data (age, ER status, grade etc) is being used by the model. Is it referenced by assigning the data as the full 'coxdata' dataframe, as below?

res <- RegParallel(
    data = coxdata,
    ...
    data = data,
...

If so, is this different from passing the phenotype data as an explicit variable(s) and performing a multivariate analysis on each gene in conjunction with the phenotype data?

I appreciate any advice or direction to further reading to improve my understanding!

Thanks again,

Sian

ADD REPLY
1
Entering edit mode

Hey Sian, yes, it performs a univariate test on each gene / variable that is passed to the variables parameter. This is the same as any standard differential expression program.

If you want to adjust for a covariate, say, ER-status, then you would do something like:

res <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*] + ERstatus',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE,
    conflevel = 95)

I'm aware that the syntax of this package's commands is not too easy to interpret but, in certain respects, I wanted it to be that way in order to avoid any mis-use. To use it, one has to have a general understanding of regression modeling, i suppose.

PS - that will output a line for ERstatus for each gene, so, you may want to automatically exclude those model terms via the excludeTerms parameter.

ADD REPLY
0
Entering edit mode

Hi Kevin. Am back again lol. Is it possible to test the high and low expression of the genes with each of the phenotype data?

ADD REPLY
0
Entering edit mode

Hello again, trust that you are well. I am not sure what you mean, but it sounds like you want to stratify your cohort into high and low, and then re-run it separately?

ADD REPLY
0
Entering edit mode

No please. I want to perform an ANOVA test (I think) to show the relation between the high and low expression of my genes (18 in all) and the phenotype data separately, that is age, gender, UICC and grading (2 or 3).

ADD REPLY
2
Entering edit mode

Then we are talking about a binary logistic regression model:

glm(gene ~ age, ...)
glm(gene ~ gender, ...)

, where gene is encoded as high | low.

Or, you could possible have:

lm(age ~ gene, ...)
glm(gender ~ gene, ...)
ADD REPLY
0
Entering edit mode

Yes please. Could you help me with a tutorial on how to do this please? I used the code

Res<- Regparallel(data=coxdata, formula= 'Surv(OverallSurvival_months, Death)~[*]+Age+Grading+Gender+UICC' ....

But I realised it only shows the relation between the genes as a whole (but not dichotomized into high and low expression) and each of the phenotype data.

ADD REPLY
0
Entering edit mode

I see. So, you need to perform the dichotomisation prior to running RegParallel. This is covered in Part 4 (above), but you will have to find a way to loop over all genes in your input data

ADD REPLY
1
Entering edit mode

Ok. I will try a create a new data frame with the dichotomized genes and the phenotype data. Hope it works out.

ADD REPLY
0
Entering edit mode

Fingers crossed

ADD REPLY
0
Entering edit mode

So I tried doing this,

#creating a new data frame for objective 6
Age<- coxdata$Age
Death<-coxdata$Death
Gender<-coxdata$Gender
Grading<-coxdata$Grading
UICC<-coxdata$UICC
OverallSurvival_months<-coxdata$OverallSurvival_months
Rezidiv<-coxdata$Rezidiv 
TumorFreeSurvival_months<-coxdata$TumorFreeSurvival_months
VEGFA<-res.cat1$VEGFA
ATF2<-res.cat1$ATF2
TPL2<-res.cat1$TPL2
VCAM1<-res.cat1$VCAM1
CXCL2<-res.cat1$CXCL2
HLX1<-res.cat1$HLX1
IGFBP3<-res.cat1$IGFBP3
NDRG1<-res.cat1$NDRG1
coxdata2<-data.frame(Age, Death, Gender, Grading, UICC, OverallSurvival_months, Rezidiv, TumorFreeSurvival_months, VEGFA, ATF2, TPL2, VCAM1, CXCL2, HLX1, NDRG1, IGFBP3)

#testing each gene with the data 
Res1<-RegParallel(data=coxdata2, formula='Surv(OverallSurvival_months, Death)~[*]+Age+ Gender+ Grading+ UICC', FUN=function(formula,data) coxph(formula = formula, data=data, ties='breslow', singular.ok=TRUE), FUNtype='coxph', variables=colnames(coxdata2)[9:ncol(coxdata2)], blocksize=1, cores=2, nestedParallel=FALSE, conflevel=95)

Where the various gene names represent the respective gene columns with the expression values replaced with 'high' and 'low'. But I got this response instead:

##############################
#RegParallel
##############################

System is:
-- Windows
Blocksize:
-- 1
Cores / Threads:
-- 2
Terms included in model:
-- OverallSurvival_months
-- Death
-- Age
-- Gender
-- Grading
-- UICC
First 5 formulae:
-- Surv(OverallSurvival_months, Death) ~ VEGFA + Age + Gender + Grading + UICC
-- Surv(OverallSurvival_months, Death) ~ ATF2 + Age + Gender + Grading + UICC
-- Surv(OverallSurvival_months, Death) ~ TPL2 + Age + Gender + Grading + UICC
-- Surv(OverallSurvival_months, Death) ~ VCAM1 + Age + Gender + Grading + UICC
-- Surv(OverallSurvival_months, Death) ~ CXCL2 + Age + Gender + Grading + UICC
Processing 1 formulae, batch 1 of 9
-- index1: 1; index2: 1
Processing 1 formulae, batch 2 of 9
-- index1: 2; index2: 2
Processing 1 formulae, batch 3 of 9
-- index1: 3; index2: 3
Processing 1 formulae, batch 4 of 9
-- index1: 4; index2: 4
Processing 1 formulae, batch 5 of 9
-- index1: 5; index2: 5
Processing 1 formulae, batch 6 of 9
-- index1: 6; index2: 6
Processing 1 formulae, batch 7 of 9
-- index1: 7; index2: 7
Processing 1 formulae, batch 8 of 9
-- index1: 8; index2: 8
Processing final batch 9 of 9
-- index1: 9; index2: 8
Error in { : task 9 failed - "undefined columns selected"
ADD REPLY
0
Entering edit mode

Are there only 9 genes in your dataset? In that case, I would literally just write out the models individually. RegParallel was really designed for datasets containing 1000s of variables and/or where 1000s or millions of different tests needed to be performed.

It should work based on how you have set it up, though. I wonder could you try to install the current development version and retry the same code:

devtools::install_github('kevinblighe/RegParallel')
ADD REPLY
0
Entering edit mode

After multiple tries, I keep getting this:

package ‘Rcpp’ successfully unpacked and MD5 sums checked
Error: Failed to install 'RegParallel' from GitHub:
  (converted from warning) cannot remove prior installation of package ‘Rcpp’

Oh and you were right about testing the genes individually because of the new data frame. It worked when I tried. Thanks a lot AGAIN.

ADD REPLY
0
Entering edit mode

Good that you got it working. The Rcpp issue may relate to a rights issue, as Rcpp requires installation of system files. No big issue though

ADD REPLY
0
Entering edit mode

Hello Dr. Kevin. Thank you very much for these tutorials. You helping thousands of students from all over the world (Here one from Spain).

I've adapted your code to my HTA 2.0 microarray studio. And I've gone from having 350 candidate genes to 35 genes that influence patient survival. My boss told me I might be able to reduce the number of genes using a multivariable model.

My question is whether your code can be used with a penalized COX multivariable model. The study I am doing is with prostate cancer, and I have many clinical factors that may be helpful (PSA, alkaline phosphatase etc.)

Thank you again

ADD REPLY
0
Entering edit mode

Hola, ¿cómo estás?

The idea of this tutorial is to perform Cox PH independently for each gene, i.e., it is univariate, and this can help to reduce a large number of variables, in your case, 350 to 35.

A penalised Cox regression would take all 350 genes concurrently. The 'final' list of genes would be those whose coefficients are not shrunk (reduced) to 0. You would do this via the glmnet package.

I think that both methods are compatible with each other. After you do the penalised Cox regression, you can still plot the survival curves for some of the genes that make it to the final list.

ADD REPLY
0
Entering edit mode

Muy bien gracias! :P Thank you for your reply. Yes, I will do that. Do you know of any tutorials for doing the penalized Cox regression? I haven't found anything on the Internet applied to genes and clinical data.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thank you for you reply. I got it! But now, one more question. Now that I have the genes identified, I want to validate them with a validation set samples. What method would you use? I have taken my genes that affect patient survival and used them using the clinical data from the validation set patients, and nd I get a 0.9 AUC in ROC. But I think this method is not optimal, right?

ADD REPLY
0
Entering edit mode

Not optimal in which way? You should derive the confidence intervals around the AUC, too.

ADD REPLY
0
Entering edit mode

At first, I used that model with validation patient set to see if the ROC was still high. However, I read that this is not correct, as I am redoing the coefficients, not validating them.

To do a validation, I found this package that allows you to do internal and external validation. https://cran.r-project.org/web/packages/hdnom/vignettes/hdnom.html#2_build_survival_models

ADD REPLY
1
Entering edit mode

For that part, which is somewhat outside of my knowledge area, you may want to ask a question on a stats forum, like CrossValidated. I am actually only relatively recently working in internal and external calibration, so, I do not feel it is my place to provide advice right now.

ADD REPLY
0
Entering edit mode

Hello agan @kevin. Finally I could validate my gene model in the external validation dataset. My question now is: Is there a parsimonious method to reduce the number of genes without having an effect on the final ROC? For example, my gene model has 34 candidates. I would like to know if all 34 are essential or if I can reduce that number without affecting the AUC.

Thank you in advance

ADD REPLY
0
Entering edit mode

Hi Kevin How calculate FDA in COX-PH regression!!!?

ADD REPLY
0
Entering edit mode

What is FDA?

ADD REPLY
0
Entering edit mode

Dear Kevin Blighe , after running RegParallel i got this warning:

In coxph.fit(X, Y, istrat, offset, init, control, weights = weights,  :
Loglik converged before variable  1 ; coefficient may be infinite

and then i got 6000 genes with LogRank < 0.05, which is really strange (GSE had a total of 54000 genes) , is there something wrong?

however, i got the following curve:

enter image description here

and, may you help with another question:

i have used median cutoff for RNA seq data via the code below, but for quantile how can i change the code?

median_value = median(clin_df$gene_value)
clin_df$gene = ifelse(clin_df$gene_value >= median_value, "High expression", "Low expression")

it did work with quantile(clin_df$gene_value), but with warnings and i don't think this is the right way,

Thank you for your kind responses.

ADD REPLY
0
Entering edit mode

Kevin Blighe Extremely thankful to you. Amazing post; really helped me understand the basic of reading K-M plot w.r.t gene expression. Thank you Sir. :-}

ADD REPLY
0
Entering edit mode

Thank you very much for the comments, kountaydwivedi

ADD REPLY
0
Entering edit mode

Hi Kevin, Kevin Blighe

I noticed that what you used for the calculation of the z-score: x <- t(scale(t(x))), produces different result from the scale() function and different result if someone uses the definition of z-score; for example let's say our feature is: "feature" and our data is "data", then the z-score of the "feature" can be calculated by: data$feature_scaled = (data$feature- mean(data$feature))/sd(data$feature). The definition and the scale() produce exactly the same result as it was expected.

Do you have any input on that and why you used the x <- t(scale(t(x))) instead of using the: x <- scale(x) ?

Thanks in advance for your time.

Bests, Thanos

ADD REPLY
0
Entering edit mode

Hi Thanos, The definition of the scale() function is:

scale is generic function whose default method centers and/or scales the columns of a numeric matrix.

So, it only by default scales the columns, on a per column basis. The alternative would be global scaling based on the mean and sdev of the entire matrix.

So, if we want to scale a matrix by row, we must use:

t(scale(t(mat)))
ADD REPLY
0
Entering edit mode

Hi Kevin, Kevin Blighe

Thanks for the clarification. If I understood well, your approach is that given a dataset with features instead of scaling each feature using information from only that feature (using scale() function to calculate z-score), you used information from all the columns to calculate the z-score, since you calculated the mean and the std for the whole matrix aka all the features. If that's the case, I should mention that this approach introduces bias on each datapoint. It correlates each value of a feature with any value of any other feature. In general, it is something that we never do at least in machine learning/deep learning. But maybe I got something wrong here.

In any case, thank you for the clarification.

Bests, Thanos

ADD REPLY

Login before adding your answer.

Traffic: 2576 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6