For this example, we will load GEO breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene's expression into Low
| Mid
| High
based on Z-scores and compare these against RFS in a Cox Proportional Hazards (Cox) survival model.
1, read in and prepare data
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# transform the expression data to Z scores
x <- t(scale(t(x)))
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'node:ch1',
'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]
# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and Z-scores
all((colnames(x) == rownames(metadata)) == TRUE)
## [1] TRUE
# create a merged pdata and Z-scores object
coxdata <- data.frame(metadata, t(x))
# tidy column names
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
'GGI', 'Grade', 'Node',
'Size', 'Time.RFS')
# prepare phenotypes
coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
2, test each gene independently via Cox regression
With the data prepared, we can now apply a Cox survival model independently for each gene (probe) in the dataset against RFS.
Here we will use RegParallel to fit the Cox model independently for each gene.
library(survival)
library(RegParallel)
res <- RegParallel(
data = coxdata,
formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
FUN = function(formula, data)
coxph(formula = formula,
data = data,
ties = 'breslow',
singular.ok = TRUE),
FUNtype = 'coxph',
variables = colnames(coxdata)[9:ncol(coxdata)],
blocksize = 2000,
cores = 2,
nestedParallel = FALSE,
conflevel = 95)
res <- res[!is.na(res$P),]
res
3, annotate top hits with biomaRt
Filter by Log Rank p<0.01
res <- res[order(res$LogRank, decreasing = FALSE),]
final <- subset(res, LogRank < 0.01)
probes <- gsub('^X', '', final$Variable)
library(biomaRt)
mart <- useMart('ENSEMBL_MART_ENSEMBL', host='useast.ensembl.org')
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(mart = mart,
attributes = c('affy_hg_u133a',
'ensembl_gene_id',
'gene_biotype',
'external_gene_name'),
filter = 'affy_hg_u133a',
values = probes,
uniqueRows = TRUE)
annotLookup
Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was associated with good progression free and overall survival in breast cancer in doi: 10.1016/j.cca.2018.05.041, whilst high expression of MMP10 was associated with poor prognosis in colon cancer in doi: 10.1186/s12885-016-2515-7.
4, encode statistically significant genes as Low
| Mid
| High
and plot survival curves
# extract RFS and probe data for downstream analysis
survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
'X203666_at', 'X205680_at')]
colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
'CXCL12', 'MMP10')
# set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))
# relevel the factors to have mid as the ref level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
levels = c('Mid', 'Low', 'High'))
library(survminer)
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
data = survplotdata),
data = survplotdata,
risk.table = TRUE,
pval = TRUE,
break.time.by = 500,
ggtheme = theme_minimal(),
risk.table.y.text.col = TRUE,
risk.table.y.text = FALSE)
@Kevin Blighe:
Many thanks for your community contribution in Biostars, this thread is very informative and helpful to learn RNA-Seq analysis.
Kevin,
Unless there is a problem on my end, I think something may have gotten deprecated here. Running code as is only gives me mid and high curves for both genes.
Inspection of
shows that no samples meet the -1 zscore low expression cutoff (as far as I can see). I don't really have any questions about this. I just thought I would point it out just in case it is a repeatable error.
Hey, that is strange - thanks for the alert. I also just re-ran my own code and observe the same 'phenomenon'. Nothing surprises me anymore in bioinformatics, though. Bioinformatics is like the Wild Wild West.
Here are the new survival curves for this tutorial:
CXCL12
-------------------------
MMP10
Kevin,
I actually do have a quick question related to this now that I think about it (if you have time). I see you have your expression factor with three levels:
In theory this was supposed to produce three curves. Lets say I have a similar multi leveled expression factor that produces multiple curves and I want to do a test that makes a pairwise comparison of every single curve. ie low vs mid, mid vs high etc.
I should just be able to run this command at endpoint which as I understand gives a benjamini hochberg adjusted log-rank test p value for every possible comparison of the multiple curves.
Is this reasonable?
Thank you
Yes, well, in the example above (my example), we could have done it better by dividing the expression range into tertiles to ensure that there would be at least 1 sample per group. I just chose a hard cut-off of Z=1, though. The tutorial is just to foment ideas, though.
I am not familiar with
pairwise_survdiff()
but it looks like a useful function. Thanks for mentioning it here.Hi Kevin. Hope you good. I adapted your code to find the high, low and mid expressions of 14 genes. Can you please help me with a tutorial on how to conduct a pairwise survival plot possibly one that can pair say high level of TPL2 and VEGFA and low level of IGFBP3?
I already tried this but I didnt understand most of it
http://rstudio-pubs-static.s3.amazonaws.com/5896_8f0fed2ccbbd42489276e554a05af87e.html
Thanks.
I am unsure what you mean, but you can create a multivariable Cox model of the following form:
...or, just create a new variable that contains every possible combinatino of
high
|low
for these genes and then just use that in the Cox model.Hi Kevin, I will like to perform a multivariate analysis with my genes and I am thinking of using of high expression as z> 0 and low expression as z<= 0 in order to omit the mid expression bit. Does this look sound? I have been considering using the median as the cut off point as most studies have done but does that mean I have to find the median for all the genes to generate the survival curves?
And could you please help me with a tutorial on how to perform a box plot analysis with my data? I will like to use that to help me understand the expression profile of genes (i.e which ones are highly or low expressed among patients).
Thank you.
Hey again. >0 and <=0 is, essentially, a binary classification. It is not ideal but may have to be used for some genes with. I mean, a value of 0.25 is just 0.25 standard deviations above the mean value, which is not high. Aiming for something like >1.96 and < -1.96 would be better, as |Z|=1.06 is equivalent of p=0.05.
For box-and-whiskers plots, I am not sure... how about this? - A: Boxplot in ggplot2
Ok thanks. What about using the median as the cut-off point? This is because with the previous cut off points 1.0 and -1.0, most of the patients fell into the mid expression group which left very few patients with the high and low expression of genes?
I understand now. Thanks
Median can be used, too, and is better to use the median for non-parametric variables. Keep in mind that, sometimes, scaling (like I do in this tutorial) is not the best approach, and that, in place of this, maintaining the variables on their original scale is better.
Ok so I tried executing a code like this:
I realised that the curves generated were in line with what I was expecting ie high VEGFA corresponded with low survival and also it split my sample size into two for high risk and low risk. However, due to the answer given by Tom L. I found on the page below, I didnot go through with this.
determining cutoff for Kaplan Meier
Everybody has an opinion on everything. Edit: Tom's opening paragraph makes no sense to me, as, by splitting the gene expression by the median, it's in no way implying that "50% of patients will survive in your analysis". By splitting the gene expression by the median, we are just aiming to determine how higher or lower gene expression relates to survival / relapse.
I totally agree with you on the everyone has an opinion on everything part. My head has been splitting on all the differing views I get. Thanks by the way. Really appreciate it.
Dear Dr. Kevin,
As in the K-M plot clear, after running
ggsurvplot
we plot Kaplan Meyer which we can see a p-value on it. Here for "MMP10", the p-value equals 0.00047 in your example. I ran the same as your code for my target gene and also ran the Cox Proportional-Hazards Model for that.In my case, the p-value resulted from the Cox regression is 0.04 but the p-value resulted
ggsurvplot
for the K-M plot is about 0.1. based on Cox's p-value my study is significant but based on the K-M plot p-value isn't(greater than 0.05). is that results logically acceptable? If yes which p-value should be ignored and which one accepted?I appreciate it if you share your comment with me.
Please show the exact code that you have used in order to clearly show from where you are deriving your p-values.
For cox regression I use below code:
And by runnig that code I got below result:
As you see the P-Value(Pr(>|z|)) equal 0.0393. now in the following I performed K-M plot generating code:
So, in the following link the result of K-M plot is accecible. and you can see P-value in the plot equals 0.25:
https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0
I appreciate it if you share your comment with me
These are different functions, so, you should not expect that they return the same p-values. Take a look here:
Dear Dr. Blighe Thanks for your comment. but as I wrote in the last line of
summary(fit_SARC_turquoise)
result you can find Score (log rank) test in which the p-value equals 0.04 by 1 df. but this log rank p-value is different from p-value in K-M plot in this link: https://www.dropbox.com/s/8rn89ithvqfyfqk/Rplot_K-M_MEturquoise_OS_981018.bmp?dl=0I appreciate it if you share your comment with me
Is
survplotSARCturquoisedata
the exact same ascoxSARCdata
?No, because
coxSARCdata
has a few columns andsurvplotSARCturquoisedata
is a subset ofcoxSARCdata
. you mean for that reason they don't have similar P-value. Ok, Dear Dr. Blighe, how can I interpret this unsimilarity of 2 log-rank P-value resulted from the Cox regression and K-M plot? Can I insert P-value resulted from Cox regression in the K-M plot picture instead K-M plot P-value?Yes, you can add any p-value to the K-M plot - all that you need to do is:
However, you need to be sure that this is the correct thing to do.
Dear Dr. Blighe, I have 2 more questions:
1- I need to show K-M plots for 7 genes in one picture. How can I do it?
2- I need to resize of Font of labels(Survival probability, time,..) in the K-M plot.
I appreciate it if you guide me that how can I do them via my code.
Does that help?
So I tried to perfom this analysis with my data:
#loading data from GEO gset <- getGEO('GSE17536', GSEMatrix =TRUE, getGPL=FALSE) x<-exprs(gset[[1]])
But I keep getting
index1: 54001; index2: 54613 Error in { : task 1 failed - "No (non-missing) observations" after the RegParallel command. Can you tell me why please? P. S: the dataset recorded dfs_event as 'recurrence' and 'no recurrence' and Overall_event as 'death' and 'no death'
Hey, I think that it means that you have a variable that has no values, i.e., a variable that has only NA or infinite values, Have you screened your input data to ensure that all variables are complete?
Hi I realised that whenever I executed the commands:
the values for these columns would all change to NA. I did this a number of times and got the same result. Please do you know why this keeps happening?
Please ignore the comma at the end of the code.
That's a change introduced in R 4.0.0. I will have to modify the tutorial code.
You should try:
Hey I tried that as well after seeing on a platform like this but I got the same response. Tried again this morning and got the same NA problem.
I also restarted R and re-executed the codes but I keep getting the same response.
Hello again. What is the output of:
I see, but this is not an issue with my tutorial. You need to properly encode your DFS variables. Here is the pData for your dataset:
Hello Kevin. Sorry am quite new to R. Please what do you mean when by properly encoding my DFS variables. I also tried to execute the code above and I got this instead:
I see.. trying to adapt this tutorial to your own data will prove difficult for people who are new to R.I recommend that you first go through the entire tutorial as I have presented (above) - in this way, you will be better equipped to later adapt the code to your own data.
Hi Kevin, I read the as.numeric(as.character(x)) converts my data from factor to character and then to numeric. So I tried this code:
hoping that the data will be converted from character to factor to numeric.
And I got this response: dfs_event
where 1: NA, 2: no recurrence, 3: recurrence. Am wondering if this will this affect my COX analysis?
Hello again. The Cox regression function that is used in this tutorial requires data to be:
As per my own code (above):
You will have to encode your variable as
0
and1
.Take a look at the
sub()
andgsub()
functionsHi. So this is what I eventually and it seemed to work:
Sure, but, where you use
as.numeric(as.factor())
together in this way, you need to be careful about how it converts the factors into numbers - the behaviour may not always be what you expectHi Kevin. This may seem odd but I will like to know how R interprets:
and
This is because when I used the second to plot a that had a p value of 0.0024 making the relation significant (which was expected) but the first plot gave a p value of 0.32. I got the first code from a friend who was helping me out.
I would indeed expect different p-values here because the parameters that are passed to
Surv()
are interpreted differently based on how many are passed. Take a look at?Surv
, or here: https://www.rdocumentation.org/packages/survival/versions/3.2-3/topics/SurvHi Kevin,
Thank for the wonderful tutorial and your major contribution to Biostar. I am wondering whether you could explain what you meant by two of the top hits for CXCL12 and MMP10. Are these selected based p-value?
I saw many other genes had the very smaller p-value compared with CXCCL12 and MMP10? I am just a little confused myself and hope you could show why you selected only CXCL12 and MMP10. Hope you can help. Thank in advance.
Kind Regards,
synat
Hello Kevin Blighe I would like to get your help in this
One typo was found: discard <- apply(metadata, 1, function(x) anyis.na(x))) should be discard <- apply(metadata, 1, function(x) anyis.na(x)))
Thank you for noticing that. My raw code was actually correct - the error (the lack of an extra parenthesis,
(
), was introduced in the visual representation of my code by the Biostars rendering system. I have added a space, and it now looks fine.Gud one Kevin. I have a question. On what basis Z-scale cutoff 1.0 is selected?
BTW In this tutorial [http://r-addict.com/2016/11/21/Optimal-Cutpoint-maxstat.html] they have used
maxstat
(Maximally selected rank statistics) for the cutpoint to classify samples into high and low.That looks like a good tutorial (through the link that you posted).
The selection of absolute Z=1 was just chosen as a very relaxed threshold for highly / lowly expressed. It is difficult to know where the exact cut-offs should be, and of course biology does not intuitively work on cut-off points. The tutorial above is for fomenting new ideas for survival analysis.
Hi Kevin, do you think this method will work in this case as well. I was worried that it might not work since the gene expression levels have been standardized.
Dear Dr. Blighe
Thank you for this tutorial. I have a question about using
Scale()
for transforming expression data to Z scores. basically, why do we need transforming to z scores while our original data(downloaded from GEO) is normal?I appreciate if you share your comment with me.
Mohammad
Hello Mohammad. The conversion to Z-scores provides for an easier interpretation on the expression range for each gene. For example, on the Z-scale, we know that +3 equates to 3 standard deviations above the mean expression value in the dataset.
In a normal distribution that is not transformed to Z-scale, a value of 10, 20, 30, et cetera may mean nothing in the context of the expression range.
The relationship between a normal distribution and the Z-scale is emphasised in this beautiful figure:
[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]
Dear Dr. Blighe
Thanks for your answer. I have another questions about your SA tutorial due to using RNA-seq expression data:
1-Generally, the measure of expression in RNA-seq is count and different from measure of expression in Microarray Technology. So, for using RNA-seq, Should I modify your survival analysis code? special in Standardization step?
2- honestly, I cant understand
'~ [*]'
informula = 'Surv(Time.RFS, Distant.RFS) ~ [*]'
3- phenotype of my data set has fours fields: 'OS status','OS days','RFS status','RFS days'. So, based on
RegParallel()
, can I compute 'res' using my phenotype fields? if yes, how can I use these fields inRegParallel()
?I deeply appreciate if you share your comment with me.
Mohammad
You should aim to transform your normalised RNA-seq counts via the variance-stabilised or regularised log transformation (if using DESeq2), or produce log CPM counts (if using EdgeR). Then, you can generally use
glm()
, as I use above. If you are aiming to use the normalised, un-transformed counts, then you could use the negative binomial regression viaglm.nb()
- this may be too advanced, though. Check the manual (via?RegParallel
) and vignette for RegParallel.Remember that, in RNA-seq, the general process goes:
This is annotation specific to my package, RegParallel. Each gene will replace the
[*]
symbol as the package tests each gene in an independent model. Again, please read the manual and vignette.Then you are likely aiming to do a survival analysis. In that case, you can use
coxph()
. Your commands would be:...or:
Note, you will likely have to change the value to
variables
. Variables is a vector of gene names that you want to test. It can be any number. If using RegParallel, the idea is that you have hundreds or thousands or millions of genes to test."normalised counts (statistical analyses performed on these) -->" i have this doubt found that you mentioned about it , are you saying about this function "counts(dds, normalized=TRUE)" whose value can be used for any non parametric statistical lest?
As of now i used mostly rlog and vst value for clustering and pca etc . Not sure if i can use any test on the rlog or vst value , as im not sure if its mathematically correct thing to do with the rlog or vst value or not...
If you can clarify it would be really helpful
No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:
*there are variations to this
One can, of course, produce normalised, transformed counts, and perform their own analyses on these.
"No, it is just in the DESeq2 protocol (and EdgeR). The statistical comparisons are conducted on the normalised, un-transformed counts, which follow a negative binomial distribution. DESeq2 derives p-values, generally, as follows:
fit negative binomial regression model independently for each gene's normalised counts extract p-value from the model coefficient via the Wald test applied to the model" yes this part im clear as i read the same in the paper
"of course, produce normalised, transformed counts, and perform their own analyses on these." yes for this one as i get certain genes and i want to make comparison between biological sample .So if i do that comparison running some non parametric test then its not a problem , I guess
I see. Which test are you doing?
I tried wilcoxon rank test as of now...
Seems okay to me. Ask 10 people and you'll get 10 different answers, though. Each answer is based on the respective experience of the individual.
one doubt is clear now
Dear Dr. Blighe
I use TPM(Transaction per million) method for normalizing my RNA-Seq data set. base on your perfect tutorial I ran RegParallel() for getting survival analysis. I need your comment for 2 below questions:
1- I use 'coxph' as FUNtype for the regression model. is it a suitable function for my problem. if no, which function is your suggestion?
2- As you know in literature, we have multivariate Cox regression and univariate Cox regression. based on RegParallel(), our Survival Analysis is multivariate or univariate?
It would be really helpful If you can clarify me.
Yes, coxph is the correct function. TPM is not too bad if you are testing each gene independently, i.e., univariable (in my tutorial, above, each gene is tested independently as part of a univariable Cox model);
Ok, Thanks for your comment. logically, doing multivariate Cox Regression for lots of genes(more than 150 genes) is true? if you agree, how can I run it? can you guide me by tutorial such as the above tutorial?
If you want to run a multivariable Cox model with that many variables, then consider using the Lasso Cox model: https://cran.r-project.org/web/packages/glmnet/vignettes/Coxnet.pdf
Dear Dr. Blighe
Really Thanks for your answer. But about my first question, I would like to explain more about my data set. It belongs to TCGA and I downloaded as UQ-FPKM. In RNA-seq analysis, this type of data set is normal. So, for using that I transformed it to Log2 space.
1- now, for using this data should I
scale()
for transformation to z-score?2- based on my explanationabout TCGA data, which functions are better:
glm()
orglm.nb()
?3- why you didn't use
coxph()
for RNA-seq expression data set in RegParallel vignett? can I use this function for my data set?I deeply appreciate if you share your comment with me.
Mohammad
Sorry, this is not how Biostars functions. I expect you to read my comments and to then spend some time researching the answers to any further questions that you have. That is the best form of learning.
Dear Dr.Blighe
Here you design Survival plot for 2 genes: 'MMP10' and 'CXCL12'. Suppose that we have a bunch of gene and after clustering we have n cluster. how can we design Surv plot for each cluster separately?
I appreciate if you share your solution with me
What do you mean by 'n cluster'?
n is number of cluster. For example 3 cluster(n=3).
If you encode the gene's expression as a factor / categorical variable, then the survival function will plot a curve for each level
Hi Kevin, I am curious to ask can we use Beta values for methylation from each probe instead of the read-count from gene expression.
If yes, these values are continuous and range from 0 to 1, would it be recommended to convert these also to Z score.
Hey, yes, you could use the Beta values from methylation for the purposes of survival analysis. I think that it is okay to leave the values as 0 to 1.
Thanks Kevin, I tried your suggestion and was able to identify prognostic CpG sites. I did the same using gene expression data and interestingly found some overlapping genes. But I am not very sure how to integrate these two results as methylation can regulate the expression of genes that are in trans. In order to address that, checking just the overlap would not work. I do not know how should I proceed. I will really appreciate if u can share your thoughts about it.
Thanks B
Hi Dr. Blighe, My survplotdata is as below:
I used 0 as cut-offs for high and low expression
When I run this code:
I get this Error:
I appreciate if you guide me and share your comment for solving that Error with me.
Take a look at
?ifelse
Thanks, Dr. Blighe. I solved my problem but in the below code:
I get below Error:
I appreciate if you guide me and share your comment for solving that Error with me.
Okay, please spend some more time to debug the error on your own. Check the encoding of your variables, and check what
survfit()
andggsurvplot()
expect.Hi Kevin, thanks for creating this package. I'm learning survival analysis, and am finding your tutorial is very helpful.
I was wondering regarding your suggestion to arrange the tests by log rank p value. From my understanding, the log rank test is computed comparing survival time between groups. So in the RegParallel function, is gene expression being dichotomized? If so, how exactly---is it using Z-score +/- 1? Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?
Thanks, Victor
No, the package just accepts whatever data that you use. It can be continuous or categorical. It is just in this tutorial that I dichotomise the gene expression values before using the RegfParallel package.
Moreover, because gene expression is continuous, would it not make sense to select 'statistically significant' genes based on p value (and adjust those instead of the log rank p value)?
You can do whatever approach seems valid to you. There is no correct or incorrect approach.
without clinical information this is not possible to do so isn;t it? If i look at the microarray data of liquid tumor they dont give information as such as you have used here. Is there still a way to run survival analysis ?
Hey, what information do you have, exactly?
so far the microarray data for AML have checked are mostly array expression, they dont give the clinical information of the patients which in this case you have for the breast cancer data set.
Hey kelvin, this is a great tutorial. This is my first time for this kinda analysis, can you please tell how to use data obtained from TCGA both count and clinical data for this analysis.
Thanks
For quick and easy analysis, you can simply use a website like cBioPortal or oncolnc.org
If you want to do it yourself, here's a good tutorial: Survival analysis of TCGA patients integrating gene expression (RNASeq) data
Hi Kevin,
Thank you very much for this helpful tutorial. I spent some time to figure out how to do this analysis before coming across your post. I'd appreciate if you can comment on my approach and please let me know if you find it inaccurate.
I downloaded TCGA RNAseq and miRNAseq data and used
voom
transformation as follows:Then I combined these normalized data with clinical parameters such as
vital_status
anddays_to_death
to perform survival analysis. I'm recycling this code for 30 separate tumors as a general approach, thus I don't have a predetermined design. I am also trying to calculate correlations between protein-coding-gene vs miRNA pairs to find associations. For my purposes do you thinkvoom
normalization is appropriate?Best, Atakan
Hi Atakan, yes, if I was using data deriving from EdgeR, then I would use the 'voom' expression levels. That is, the voom levels would represent the 'coxdata' object in my tutorial.
Dear Kevin, excellent and comprehensive tutorial as always !! I have three quick questions regarding the implementation of your tutorial: briefly, based on the TCGA-GDC RNA-Seq dataset of breast cancer, i have identified a very small number of genes (~5) with significant differences in overall survival, based on the stratification of cancer samples as high vs low. My next goal is to search additional datasets-even microarrays-to test the same hypothesis, as also if subtype available to correlate it also with survival. Thus, my quick questions are the following:
1) Regarding the pre-processing of microarray data-you scaled only the data, as you have downloaded an already normalized gene expression matrix correct ?
2) I saw you have performed cox regression on relapse-free survival- checked also from the supplementary material, that some of the patients have not received any type of therapy-thus, from my goal and perspective, I can still perform survival using RFS, even to test if these genes exhibit a correlation with survival associated with therapy, even if it is not overall survival ?
3) Even if i have specific gene targets, I can still perform cox regression to investigate if these genes illustrate a significant outcome associated with survival ?
Best,
Efstathios
Yes, that is correct, i.e., the data is already normalised (and log [base 2] transformed).
Yes, you can perform survival analysis using any metric. It can be 'days to relapse', 'days to death', 'days to first disease occurrence', etc. The term 'survival' was always somewhat misleading.
Yes, and you can include all genes in the same model, or test each gene independently, i.e., in separate models
Dear Kevin,
thank you very much for your answer !! do you think that based on the experimental design of this dataset-that is the majority of the patients have undergone initial therapy-RFS would be a more "robust" estimate of survival,as essentially if measuring overall survival, is more related to patients without any therapy ? and then I can assume if a statistically significant RFS survival appears, that any gene related is implicated in survival mechanisms related to therapy ? as a measure of resistance ?
I cannot confidently answer these follow up questions
Hi Kevin,
Great tutorial, thanks so much for taking the time to write and share it. I would like to ask a question just to clarify my understanding. Apologies if this is very simple/obvious, I am coming from a pure biology background with not much statistical training.
Am I correct in thinking your code is performing a univariate analysis on each gene? I can see the model is looping to test each variable separately, and that the variables are defined as each gene in the below line:
However I am struggling the understand, whether/where the phenotype data (age, ER status, grade etc) is being used by the model. Is it referenced by assigning the data as the full 'coxdata' dataframe, as below?
If so, is this different from passing the phenotype data as an explicit variable(s) and performing a multivariate analysis on each gene in conjunction with the phenotype data?
I appreciate any advice or direction to further reading to improve my understanding!
Thanks again,
Sian
Hey Sian, yes, it performs a univariate test on each gene / variable that is passed to the
variables
parameter. This is the same as any standard differential expression program.If you want to adjust for a covariate, say, ER-status, then you would do something like:
I'm aware that the syntax of this package's commands is not too easy to interpret but, in certain respects, I wanted it to be that way in order to avoid any mis-use. To use it, one has to have a general understanding of regression modeling, i suppose.
PS - that will output a line for ERstatus for each gene, so, you may want to automatically exclude those model terms via the
excludeTerms
parameter.Hi Kevin. Am back again lol. Is it possible to test the high and low expression of the genes with each of the phenotype data?
Hello again, trust that you are well. I am not sure what you mean, but it sounds like you want to stratify your cohort into high and low, and then re-run it separately?
No please. I want to perform an ANOVA test (I think) to show the relation between the high and low expression of my genes (18 in all) and the phenotype data separately, that is age, gender, UICC and grading (2 or 3).
Then we are talking about a binary logistic regression model:
, where gene is encoded as
high
|low
.Or, you could possible have:
Yes please. Could you help me with a tutorial on how to do this please? I used the code
But I realised it only shows the relation between the genes as a whole (but not dichotomized into high and low expression) and each of the phenotype data.
I see. So, you need to perform the dichotomisation prior to running RegParallel. This is covered in Part 4 (above), but you will have to find a way to loop over all genes in your input data
Ok. I will try a create a new data frame with the dichotomized genes and the phenotype data. Hope it works out.
Fingers crossed
So I tried doing this,
Where the various gene names represent the respective gene columns with the expression values replaced with 'high' and 'low'. But I got this response instead:
Are there only 9 genes in your dataset? In that case, I would literally just write out the models individually. RegParallel was really designed for datasets containing 1000s of variables and/or where 1000s or millions of different tests needed to be performed.
It should work based on how you have set it up, though. I wonder could you try to install the current development version and retry the same code:
After multiple tries, I keep getting this:
Oh and you were right about testing the genes individually because of the new data frame. It worked when I tried. Thanks a lot AGAIN.
Good that you got it working. The Rcpp issue may relate to a rights issue, as Rcpp requires installation of system files. No big issue though
Hello Dr. Kevin. Thank you very much for these tutorials. You helping thousands of students from all over the world (Here one from Spain).
I've adapted your code to my HTA 2.0 microarray studio. And I've gone from having 350 candidate genes to 35 genes that influence patient survival. My boss told me I might be able to reduce the number of genes using a multivariable model.
My question is whether your code can be used with a penalized COX multivariable model. The study I am doing is with prostate cancer, and I have many clinical factors that may be helpful (PSA, alkaline phosphatase etc.)
Thank you again
Hola, ¿cómo estás?
The idea of this tutorial is to perform Cox PH independently for each gene, i.e., it is univariate, and this can help to reduce a large number of variables, in your case, 350 to 35.
A penalised Cox regression would take all 350 genes concurrently. The 'final' list of genes would be those whose coefficients are not shrunk (reduced) to 0. You would do this via the glmnet package.
I think that both methods are compatible with each other. After you do the penalised Cox regression, you can still plot the survival curves for some of the genes that make it to the final list.
Muy bien gracias! :P Thank you for your reply. Yes, I will do that. Do you know of any tutorials for doing the penalized Cox regression? I haven't found anything on the Internet applied to genes and clinical data.
Yep / Sí, you could try this: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#cox
Thank you for you reply. I got it! But now, one more question. Now that I have the genes identified, I want to validate them with a validation set samples. What method would you use? I have taken my genes that affect patient survival and used them using the clinical data from the validation set patients, and nd I get a 0.9 AUC in ROC. But I think this method is not optimal, right?
Not optimal in which way? You should derive the confidence intervals around the AUC, too.
At first, I used that model with validation patient set to see if the ROC was still high. However, I read that this is not correct, as I am redoing the coefficients, not validating them.
To do a validation, I found this package that allows you to do internal and external validation. https://cran.r-project.org/web/packages/hdnom/vignettes/hdnom.html#2_build_survival_models
For that part, which is somewhat outside of my knowledge area, you may want to ask a question on a stats forum, like CrossValidated. I am actually only relatively recently working in internal and external calibration, so, I do not feel it is my place to provide advice right now.
Hello agan @kevin. Finally I could validate my gene model in the external validation dataset. My question now is: Is there a parsimonious method to reduce the number of genes without having an effect on the final ROC? For example, my gene model has 34 candidates. I would like to know if all 34 are essential or if I can reduce that number without affecting the AUC.
Thank you in advance
Hi Kevin How calculate FDA in COX-PH regression!!!?
What is FDA?
Dear Kevin Blighe , after running RegParallel i got this warning:
and then i got 6000 genes with LogRank < 0.05, which is really strange (GSE had a total of 54000 genes) , is there something wrong?
however, i got the following curve:
and, may you help with another question:
i have used median cutoff for RNA seq data via the code below, but for quantile how can i change the code?
it did work with
quantile(clin_df$gene_value)
, but with warnings and i don't think this is the right way,Thank you for your kind responses.
Kevin Blighe Extremely thankful to you. Amazing post; really helped me understand the basic of reading K-M plot w.r.t gene expression. Thank you Sir. :-}
Thank you very much for the comments, kountaydwivedi
Hi Kevin, Kevin Blighe
I noticed that what you used for the calculation of the z-score: x <- t(scale(t(x))), produces different result from the scale() function and different result if someone uses the definition of z-score; for example let's say our feature is: "feature" and our data is "data", then the z-score of the "feature" can be calculated by: data$feature_scaled = (data$feature- mean(data$feature))/sd(data$feature). The definition and the scale() produce exactly the same result as it was expected.
Do you have any input on that and why you used the x <- t(scale(t(x))) instead of using the: x <- scale(x) ?
Thanks in advance for your time.
Bests, Thanos
Hi Thanos, The definition of the
scale()
function is:So, it only by default scales the columns, on a per column basis. The alternative would be global scaling based on the mean and sdev of the entire matrix.
So, if we want to scale a matrix by row, we must use:
Hi Kevin, Kevin Blighe
Thanks for the clarification. If I understood well, your approach is that given a dataset with features instead of scaling each feature using information from only that feature (using scale() function to calculate z-score), you used information from all the columns to calculate the z-score, since you calculated the mean and the std for the whole matrix aka all the features. If that's the case, I should mention that this approach introduces bias on each datapoint. It correlates each value of a feature with any value of any other feature. In general, it is something that we never do at least in machine learning/deep learning. But maybe I got something wrong here.
In any case, thank you for the clarification.
Bests, Thanos