Tutorial:Survival analysis of TCGA patients integrating gene expression (RNASeq) data
1
186
Entering edit mode
9.3 years ago
TriS ★ 4.7k

I found myself being often confused about how to do this and by various posts and tutorials online I realized that, if I could put together a step-by-step tutorial, that might be (hopefully) useful...so there we go.

First question: Why would you want to do survival analysis based on gene expression data? Well, let's say you have a number of genes that you are interested in and they are differentially expressed between tumor and normal samples, it would be very powerful to show that alteration in gene expression correlates with worse survival or earlier tumor recurrence.

Here's a brief overview of what's in here:

  1. obtain the data (RNASeq and clinical data)
  2. prepare the data for the analysis
  3. run the analysis and interpret the results

In this tutorial I will be using data from kidney cancer (clear cell renal cell carcinoma - KIRC) that you can get as follows:

  1. ​go to FireBrowse (http://gdac.broadinstitute.org/), select "Kidney renal clear cell carcinoma" -> "Browse"
  2. it should open a page with various data sets for KIRC
  3. from "mRNASeq" select "illuminahiseq_rnaseqv2-RSEM_genes_normalized" and save it on your computer
  4. from "Clinical" select "Merge_Clinical" and download it
  5. unzip the files by double clicking on them
  6. rename the folders as "RNA" and "Clinical"

Now you can switch to R, you will need the package "survival" (https://cran.r-project.org/web/packages/survival/index.html)

# set directory to where the RNA and Clinical folders are
setwd("/User/myUser/tutorial")
library(survival)

# read RNA file 
rna <- read.table('RNA/KIRC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt',nrows=20533, header=T,row.names=1,sep='\t')
# and take off first row cause we don't need it
rna <- rna[-1,]

# and read the Clinical file, in this case I transposed it to keep the clinical feature title as column name
clinical <- t(read.table('Clinical/KIRC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t'))

# first I remove genes whose expression is == 0 in more than 50% of the samples:
rem <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  r <- as.numeric(apply(x,1,function(i) sum(i == 0)))
  remove <- which(r > dim(x)[2]*0.5)
  return(remove)
}
remove <- rem(rna)
rna <- rna[-remove,]

Now I need to identify normal and tumor samples. this is done using the TCGA barcode (https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode). The two digits at position 14-15 of the barcode will indicate teh sample type, from the link:

"Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29."

# see the values
table(substr(colnames(rna),14,14))
# 0      1 
# 534   72

So we have 534 tumor and 72 normal

# get the index of the normal/control samples
n_index <- which(substr(colnames(rna),14,14) == '1')
t_index <- which(substr(colnames(rna),14,14) == '0')

# apply voom function from limma package to normalize the data
vm <- function(x){
  cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
  d <- model.matrix(~1+cond)
  x <- t(apply(x,1,as.numeric))
  ex <- voom(x,d,plot=F)
  return(ex$E)
}

rna_vm  <- vm(rna)
colnames(rna_vm) <- gsub('\\.','-',substr(colnames(rna),1,12))

# and check how data look, they should look normally-ish distributed
hist(rna_vm)
# we can remove the old "rna" cause we don't need it anymor
rm(rna)

Now we can finally scale the data. the reason to do so is because we don't want to use ONLY the fold changes. if we use FC then we average the expression values across all samples, losing the heterogeity that is characteristic of those data. we therefore transform data to z-score so that per each patient for each gene we will have a measure of how many SD away from the mean that is and we will consider those with Z > +/- 1.96 (roughly p=0.05 or 2 SD away) to be differentially expressed

To obtain z-scores for the RNASeq data, we use following formula:

z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal)
# calculate z-scores
scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  # z score as (value - mean normal)/SD normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
    }
  }
  return(res)
}
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])
# set the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,'\\|'))[[1]])

rm(rna_vm) #we don't need it anymore

It might take some time, depending from your computer, be patient

Now we can start using the clinical data:

# match the patient ID in clinical data with the colnames of z_rna
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna)) # we have 529 patients that we could use

# get the columns that contain data we can use: days to death, new tumor event, last day contact to....
ind_keep <- grep('days_to_new_tumor_event_after_initial_treatment',colnames(clinical))

# this is a bit tedious, since there are numerous follow ups, let's collapse them together and keep the first value (the higher one) if more than one is available
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
  if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
    m <- min(new_tum[i,],na.rm=T)
    new_tum_collapsed <- c(new_tum_collapsed,m)
  } else {
    new_tum_collapsed <- c(new_tum_collapsed,'NA')
  }
}

# do the same to death
ind_keep <- grep('days_to_death',colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
  if ( sum ( is.na(death[i,])) < dim(death)[2]){
    m <- max(death[i,],na.rm=T)
    death_collapsed <- c(death_collapsed,m)
  } else {
    death_collapsed <- c(death_collapsed,'NA')
  }
}

# and days last follow up here we take the most recent which is the max number
ind_keep <- grep('days_to_last_followup',colnames(clinical))
fl <- as.matrix(clinical[,ind_keep])
fl_collapsed <- c()
for (i in 1:dim(fl)[1]){
  if ( sum (is.na(fl[i,])) < dim(fl)[2]){
    m <- max(fl[i,],na.rm=T)
    fl_collapsed <- c(fl_collapsed,m)
  } else {
    fl_collapsed <- c(fl_collapsed,'NA')
  }
}

# and put everything together
all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
colnames(all_clin) <- c('new_tumor_days', 'death_days', 'followUp_days')

Now, to do survival analysis we need three main things:

  1. time: this is the time till an event happens
  2. status: this indicates which patients have to be kept for the analysis
  3. event: this tells i.e. which patients have the gene up- or down-regulated or have no changes in expression

Since we want to do censored analysis, we need to have something to censor the data with. For example, if a patient has no death data BUT there is a date to last followup it means that after that day we know nothing about the patient, therefore after that day it cannot be used for calculations/Kaplan Meier plot anymore, therefore we censor it. so now we need to create vectors for both 'time to new tumor' and 'time to death' that contain also the data from censored individuals.

# create vector with time to new tumor containing data to censor for new_tumor
all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
  all_clin$new_time[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                    as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}

# create vector time to death containing values to censor for death
all_clin$new_death <- c()
for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
  all_clin$new_death[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$death_days))[i]),
                                 as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i])
}

Now we need to create the vector for censoring the data which means telling R which patients are dead or have new tumor. in this case if a patient has a "days_to_death" it will be assigned 1, and used in the corresponding analysis. the reason why we censor with death events even for recurrence is pretty important. a colleague made me notice that this is a competitive risk problem, where, although a patient can recur and then die, if a patient is dead, it will not recur, therefore is more accurate to censor for death events.

# create vector for death censoring
table(clinical$patient.vital_status)
# alive dead
# 372   161

all_clin$death_event <- ifelse(clinical$patient.vital_status == 'alive', 0,1)

#finally add row.names to clinical
rownames(all_clin) <- clinical$IDs

Now, one more thing to do is to use the z-score values we obtained from the RNASeq data and define which samples are altered or do not change in this case I just look at mRNA deregulation, you can divide the data into up- and down- regulated too if you wish

# create event vector for RNASeq data
event_rna <- t(apply(z_rna, 1, function(x) ifelse(abs(x) > 1.96,1,0)))

# since we need the same number of patients in both clinical and RNASeq data take the indices for the matching samples
ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))
ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

# pick your gene of interest
ind_gene <- which(rownames(z_rna) == 'TP53')

# check how many altered samples we have
table(event_rna[ind_gene,])

# run survival analysis
s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])
s1 <- tryCatch(survdiff(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]), error = function(e) return(NA))

# extraect the p.value
pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]

# plot the data
plot(survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]),
     col=c(1:3), frame=F, lwd=2,main=paste('KIRK',rownames(z_rna)[ind_gene],sep='\n'))

# add lines for the median survival
x1 <- ifelse ( is.na(as.numeric(summary(s)$table[,'median'][1])),'NA',as.numeric(summary(s)$table[,'median'][1]))
x2 <- as.numeric(summary(s)$table[,'median'][2])
if(x1 != 'NA' & x2 != 'NA'){
  lines(c(0,x1),c(0.5,0.5),col='blue')
  lines(c(x1,x1),c(0,0.5),col='black')
  lines(c(x2,x2),c(0,0.5),col='red')
}

# add legend
legend(1800,0.995,legend=paste('p.value = ',pv[[1]],sep=''),bty='n',cex=1.4)
legend(max(as.numeric(as.character(all_clin$death_days)[ind_clin]),na.rm = T)*0.7,0.94,
       legend=c(paste('NotAltered=',x1),paste('Altered=',x2)),bty='n',cex=1.3,lwd=3,col=c('black','red'))

image: plot

Now, the p.value in this case tells you that patients with altered TP53 have a worse survival…sad but true.

The same concept can be applied to time to new tumor (new_time) (disease free survival, or DFS) and the resulting graph is the following:

image: plot

This approach is very flexible, I leave to the reader the satisfaction to subset the events in up/downregulation and/or add covariates.

I know there are a number of different ways of doing this, hopefully this will be clear enough that the reader can replicate this.

Lastly, thanks for getting till down here, comments, critiques, corrections etc are always welcome :)

UPDATES

  • used voom transformation for RNASeq data

  • created the death event vector based on the vital status from the clinical data

  • updated code and images based on the results

  • corrected the last rna_log variable (leftover from previous versions) to rna_vm

  • updated "min" to "max" for the "days_to_last_followup" based on cying comment

  • updated "min" to "max" for the new_tumor and death
  • changed the ifsumis issue with brackets
RNA-Seq Survival TCGA • 101k views
ADD COMMENT
4
Entering edit mode

I think this is a great tutorial, thanks for posting!

One thing I would add is to be careful of how you define a recurrence for the DFS data. While overall survival is rather straightforward, DFS can get complicated and is tumor dependent. For example, in head and neck cancer a patient is clinically classified as "WITH TUMOR" until the tests can be performed to verify the tumor is gone (which can be months from the original appointment). This can result in a "WITH TUMOR" classification in TCGA data for the first followup when really the patient was tumor free, they just haven't had the test done yet. Also, in HNSC tumors patients re-diagnosed with a tumor within 6 months of treatment completion it is considered part of the original tumor, not a new tumor, so it is residual tumor and not a recurrence. For HNSC at least (according to my expert collaborators) a re-diagnosed patient is considered to have a recurrence of the original tumor if it's between 6 months and 5 years, and 5 years post-treatment it is considered a new primary tumor. Of course these may be different for each tumor type. So, if one is looking to publish survival plot for DFS I would urge them to talk to a specialist in the cancer type to get specific guidelines on what disease-free survival means, and then manually clean up the DFS time points and events before plotting using the treatment data as a guide.

ADD REPLY
1
Entering edit mode

@alolex - thanks for the insightful feedback. you definitely make a valid point and I agree with you, the concept of DFS can be molded depending upon annotation accuracy/frequency and tumor type. and also, yes, before publishing any study on clinical data I would warmly suggest to talk to a clinician/statistician to make sure that the results make sense.

As for the KIRC data go, there is a person_neoplasm_cancer_status field that indicates whether the patient is "with tumor" or "tumor free". for the purpose of the tutorial I censored based on the presence of a days_to_new_tumor_event value but I agree it is more accurate to use the "with tumor" or "tumor free" indicators. I will update the code asap. Thanks

ADD REPLY
4
Entering edit mode

Hi!

Thank you for this tutorial!! I trid the KIRK dataset and overall, it worked great!

I had to change the following line because of an error:

clinical <- t(read.table('Clinical/KIRC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t'))

changed to:

clinical <- read.table('Clinical/KIRC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t')
clinical <- as.data.frame(t(clinical))

Reason:

clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in toupper(clinical$patient.bcr_patient_barcode) : 
  error in evaluating the argument 'x' in selecting a method for function 'toupper': Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors

In addition, I get an error when I want to calculate p-values, however, I cant solve the problem:

pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
Error in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  : 
  error in evaluating the argument 'yes' in selecting a method for function 'ifelse': Error: no loop for break/next, jumping to top level

Can someone help me to fix this?

Regards, G

ADD REPLY
0
Entering edit mode

Ok, got it, this one works:

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Regards, G

ADD REPLY
0
Entering edit mode

gitarrestunden - thanks for the feedback. I see a few issues in the code that you wrote, and that includes parenthesis and spaces between words. I'm not sure how the following line works since the parenthesis are off :

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

and it should be more like :

pv <- ifelseis.na(s1), next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
ADD REPLY
0
Entering edit mode

Yes, I know what you mean, but the ifelse statement does not work in my system. I tried with correct parentheses and with or without commas/spaces...

> pv <- ifelseis.na(s1)),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
Error: unexpected ',' in 'pv <- ifelseis.na(s1)),'

This is the only working version on my system:

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]

Strange..

ADD REPLY
0
Entering edit mode

And it is also strange that the first parenthesis which should come directly after the "if" always disappears when I click the add comment button!!

ADD REPLY
0
Entering edit mode

hmm...I wonder if there is a "lost" parenthesis somewhere else in your code

the ifelse function in R requires the commas and the reason why it tells you that there is an unexpected "," is because there is a parenthesis missing. the code you wrote is a one liner that sums up the formal if/else statements

for the parenthesis not showing up in the forum, I really have no idea :)

ADD REPLY
0
Entering edit mode

Hi Tris thank you for the tutorial, it is very useful. I get a similar error in generating p- value. Also some of the corrections made in the command line still shows an error. Can someone help fix this. Thank you in advance.

> pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
**Error** in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  :  
    no loop for break/next, jumping to top level

> pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
**Error**: unexpected symbol in " pv <- if is.na"

> pv <- ifelseis.na(s1), next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]
**Error**: unexpected ',' in "pv <- ifelseis.na(s1),"
ADD REPLY
0
Entering edit mode

it seems that there is a problem on the comment above where some brackets don't show up.

pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
ADD REPLY
0
Entering edit mode

Hi Tris, Thanks a lot for the correction.

ADD REPLY
0
Entering edit mode
pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
Error in ifelseis.na(s1), next, (round(1 - pchisq(s1$chisq, length(s1$n) -  : 
  no loop for break/next, jumping to top level

Can you tell me how to correct this error

ADD REPLY
0
Entering edit mode

how does your s1 object look?

ADD REPLY
0
Entering edit mode

s1 has 6 elements

> s1$chisq
[1] 1.528015
> s1$n
groups
event_rna[ind_gene, ind_tum]=0 event_rna[ind_gene, ind_tum]=1 
                           596                            497 
> s1$obs
[1] 50 54
> s1$exp
[1] 56.27811 47.72189
> s1$var
          [,1]      [,2]
[1,]  25.79472 -25.79472
[2,] -25.79472  25.79472
> s1$call
survdiff(formula = Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], 
    all_clin$death_event[ind_clin]) ~ event_rna[ind_gene, ind_tum])
ADD REPLY
0
Entering edit mode

hmmm, ok, try to run s1 and summary(s1) you should get something like the following

> s1
Call:
survdiff(formula = f ~ a)

      N Observed Expected (O-E)^2/E (O-E)^2/V
a=1  77       44     45.5    0.0464    0.0794
a=2 106       69     67.5    0.0312    0.0794

 Chisq= 0.1  on 1 degrees of freedom, p= 0.778 
> summary(s1)
      Length Class  Mode   
n     2      table  numeric
obs   2      -none- numeric
exp   2      -none- numeric
var   4      -none- numeric
chisq 1      -none- numeric
call  2      -none- call

this is from a different analysis but I copied and pasted your code and it works fine

> ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),3)))[[1]]
[1] 0.778
ADD REPLY
4
Entering edit mode

How can perform survival analysis more than one genes like (TP53+PIK3)

ADD REPLY
0
Entering edit mode

see comments above where I addressed this question

ADD REPLY
3
Entering edit mode

Hi TriS,

Thank you very much for posting this. It was very helpful.

I hope you don't mind, but I decided to fork your code up to github, so that there's an example that's runs out of box. https://github.com/mforde84/RNAseq-Survival-Analysis-TCGA-KIRC

My results are slightly different than yours, so I would very much appreciate if you would review my code and see if anything erroneous sticks out. I simplified a number of things as well (like the input data set and preprocessing steps), just to help improve readability.

I think the differences are due to two things: 1)different versions of the RNAseq dataset in question (mine was accessed on Feb 2, 2017), and 2) specific R functionality has changed in less obvious ways which may break or alter the results of the analysis. For 1), collapsing and calculating the censors using the omics data provided generates slightly different results, but only for a handful of samples. Not much, but there certainly is a difference. For two, e.g., reading the clinical omics table into a R now creates an atomic character vector which doesn't utilize the $ operator, so later in your scripting you'll throw an error:

> clinical <- t(read.table("KIRC.merged_only_clinical_clin_format.txt", header=T, row.names=1, sep="\t"))
> typeof(clinical)
[1] "character"
> clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors
> clinical <- data.frame(clinical) #working solution

Not a big deal. Just thought it would be helpful to keep something current for anyone who want to learn.

Thanks again, Martin

ADD REPLY
0
Entering edit mode

hi Martin

thank you for putting it on github, that's something I should have done a long time ago. and thank you for the updates as this post is more than one year old, I would expect to see some differences with what I posted at that time. clinical data are/should be updated and that will change, slightly, the results. Seb

ADD REPLY
0
Entering edit mode

Hello, Thanks for putting the code in github. I have seen the just now. But Could you please tell how to use this code for Disease free survival plot? Thank you

ADD REPLY
1
Entering edit mode

that is pretty simple, use disease free survival data, generally as DFS in the clinical data. then there should be a column saying "tumor free/with tumor".

ADD REPLY
0
Entering edit mode

Thankyou Tris. I have an other question. How to check the survival for High and Low expression of a gene? I have gene expression data of TP53 and divided the samples into two groups i.e. High expression and low expression. So, with this I want to see the survival b/w High and low. With your above code instead of normal and tumor, should I take High and low expression samples? Am I right?

ADD REPLY
0
Entering edit mode

you can assign groups to patients that have high/low expression, i.e. high=1, low=2, middle=0, remove the 0s and run survival as above.

as warm suggestion, look a lil more into how to run survival analyses in R, don't just copy and paste the code above without knowing what's doing what. that will be very helpful in the long (and short) run

ADD REPLY
1
Entering edit mode

@TriS thank you for this useful tutorial !

I think it would be better to change the lg2 function to:

lg2 <- function(x){
  x <- as.matrix(x)
  x <- t(apply(x,1,as.numeric))
  res <- log2((x+1))
  return(res)
}
ADD REPLY
0
Entering edit mode

@al3n70rn - thanks for the feedback, I'll change that. I was also looking into the VST transformation from DESeq2 which might even be a better choice. I'll add that asap

ADD REPLY
0
Entering edit mode

@TriS - Do you have a github account? It might be a good idea to post this script there so people can easily download and make code change suggestions via Github.

ADD REPLY
0
Entering edit mode

@alolex - that's a good idea ,thanks for the suggestion. I'll update the code this evening and then upload it on github as soon as I understand how that works.

ADD REPLY
1
Entering edit mode

I would like to know how can extend this code for survival analysis of multiple genes altogether or cancer subtypes ? I want to check effect of set of genes/ subtypes on survival probability through KM plot?

ADD REPLY
1
Entering edit mode

you should create a dummy variable in the survival analysis where you select patients with the gene signature altered

i.e. for genes A, B and C, you have a matrix m with row=genes, cols=patients:

gene_sig <- c(A,B,C)
rna_data_signature <- rna_data[gene_sig,]
event_rna <- apply(rna_data, 2, function(x) ifelse(sum(abs(x) > 1.96 >= 1),1,0))

not tested but it should work. the idea is to use apply to select, by column, patients that have at least one gene in the signature altered

you can than use the event_rna vector in the survival analysis

ADD REPLY
0
Entering edit mode

How can we make non-altered case since we need two groups -altered and unaltered for generating KM plot? for instance you take those genes which are altered at least once in all the patients then how can we make unaltered case?

ADD REPLY
0
Entering edit mode

if you look at the code I put in the answer, the patients with at least one alteration will be assigned a 1, if they have no alterations they will be assigned a zero. it basically checks column by column (= patient by patient) how many of those genes are altered, if the # of altered genes is > 1, then they enter the "altered" group.

ADD REPLY
1
Entering edit mode

Great tutorial, Thanks!

I had a problem when using the TCGA BRCA data, the rnal_log is not being created. So downstream analysis is not performed. It seems the n_index and t_index are formed and also the voom function is fine.

Thanks a lot for your help!

ADD REPLY
0
Entering edit mode

that is actually a typo on my side, my apologies, it's the rna_vm, not rna_log, that's probably leftover from the first version i wrote, without adding the latest edits.

thanks for catching it, i'm gonna fix it now

ADD REPLY
0
Entering edit mode

Great ! it worked now. Thanks a lot.

ADD REPLY
1
Entering edit mode

Hi, TriS. Your tutorial is very helpful.

I have done my own survival analysis using clinical file before merged and cleaned, but actually, the merge clinical file downloaded from the FireBrowse is more concise. Here I list two questions:

  1. "days_to_last_followup", you said that get the minmum because the it's the latest number.But the definition of "days_to_last_followup" is time interval from the date of last followup to the date of initial pathologic diagnosis, which means the the longer days represent latest followups, so the days should take the maximum number.

  2. to plot DFS survival curve, a vector contains censored or completed status should be created. In the merge clinical file, the column "patient.new_tumor_events.new_tumor_event_after_initial_treatment" defines the new tumor event status "yes", "no", or "NA".How to understand "NA" status? Does it mean the patient haven't occur new tumor event at the time of the last followup or means haven't survey the new tumor event status at the time of the last followup? The variable "all_clin$new_time" in your code shows you have regarded the "NA" as "haven't occur new tumor event"(the same as status "no").Is that Right?

Thanks.

ADD REPLY
0
Entering edit mode

hi cying, you made very good points and thank you for posting them. 1. yes, you are right! I overlooked the definition of "days to last followup" and you are correct, you should use the highest one. I will change the code accordingly 2. the NA in all_clin$new_time were counted as event that didn't happen, in this case no recurrence happened. to be honest I'm not sure about the difference between "no" and "NA" in "new_tumor_event_after_initial_treatement". i.e. there is an xls file for melanoma enrollment that annotates the field as "Indicate whether the patient had a new tumor event (e.g. metastatic, recurrent, or new primary tumor) after the date of initial melanoma diagnosis. If the tumor that yielded the sample submitted for TCGA was a new tumor event (i.e. it was diagnosed after the initial diagnosis), the remaining questions should not be completed. Please Note:If the patient did not have a new tumor event, the remaining questions can be skipped." therefore when they annotate for no/NA they might indicate the same thing, and it actually depends from who annotates the data. hope this helps!

ADD REPLY
0
Entering edit mode

That's a good point, maybe the the blank space of "new_tumor_event_after_initial_treatement" column is a hint of no new tumor events. Thanks!

ADD REPLY
1
Entering edit mode

link to the Firehouse should be updated

from

http://gdac.broadinstitute.org/)

to

http://gdac.broadinstitute.org/

excluding the closing paranthesis.

ADD REPLY
1
Entering edit mode

1) would be possible to get also the code which generates the DFS plot also?

2) is there somewhere some cleaned up and up to date version of this code?

3) when running this code (and fixing myself some bugs which still has like for example adding "clinical = data.frame(clinical)" ) the KIRK-OS plot looks slightly different in the sense that I get NotAltered = NA and also I get a lot of warnings when computing all_clin$new_time ,like for example, " In ifelseis.na(as.numeric(as.character(all_clin$new_tumor_days))[i]), ... : NAs introduced by coercion"). Otherwise the plot looks similar. Is this normal?

ADD REPLY
1
Entering edit mode

1) for the DFS plot you can use the DFS data in the clinical file. 2) no, the version here is what I had the time to create 3) you will get some warnings, if your code is correct these warnings come from the data and it's ok, some will be NA

ADD REPLY
1
Entering edit mode

thanks so much for the step-by-step tutorial, it has been very helpful for a biologist who is new to R and survival analysis.

I noticed that I am getting the following errors that don't allow for survfit() to function properly:

Error #1:

all_clin$new_death <- c() for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){ + all_clin$new_death[i] <- ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]), + as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i]) + }

Which gives me the following error:

In ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]),  ... :

NAs introduced by coercion

I read above that NAs shoudn't be a problem with this step, so I continued. However, when I get to the survfit function I get the second error

Error #2:

> s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])

Error in Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], : Invalid status value, must be logical or numeric

Any help with this would be greatly appreciate, and I am sorry in advance if I don't understand your response right away, as I said, I am very new to R and survival analysis.

Thanks again for the great tutorial!

JP

ADD REPLY
1
Entering edit mode

I'm working on miRNA data analysis from GDC and getting error given below

vm <- function(x){
+     cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
+     d <- model.matrix(~1+cond)
+     x <- t(apply(x,1,as.numeric))
+     ex <- voom(x,d,plot=F)
+     return(ex$E)
+ }
> rna_vm  <- vm(rna)

Show Traceback

Rerun with Debug

Error in voom(x, d, plot = F) :

Need at least two genes to fit a mean-variance trend

Thanks

ADD REPLY
1
Entering edit mode

Hey, I'm getting error in this script

if (x1 != 'NA' &amp; x2 != 'NA'){
  &nbsp; lines(c(0,x1),c(0.5,0.5),col='blue')
  &nbsp; lines(c(x1,x1),c(0,0.5),col='black')
  &nbsp; lines(c(x2,x2),c(0,0.5),col='red')
}
  

Getting such error in my R console

> if (x1 != 'NA' &amp ; x2 != 'NA'){
Error: unexpected ';' in "if (x1 != 'NA' &amp ;"
> & nbsp; lines(c(0,x1),c(0.5,0.5),col='blue')
Error: unexpected '&' in "&"
>   &nbsp; lines(c(x1,x1),c(0,0.5),col='black')
Error: unexpected '&' in "  &"
>   &nbsp; lines(c(x2,x2),c(0,0.5),col='red')
Error: unexpected '&' in "  &"
> }
Error: unexpected '}' in "}"

Please suggest how to solve this

ADD REPLY
2
Entering edit mode

You have character references in your code, rather than the characters themselves. & is the character code for & and   is the code for a non-breakable space (I had to put spaces in).

Try:

if (x1 != 'NA' & x2 != 'NA'){
    lines(c(0,x1),c(0.5,0.5),col='blue')
    lines(c(x1,x1),c(0,0.5),col='black')
    lines(c(x2,x2),c(0,0.5),col='red')
}
ADD REPLY
1
Entering edit mode

Why do we need to emove genes whose expression is == 0 in more than 50% of the samples at the beginning?

ADD REPLY
0
Entering edit mode

it's good practice to remove genes that have expression = 0 or very low expression in most samples. the % is up to you, this is an example, and I used 50%

ADD REPLY
1
Entering edit mode

this post was cited in :

Machine learning and Bioinformatics Models to Identify Gene Expression Patterns of Ovarian Cancer Associated with Disease Progression and Mortality

https://doi.org/10.1016/j.jbi.2019.103313

ADD REPLY
0
Entering edit mode

I tried running this with the BRCA data and get the following error on reading in the merged clinical data:

clinical <- t(read.table('Clinical/BRCA.merged_only_clinical_clin_format.txt',header=T,row.names=1,sep='\t'))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
    line 2 did not have 1099 elements
ADD REPLY
3
Entering edit mode

add

quote=""

to the read.table() function, that should fix it

ADD REPLY
0
Entering edit mode

Hi!

One last question: You wrote that 'you can divide the data into up- and down- regulated too if you wish';

I guess i have to change this line, but I'm not sure how to do that correctly:

event_rna <- t(apply(z_rna, 1, function(x) ifelse(abs(x) > 1.96,1,0)))

Thanks for help again!!

ADD REPLY
1
Entering edit mode
event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > 1.96,1,ifelse(x < -1.96,2,0))))
ADD REPLY
0
Entering edit mode

Thank you so much for a great tutorial.

Sorry if these queries are a bit basic - probably need some more R learning!

Firstly, could I please ask how we would go about limiting this to, for example, five year overall survival?

Finally, how would I go about limiting the results to say just those who had radiotherapy (patient.radiation_therapy = yes)

Thanks again!

ADD REPLY
0
Entering edit mode

there are a couple of things you could do.

  1. keep the survival/follow up time only up to 5 years -> 5(yrs)*365(days)=1825(days)

  2. treatment data should be in the merged survival file, if not they are in one of the other files you can download. you need to create a dummy variable with therapy_yes=1 therapy_no=0 and select out only the patients with a 1..there are a million ways of selecting them out, this is just one of them.

  3. last but most important, you might want to look up some basic survival analysis in R and how to extract subsets of data. this sounds basic but it's vital since it's what allows you to do things correctly and what you will refer to when you want to do a lil trickier analysis like the one you are asking

ADD REPLY
0
Entering edit mode

Great tutorial! I am curious about the number of the altered and not-altered group, since the number of patients are 529, why the altered group are so large?

ADD REPLY
0
Entering edit mode

I suppose you refer to the numbers in the legend. those are not the number of patients but the median survival in days in the altered and not altered groups. to get the months you can do (ndays/365)*12

ADD REPLY
0
Entering edit mode

Hi all, great tutorial.

I have two questions:

1) When calculating p-values I get this error, similar as above:

pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]    #error
Error: unexpected symbol in 'pv <- if is.na'

2) Is there a method to find the relevant genes that are significant?

Best wishes,

R

ADD REPLY
0
Entering edit mode
  1. When calculating p-values I get this error, similar as above:

    pv <- if is.na(s1)) next else(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7))[[1]]     #error
    Error: unexpected symbol in "pv <- if is.na"
    

    your parenthesis are off, try this (not tested):

    pv <- ifelseis.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7)))[[1]]
    
  2. Is there a method to find the relevant genes that are significant?

    you mean differentially expressed genes? you might want to read a lil about z-scores and differential gene expression analysis or limma

ADD REPLY
0
Entering edit mode

Is it possible to run through all my rownames and print all corresponding plots and p-values:

ind_gene <- which(rownames(z_rna) == "TP53")
ADD REPLY
0
Entering edit mode

yes, using for loops and even writing it as a function :)

however, you might not want to print 15k to 30k plots, you might want to find a way to keep those you are interested in

ADD REPLY
0
Entering edit mode

Would you mind helping, I tried but got errors

ADD REPLY
2
Entering edit mode

I'm going to give you a pseudocode that you can apply to your analysis:

for 1 to # of genes
  run survival analysis
  extract p.value
  if p.value is significant & some other conditions
    plot the graph

...learning how to code does come with LOTS of trial and error, it's good that you got errors cause it can show you what you did wrong.

also, your question then becomes more general on the lines of 'how to apply a function to multiple genes', which can be 1) googled 2) posted on the general forum since it's not specific anymore to this tutorial

ADD REPLY
0
Entering edit mode

TriS Thank you so much for this tutorial. I found it very helpful.

I am interested in comparing the patients with overexpression of a gene and underexpression of that gene. To this end, I want to compare survival of patients with Z-score in the range <-1.96, versus [-1,1] versus > 1.96.

Is it correct to do the following?

event_rna <- t(apply(z_rna, 0, function(x) ifelse(x > 1.96,2,ifelse(x < -1.96,1,0)))) # for up and down regulated
ADD REPLY
1
Entering edit mode

yes, that should work, however, the apply function you wrote has an error/typo, you need to use "1" to apply the function row by row

event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > 1.96,2,ifelse(x < -1.96,1,0))))
ADD REPLY
0
Entering edit mode

@TriS Thank you so much!

ADD REPLY
0
Entering edit mode

Deleted the comment and instead put it as a new comment

ADD REPLY
0
Entering edit mode

When I use the GBM data and the TP53, I get an error in the vm function (at the

d <- model.matrix(~1+cond)

step).

The error message is

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    contrasts can be applied only to factors with 2 or more levels
Called from: `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])

Any idea how to fix this?

Thanks

ADD REPLY
0
Entering edit mode

can you check that the n_index and t_index are not empty? also that cond does contain data?

ADD REPLY
0
Entering edit mode

This is most likely happening due to there being no normal samples in the dataset, so n_index is empty.

I don't really know enough about model.matrix to know if this is the correct approach, but by not computing the model.matrix and changing the following line:

ex <- voom(x,d,plot=F)

to:

ex <- voom(x,plot=F)

the voom function works. However the voom function states that by leaving out a design, the function "defaults to the unit vector meaning the samples are treated as replicates" and this may not be true.

ADD REPLY
0
Entering edit mode

The histogram of z-scores is not quite symmetric, it tends to have a heavier tail on the lower end (i.e. there are more genes below summit) so it does not look like that > 1.96 is equivalent to p=0.05 (my guess is that it is much lower than that. Similarly, I don't think that the Z < -1.96 is p=0.05 either (my guess is that it is much higher than that). Any suggestions how to deal with such asymmetric distribution? My inclination would be to use percentiles i.e. top 10% of the data versus bottom 10% of the data. However, I am not sure how to modify the code to do that. Any suggestions would be much appreciated.

I think the change has to be something like below but not sure if it is correct:

pr_thresh=quantile(z_rna, c(0.1, 0.9))

LOWTHRESH=pr_thresh[1]
HIGHTHRESH=pr_thresh[2]

event_rna <- t(apply(z_rna, 1, function(x) ifelse(x > HIGHTHRESH,2,ifelse(x <- LOWTHRESH,1,0))))

Any thoughts on the use of percentiles and/or on the implementation?

Thanks

ADD REPLY
1
Entering edit mode

The data won't always be perfectly symmetric, you can cut your data so that you have the same number of patients in each side in order to balance it out when doing survival analysis. however, when we are talking about hundreds of patients, small differences in n do not cause issues. some statisticians here use even z=+/-1.5 as long as it gives them symmetric/even groups.

The 1.96 threshold is not a must, cBioportal uses 2, you can use what you prefer and what fits your needs as long as there is a rationale behind it. also, the z scores here are calculated comparing T to N so you won't perfectly center the data around 0 as if you would scale the data within the same sample group.

For percentile you can use the top 10/20 etc...I'm not aware of a hard threshold, the important thing, as above, is that you have an explanation of why you pick that # (although rarely you read one in papers :)).

Lastly, if you are worried about data being skewed, you can also calculate skewness of the data in R and evaluate in in that way. a value +/- 3 is a no go...but I don't think you will get anywhere close to it with the Z scores from this tutorial

Hope this helps!

ADD REPLY
0
Entering edit mode

@TriS, Thank you for the explanation!

ADD REPLY
0
Entering edit mode

hi, I'm wondering if you used 10% threshold above, how to plot survival curve with only those two lines (> HIGHTHRESH vs < -LOWTHRESH). thanks

ADD REPLY
0
Entering edit mode

I am interested in looking only upto 5 years (1825 days).

I'm new to survival analysis.

Is it correct to discard all the samples whose max(days_to_last_known_alive, days_to_last_followup) is <=1825

by doing this:

all_clin=all_clin[all_clin$new_death <= 1825,]

Is this correct? (I have a feeling that it is not because I am simply throwing away patient data who have lived beyond 5 years). If the above approach is not correct, can you please tell me what is the right way to do it?

Thanks

ADD REPLY
0
Entering edit mode

code seems fine to me but I'm a lil confused...you want to look up to 5 years but then you don't want to throw away patients that have lived beyond 5 years...if you decide a threshold that'll be applied to both living and deceased patients and you will get a snapshot of what happened up to that point in time.

as a general suggestion, look up some stuff on survival analysis and data subsetting. although I'm glad to see that this tutorial helps, it gives the basis on how to do it and it can be much more useful when you can grasp the concepts behind censored survival analysis and how/why to do it and tweak it to your needs

ADD REPLY
0
Entering edit mode

Hi thanks for the great tutorial. Was there a particular reason you used the normalized gene count as opposed to the raw or scaled estimate? thank you!

ADD REPLY
0
Entering edit mode

yes and no. those data were available from Firehose but you can apply the same pipeline to other measurements i.e. raw counts

ADD REPLY
0
Entering edit mode

Thank you - reason why I ask is because I'm wondering what the reasoning behind using the the voom transformation? I would assume since the data is "normalized" wouldn't we be able to bypass this step? Sorry I'm just trying to understand this a bit more so I'm wondering if you can elaborate on this a bit more. thanks

ADD REPLY
0
Entering edit mode

No problem, asking is good. You could use other transformations too. voom transformation is very practical and better than i.e. just taking the log2. here and here there are a few info about why voom is a good method + you can also use it with RPKM, which were the data available from TCGA. however, if you had raw counts you could have used DESEq2 to transform the data, for the sake of this tutorial and this analysis I don't think it will make a big difference (note: this should be tested!!). I do prefer to normalize because of the skewenessd of the RNASeq/RPKM data which could inflate the actual differences between the different samples/genes.

ADD REPLY
0
Entering edit mode

you rock! thank you - btw you should put this on github.

ADD REPLY
0
Entering edit mode

I am very new to this kind of analysis and try to do my first using your great tutorial. I took a look at this link which I think is the paper introducing voom. It says that this method can be used with RPKM values, but RNAseq data you provided in this tutorial is normalized with RSEM method. Does this make any difference?

ADD REPLY
0
Entering edit mode

Hi quick question. Do you think it is better to eliminate the normal samples all together in the z-score calculations?

ADD REPLY
0
Entering edit mode

This tutorial is really helpful to me. Can you give some scripts on how to perform comparison like "Up-regulated" vs "Down-regulated"? It would be better to add plot and legend functions. I tried to mask all event_rna==0 as NA (event_rna was coded same as in your script) and then remove them. But I found most genes were removed, including my interested gene.

Thanks

ADD REPLY
2
Entering edit mode

You need a bit modification in above script:

library(survival)

z_rna (expression normalized data)

clinical data

then you have to calculate median of your gene

median(z_rna$yourgene)
rna_event <- ifelse(z_rna$yourgene >= (median), 1,0)     ## 1,0 ;  up & downregulated class

then follow similar above code...

ADD REPLY
0
Entering edit mode

Hi Mike,

I want to look the survival of a specific gene between high and low expression tumor samples. For this first I have to divide tumor samples into high and low expression samples.

Can I apply the code you gave above or it is different from that?

Thank you

ADD REPLY
0
Entering edit mode

Hi, I have a quick question about z-score you used. "mean gene X in normal" was used as "mean expression in diploid samples" in your script. Then should I only inquire paired tumor and normal samples in the TCGA data portal?

Thanks

ADD REPLY
0
Entering edit mode

Hi, thank you very much for your tutorials I have a quick question. I want to divide the tumor patients into two groups based on their mRNA expression (X gene) and then evaluate whether X gene mRNA expression level has any correlation with survival. Would you kindly tell me how to write the script? Thanks a lot : )

ADD REPLY
0
Entering edit mode

See my question above. Mike gave the exact answer and related script.

ADD REPLY
0
Entering edit mode

The error message is

clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)

Error in clinical$patient.bcr_patient_barcode : 
  $ operator is invalid for atomic vectors
ADD REPLY
0
Entering edit mode

Hey guy, would you please tell me how to solve it, thanks!

ADD REPLY
1
Entering edit mode

check

class(clinical)

if it's a matrix, transform it into a data.frame

ADD REPLY
0
Entering edit mode

Hi, Should I make any changes to the above script? because I am just interested in the cancer patients but not the normal controls? Thanks!

ADD REPLY
0
Entering edit mode

For this you need to further filter out samples from TCGA barcode , tumor types range from 01 - 09, normal types from 10 - 19 Control samples from 20 - 29, https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

ADD REPLY
0
Entering edit mode

Hi, Thank you very much for your help.Would you kindly tell me why the tumor types in this script are marked as 14? They should be 01-09. Thanks!

ADD REPLY
0
Entering edit mode

Hi zsucllj,

in this script 14 is not tumor types/sample, this is the two digits at position 14-15 of the barcode (eg. TCGA-02-0001-01 , including "-"),

plase have a look at figure & table (https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data)

ADD REPLY
0
Entering edit mode

if you are replying to the same question please use the "add comment" link :)

ADD REPLY
0
Entering edit mode

sorry TriS, I dont know whats different b/w "add comment" and "add reply". if my comments are not according to proper format , please delete them.

ADD REPLY
0
Entering edit mode

Hi, thank you very much for an informative tutorial! I apologize if this is an overly naive question, but I was wondering what new things could be learned from conducting your own survival analysis of TCGA data like in this tutorial when on Firehose there are already analyses of nearly every TCGA cancer data set including correlations between mRNAseq data and survival rates in their "Clinical Analysis" pages. Thanks!

ADD REPLY
0
Entering edit mode

When I look at correlations to survival with Firehose I find that they often list 0 genes correlating with survival. In contrast to their analysis, I have found thousands of genes that have significant p-values.

ADD REPLY
0
Entering edit mode

Hi,

Great tutorial and thank you very much!

I have some further comments and questions:

  1. If someone is willing to get another/additional data from other sources, it is very important that the patients in new_death, death_event and in the rna event will be in the same order in all of them (in this tutorial its unnecessary since all the data is sorted by the same way in gdac).
  2. How can it be that some patients have a value which is not NA in the death_days data, but in patient.vital_status is "alive" (e.g TCGA-A3-3347, or TCGA-A3-3325), in both cases patient.follow_ups.follow_up.vital_status shows "dead" but patient.vital_status shows "alive", however it seems to me as a bug in the updated data uploaded to gdac. What do you think? Do you understand what is the difference between patient.vital_status to patient.follow_ups.follow_up.vital_status? If you will agree that this is kind of a bug, I suggest the following correction:

    all_clin$death_event <- ifelse(all_clin$death_days =="NA" , 0,1)
    

    instead of:

    all_clin$death_event <- ifelse(clinical$patient.vital_status == "alive", 0,1)
    
  3. Just to make sure -
    1. the reason that you got NA in the median survival is because that most of them are still alive?
    2. Is the figure is up to date? because things are a bit difference (mean survival for altered is now 3554, for not altered is the same, the gdac data was upadated, do you believe that is the case?).
ADD REPLY
0
Entering edit mode

I program mainly in Python so I haven't tried running the code in this thread, but a patient vital status issue was raised recently for OncoLnc: https://github.com/OmnesRes/onco_lnc/issues/1

In this repository you will find Python code for extracting clinical data from files downloaded from the old TCGA data portal which were named "clinical_follow_up" and "clinical_patient".

For OncoLnc I had to parse the clinical data for 21 different cancers, and most of the time in the files when a patient is 'dead' they have '[Not Applicable]' written in the 'last_contact_days_to' column, but in some cancers and for some patients they list a number in that column even when a patient is dead.

Given that this issue was most prevalent for GBM and OV and I believe those are two of the earlier studies performed by TCGA I can't help but wonder if TCGA changed how they report the follow_up times at some point and this resulted in an inconsistent file format.

To answer your point 3 I did notice that I would get 'NA' for median survival when I had more patients listed as 'alive' and once I fixed the code I had much fewer 'NA' values for median survival.

ADD REPLY
0
Entering edit mode

Hello,

I have a question on the legend. I used this script in similar analysis except that I have a status column indication "altered" and "not altered". I was wondering if I use

# plot the data
 plot(survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum]),
 col=c(1:3), frame=F, lwd=2,main=paste("KIRK",rownames(z_rna)[ind_gene],sep="\n"))

that means the color of the graph will be black for "altered" and red for "not altered" (as per order of the factors). But doesn't manually adding the legend like this

# add legend
legend=c(paste("NotAltered=",x1),paste("Altered=",x2)),bty="n",cex=1.3,lwd=3,col=c("black","red"))

swaps the legends of the two graphs? Or maybe I am missing something here.

Greatly appreciate help!

Thanks

ADD REPLY
0
Entering edit mode

I feel something is wrong in your codes:

ind_tum <- which(unique(colnames(z_rna)) %in% rownames(all_clin))
ind_clin <- which(rownames(all_clin) %in% colnames(z_rna))

The patient's barcodes are not matched using the two ids:

rownames(all_clin)[ind_clin]
colnames(event_rna)[ind_tum]

Both all_clin and event_rna have duplicated samples, so a solution about the above problem is using the following codes:

all_clin <- all_clin[!duplicated(rownames(all_clin)), ]  # remove duplicated samples in all_clin
event_rna <- event_rna[, !duplicated(colnames(event_rna))] # remove duplicated samples in event_rna
commonId <- intersect(colnames(event_rna), rownames(all_clin))  # obtain the common samples
ind_tum <- which(colnames(event_rna) %in% commonId) 
ind_clin <- which(rownames(all_clin) %in% commonId)
ADD REPLY
0
Entering edit mode

lazymaggie, thank you for pointing this out. I will check the code and edit where necessary

ADD REPLY
0
Entering edit mode

I had a similar issue. Please check the names of your samples (colnames). There is a duplication there, due to very very similar names.

ADD REPLY
0
Entering edit mode

Thanks for writing this up.

A question about the input data however. I thought that illuminahiseq_rnaseqv2-RSEM_genes_normalized was not RPKM (or more correctly FPKM for paired end data), but actually a quantile normalised raw count provided by RSEM (see here https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2).

It would be important not to confuse the two.

ADD REPLY
0
Entering edit mode

it doesn't really matter in this case since they'll be transformed to zscores

ADD REPLY
0
Entering edit mode

So which one is it?

Maybe it doesn't make a difference for your workflow, but it may do for others. For example, using voom on quantile normalised data may not be optimal. Also, if someone was performing downstream analyses that aggregated data across genes (eg gene set enrichment), you wouldn't want them to think they were using RPKM when they were actually using counts.

It is worth noting for the many people (including students) reading this thread who may go on to perform a variety of another analyses. Please consider correcting your description of the input data (if there is an error?).

ADD REPLY
0
Entering edit mode

firebrowse (the file I suggested) provides you with RSEM values, not FPKM, not RPKM. for data normalization there are multiple ways of normalizing them and it depends from what you want to do. for the scope of this tutorial voom works well. if you prefer other methods DESeq2 or edgeR offer different flavors that might better fit your needs. also, bare in mind that what I described is a guide, not the only way of doing it.

ADD REPLY
0
Entering edit mode

Are you sure? The file you suggested has 'normalized count' written in the header. RSEM gives transcripts per million (TPM) by default, which is not the same as a normalized count

ADD REPLY
0
Entering edit mode

so, the normalized values are RSEM data divided by the 75th percentile and multiplied by 1000, as you correctly pointed out in the fist comment. to get TPM you have to multiply the RSEM estimate for 10^6 (RSEM paper) in this case, those are normalized RSEM values, not TPM, not RPKM, not FPKM. hope this helps

ADD REPLY
0
Entering edit mode

Hi, thanks again for a great tutorial. This might be a very basic question, but I was wondering if anyone could tell me what the numbers for Altered and NotAltered exactly refer to? Thanks!

ADD REPLY
0
Entering edit mode

number of samples with gene expression higher/lower than a specific threshold (i.e. z-score of 2) -> altered opposite -> not altered

ADD REPLY
0
Entering edit mode

Hi, in regard to the "days_to_new_tumor_event_after_initial_treatment", I think the smaller value was more reasonable than the higher one. The DFS was defined as the time from initial treatment until the date of new tumor event. And the first time of new tumor event should be selected instead of the last time. This was opposite to the "days of last follow-up".

ADD REPLY
0
Entering edit mode

hi cying, thank you for spotting that, you are correct. this code went through several updates, and that's one of those I didn't implement yet.

ADD REPLY
0
Entering edit mode

Great post. I have several questions/feedback:

  1. Why did you use KIRC.merged_only_clinical_clin_format.txt instead of KIRC.clin.merged.txt?
  2. There is a Clinical_Pick_Tier1 file when you go to FireBrowse > KIRC > Clinical. Do you know what it is and how it is different than the one you used? I just checked the final numbers for new_death and they correspond to days_to_death (for event dead) and days_to_last_followup (for event alive) combined in the Clinical_Pick_Tier1 file. If you use that file, you might save all the processing that you are doing to get the numbers for new_death.
  3. Although the numbers for survival match, your death_event is different than those given in the file (given as vital_status). I don't know how that file was processed so cannot comment on who is incorrect but wanted to put it out there.
  4. Why are you using min for new_tum_collapsed instead of max when you just want the highest value?
  5. For each of the collapse methods, I had to convert the columns to numeric for e.g. fl <- apply(fl, 2, as.numeric) otherwise I had character and factors which messed up the min/max values.

Thanks!

ADD REPLY
0
Entering edit mode
  1. you can use either, I just picked one. this tutorial is a guideline, not a rule
  2. I didn't know that was available at the time of this tutorial, thank you for bringing that up, it is very useful
  3. this post is almost a year and a half old, I would expect some differences in the events
  4. because I took the time at which the first tumor re-appeared, not the last one
  5. ok
ADD REPLY
0
Entering edit mode

Hi, thanks for your great tutorial. Could you help me a question about survival analysis.In clinical, i want to updata survival function in a dynamic way but lack of multiple gene values to fit such a longitudinal model,so i want to use landmark approach to update the prognosis.However, i am not sure whether a genomic variable is a time-varying effect covariate,which means the effect of the gene might vary with time. Thank you for your patience,and hope that you could share me more detailes.

ADD REPLY
0
Entering edit mode

this question goes beyond what's described in this tutorial, you should start a new question in Biostar for that.

ADD REPLY
0
Entering edit mode

Hi, Tris

Thanks for the great tutorial. it is exactly what I want to do on colorectal cancer datasets. I'm an extreme newbie to bioinformatics, being able to run it on my dataset would be something big already. but I seem to have some difficulties in plotting median survival, when I run it on either your original dataset or the colorectal cancer dataset it gives me error message as follow:

> x1 <- ifelseis.na(as.numeric(summary(s)$table[,'median'][1])),"NA",as.numeric(summary(s)$table[,'median'][1]))
> x2 <- as.numeric(summary(s)$table[,'median'][2])
> if(x1 != "NA" & x2 != "NA"){
+   lines(c(0,x1),c(0.5,0.5),col="blue")
+   lines(c(x1,x1),c(0,0.5),col="black")
+   lines(c(x2,x2),c(0,0.5),col="red")
+ }
Error in if (x1 != "NA" & x2 != "NA") { : 
  missing value where TRUE/FALSE needed

Would you mind help me to look into this, if you get a chance? I truly have no idea of how to correct this. Very much appreciated

Cheers,

ADD REPLY
0
Entering edit mode

Did you figure out how to do this ? I am getting exactly same problem as well !!

ADD REPLY
1
Entering edit mode

Check the values of x1 and x2 to figure out if those are as expected, something is wrong in those.

ADD REPLY
0
Entering edit mode
clinical <- t(read.table("Clinical/BRCA.merged_only_clinical_clin_format.txt", header = T, row.names = 1, sep = "\t"))

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :

line 1141 did not have 1098 elements

What should I do for the above error?

ADD REPLY
1
Entering edit mode

I moved your post to a comment (because it is no answer to the original thread).

Your 1141th line doesn't have the same number of elements (columns) as the rest of the file, so you should have a look at that line and see if something is odd.

awk 'NR==1141' yourfile.txt

I'm not sure if the header is counted as a line by R, so perhaps you should also check the next line:

awk 'NR==1142' yourfile.txt

and compare

Or is 1141/1142 the last line in your file containing whitespaces?
Have a look at the number of lines:

wc -l yourfile.txt
ADD REPLY
0
Entering edit mode

Just define quote

clinical <- t(read.table('Clinical/BRCA.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t',quote=""))

ADD REPLY
0
Entering edit mode

thanks so much for the step-by-step tutorial, it has been very helpful for a biologist who is new to R and survival analysis.

I noticed that I am getting the following errors that don't allow for survfit() to function properly:

Error #1:

all_clin$new_death <- c() for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){ + all_clin$new_death[i] <- ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]), + as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i]) + }

Which gives me the following error:

In ifelseis.na(as.numeric(as.character(all_clin$death_days))[i]),  ... :

NAs introduced by coercion

I read above that NAs shoudn't be a problem with this step, so I continued. However, when I get to the survfit function I get the second error

Error #2:

> s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])

Error in Surv(as.numeric(as.character(all_clin$new_death))[ind_clin], : Invalid status value, must be logical or numeric

Any help with this would be greatly appreciate, and I am sorry in advance if I don't understand your response right away, as I said, I am very new to R and survival analysis.

Thanks again for the great tutorial!

JP

PS- I am sorry if this comes up twice, I am new to biostars and still don't know the difference between answer/reply/comment.

ADD REPLY
0
Entering edit mode

I am not able to open this website, can you please provide me an alternative to download this data set. need it urgently. thanks in advance.

ADD REPLY
0
Entering edit mode

which website? firehose?

ADD REPLY
1
Entering edit mode

In your post (as remarked C: Survival analysis of TCGA patients integrating gene expression (RNASeq) data ) the closing parenthesis got included in the URL, leading to a 404. I have edited your post (added spaces) to solve this.

ADD REPLY
0
Entering edit mode

thank you I appreciate it, i didn't realize the parenthesis was included

ADD REPLY
0
Entering edit mode

Hi! Thanks for the tutorial! I'm new to the word of Bioinformatics and trying to get my head around everything.

Is there any documentation about what the different files on Firebrowse (http://gdac.broadinstitute.org/ ) contain? The tutorial links to a dead link of a wiki: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode ). Does anyone know where the information from this wiki went?

I'm looking into potentially using one of those data sets for my masters thesis (I study computer science), but I'm a bit overwhelmed by all the files and little documentation.

Thanks a lot! Laura

ADD REPLY
0
Entering edit mode

the link was fixed, try using it now. basically those are information about which samples are normal or tumor

ADD REPLY
0
Entering edit mode

sorry moved to answers

ADD REPLY
0
Entering edit mode
   > vm <- function(x){
   +   cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
    +  d <- model.matrix(~1+cond)
    +   x <- t(apply(x,1,as.numeric))
     +   ex <- voom(x,d,plot=F)
     +   return(ex$E)
   + }
  > rna_vm  <- vm(rna)

Error in if (any(allzero)) { : missing value where TRUE/FALSE needed In addition: Warning message: In apply(x, 1, as.numeric) : NAs introduced by coercion

I am getting above error while trying with Breast Cancer data set. Can you please help me in this ?

Thanks

ADD REPLY
0
Entering edit mode

I added the wrong comment

ADD REPLY
0
Entering edit mode
> for (i in 1:dim(new_tum)[1]){
+     ifsumis.na(new_tum[i,])) < dim(new_tum)[2]){
Error: unexpected ')' in:
"for (i in 1:dim(new_tum)[1]){
    ifsumis.na(new_tum[i,]))"
>         m <- min(new_tum[i,],na.rm=T)
Error: object 'i' not found
>         new_tum_collapsed <- c(new_tum_collapsed,m)
Error: object 'm' not found
>     } else {
Error: unexpected '}' in "    }"
>         new_tum_collapsed <- c(new_tum_collapsed,'NA')
>     }
Error: unexpected '}' in "    }"
> }
Error: unexpected '}' in "}"
 can you help why?
ADD REPLY
1
Entering edit mode
 for (i in 1:dim(new_tum)[1]){
   if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
     m <- min(new_tum[i,],na.rm=T)
     new_tum_collapsed <- c(new_tum_collapsed,m)
   } else {
     new_tum_collapsed <- c(new_tum_collapsed,'NA')
   }
 }

You should open the parenthesis "(" before sum and is.na

ADD REPLY
0
Entering edit mode

As noted here, Could someone with edit priviledges change all the ifsumis.na to ifsumis.na, please

ADD REPLY
1
Entering edit mode

i'll modify it now...I'm not sure why that happens. it seems that some parenthesis are cut off. my apologies for the inconvenient

ADD REPLY
0
Entering edit mode

in addition there is ifelse here

all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
  all_clin$new_time[i] <- ifelse( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                    as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}
ADD REPLY
0
Entering edit mode

Hi Could you please help understand why you are conditioning on normal samples when applying voom.

Is this step normalizing tumor gene expression based on expression of same gene in normal samples?

cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1, 0))

Thanks Adrian

ADD REPLY
0
Entering edit mode

the description is in the voom paper "voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Charity W Law, Yunshun Chen, Wei Shi and Gordon K Smyth"

for my understanding voom keeps in consideration the experimental design, i.e. treatments/replicates/etc, to weight variability across samples and conditions as a better way to estimate the mean-variance trend per each gene and therefore get a precise final value per each sample.

ADD REPLY
0
Entering edit mode
rna_vm  <- vm(rna)
 Show Traceback

 Rerun with Debug
 Error in voom(x, d, plot = F) : 
  Need at least two genes to fit a mean-variance trend

I am getting above error while running vm function. Can some one help me on this ?

ADD REPLY
0
Entering edit mode

this should be posted as a new question since it's not specific for this tutorial. anyhow..1) how does head(rna) look? 2) it seems you don't have a design matrix d 3) how many genes are in rna?

ADD REPLY
0
Entering edit mode

Hi Tris, do know whether will have your response so late. I am totally new in this and trying to use your codes to run the TCGA-LIHC dataset. Stuck here, could you kindly suggest the way to solve thanks a lot:

> clinical <- data.frame(clinical)
> clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
Error in `$<-.data.frame`(`*tmp*`, IDs, value = character(0)) : 
  replacement has 0 rows, data has 40
ADD REPLY
0
Entering edit mode

looks like you don't have a columns called

patient.bcr_patient_barcode

or it's empty, therefore you can't assign it to the variable you are creating.

ADD REPLY
0
Entering edit mode

I added the wrong comment

ADD REPLY
0
Entering edit mode

Hi TriS, As a new in this field I find your tutorial excellent. But I am having little trouble to execute the codes to analyze the TCGA data what you used. Can you please put all codes together with edited/updated version. Thanks Syed

ADD REPLY
2
Entering edit mode

this code is a guideline, not a copy and paste document. I'd suggest you to read a bit more about basic R usage and survival analysis before digging into this. it'll be time well spent

ADD REPLY
1
Entering edit mode

Perhaps you could mention where you encounter issues?

ADD REPLY
0
Entering edit mode

Hi Tris, thank you so much for this tutorial -- it is very informative and comprehensive. I do have one question -- how can this script be modified to just analyze survival differences based on differences in gene expression between cancer patients? This tutorial includes tumor versus matched-normal tissue. I guess I am confused about the purpose of this.

Your guidance would be much appreciated!

Thanks, F

ADD REPLY
0
Entering edit mode

you have to group patients using the parameters you chose (i.e. groups) instead of gene expression altered/not altered. I'd suggest you to read a bit more about basic survival analysis and how it works before digging into this one

ADD REPLY
0
Entering edit mode

Right, so I would want to separate patients into 2 groups (top X% of expression, bottom X% of expression) -- I'm relatively inexperienced with R. Is there a way to incorporate this into your script?

ADD REPLY
0
Entering edit mode

Hi Tris, I wonder if is it still correct if we apply this method for other distributions beside normal distribution?

ADD REPLY
0
Entering edit mode

Hi TriS, this is a great tutorial. I cloned the github repo and ran RNAseq-Survival-Analysis-TCGA-KIRC/survival_rnaseq_analysis.R (on a mac). However, I got survival p-value only 0.711 for TP53. Something is wrong, but how do I debug it? --K

ADD REPLY
0
Entering edit mode

How to survival analysis of two genes, such as TP53-PI3K+ ?

ADD REPLY
0
Entering edit mode

Did you ever figure this out? I am interested in this and can't find a lot of information on it.

ADD REPLY
0
Entering edit mode

you have to select patients based on the expression of two genes. - choose your two genes - classify patients based on the expression of both (i.e. tp53 > 2 & pi3k > 2) - that will be your altered class - the rest is not altered. or you can also count patients that are altered for one and not the other (note this is pseudocode):

group1 <- tp53 > 2
group2 <- pi3k > 2
group3 <- tp53 > 2 & pi3k > 2
group4 <- neither of the above

then pass those to the survival analysis

ADD REPLY
0
Entering edit mode

Is there is any formula to do do it automatically? I mean select all possible combinations. It is not a problem if you are interested in 2 genes but become much more complicated if you have 10 genes and want to correlate all possible combinations of them with survival.

ADD REPLY
0
Entering edit mode

that will require quite a lil bit of coding, I'd suggest you to refine your question first and then contact a bioinformatician/statistician to help you coding it

ADD REPLY
0
Entering edit mode

Hi, many thanks for the great tutorial. This command doesn't work for me: z = [(value gene X in tumor Y)-(mean gene X in normal)]/(standard deviation X in normal) It returns this error:

Error: unexpected '[' in "z = ["

Can you help me solve it? Thanks!

ADD REPLY
0
Entering edit mode

that's because you are trying to assign a variable starting with the square bracket, which in R is used to subset variables. I'd suggest to get familiar with R first, instead of copy-paste the code above. that will help you preventing lethal mistakes

ADD REPLY
0
Entering edit mode

Hi TriS,

Thank you so much for this tutorial! It is extremely helpful and insightful!

My apologies for the potential naive question I will be asking but for the KIRK data, I did the following code and get the following results:

> # check how many altered samples we have
> table(event_rna[ind_gene,])

>0   1 
>405 129

However, when i produce the graph I get the following, that doesn't match:

Is this normal? Am I missing something?

Please and thank you!

link to image: https://photos.app.goo.gl/sJUqVPLTJ2G8927Q6

ADD REPLY
0
Entering edit mode

Can you use all possible cutoff value between the lower and upper quartiles to compute and the best performing threshold to use as a cutoff?. Rather than using the simple medium value of gene expression.

If yes then, how to modify the above script?

I seen lots of difference in survival using medium value of gene and all possible cutoff value between upper and lower quartiles.

ADD REPLY
1
Entering edit mode

I just went through this myself. You can use different cutoffs (Tertiles, etc). Median is one of the more common cutoffs, I think this is mostly to divide the population evenly and retain all the samples. But if you look through some example papers you will see other cutoffs used. There is no hard definition of what is overexpressed and underexpressed. If you see an effect in patients that express your favorite gene above the upper tertile but not above the median that might just mean your gene needs to be highly expressed to create that effect. I would argue that trying a whole bunch of cutoffs to find an effect might fall under the category of finding data to fit a narrative, but I think it really depends on your question.

ADD REPLY
0
Entering edit mode

Why there is huge difference between estimated survival from this script and from cbioportal (https://www.cbioportal.org/)?

How to modify this script to estimate survival similar like cbioportal. (https://www.cbioportal.org/)

ADD REPLY
0
Entering edit mode

this post is now pretty old, the data on cbioportal have been updated. that's why you see a difference

ADD REPLY
0
Entering edit mode

Very nice explaination, refresh my point of view towards TCGA Survival analysis... one small suggestion: if it will be more readable by using dplyr instead of R base functions ? like clinical %>% select(contains("days_to_new_tumor_event_after_initial_treatment")) %>% filter_all(any_vars(!is.na(.))) %>% mutate(...) , sorry I am not very familier with dplyr but I believe it could get work done more efficiently

ADD REPLY
0
Entering edit mode

Hi TriS and community,

Thank you so much for this tutorial, it is being extremely helpful and insightful to me. How can i change the code for obtaining the z-score

    scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  # z score as (value - mean normal)/SD normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
    }
  }
  return(res)
}
z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])

applying this formula? z = [(value gene X in tumor Y)-(mean gene X in tumor)]/(standard deviation X in tumor) since ACC data has no normal values, so no n_index. I solved voom transformation but I am stucked with this.

Can I just t(scale(t(rna_vm))) ?

Sorry for the naive question. Im new in R and bioinformatics. Thank you so much.

ADD REPLY
0
Entering edit mode

Thank you very much for this tutorial, I am new to bioinformatics, it is being extremely helpful and insightful to me. if I want to plot only upregulated and downregulated curve, analyse pv for these two group, how to modify the code? change here?

s <- survfit(Surv(as.numeric(as.character(all_clin$new_death))[ind_clin],all_clin$death_event[ind_clin])~event_rna[ind_gene,ind_tum])
pv <- ifelse ( is.na(s1),next,(round(1 - pchisq(s1$chisq, length(s1$n) - 1),7)))[[1]]
ADD REPLY
0
Entering edit mode

Please open a new question and reference this thread in that question.

ADD REPLY
2
Entering edit mode
7.5 years ago
xushutan ▴ 40

A webserver for TCGA data Breast cancer survival curve in different subtypes: luminal A, luminal B, Basal, Her2 and Normal-like. http://tcgaportal.org/

ADD COMMENT
0
Entering edit mode

Hello All,

I'm getting an error when I'm running my code for

Error in Surv(day, vital_status == "Dead") : 
  Time and status are different lengths

any idea?

ADD REPLY

Login before adding your answer.

Traffic: 2364 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6