Hi everyone,
I'm currently working in R with TCGA clinical data aligned with gene sequencing data. I want to compute a log-rank p-value for every gene in order to find the genes related to the patients' overall survival (OS). However, when I import the sequencing data into R and call as.matrix() on it, it seems to be too large: RStudio just shows it as a "Large matrix (7657690 elements, 59.6 Mb)", a 20530 x 373 matrix. And when I tried apply() on it, I found the rows were not numeric (if you subset one row from it, you will find it is still classified as a data.frame). Consequently, I can only screen 150 genes out of 20530 at a time. 150 x 373 works, but it is far from enough.
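The row-subsetting behaviour described above can be reproduced on a tiny made-up table (a minimal sketch, not the TCGA data): a single row of a data.frame is still a data.frame, while a row of a converted matrix is a plain numeric vector.

```r
# Made-up two-gene, two-sample table standing in for the expression data
df <- data.frame(s1 = c(1.2, 0.5), s2 = c(3.4, 0.0),
                 row.names = c("geneA", "geneB"))

class(df[1, ])   # "data.frame" -- one row of a data.frame stays a data.frame

mat <- as.matrix(df)   # convert the whole table once, up front
class(mat[1, ])  # "numeric" -- a matrix row is a plain numeric vector
is.numeric(mat)  # TRUE
```

Converting the whole table with as.matrix() before apply() avoids the per-row data.frame problem entirely.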
I hope I'm making myself clear. :-P Is there any method to handle such a large matrix?
I would be very grateful if anyone could help.
Below is my code for computing the log-rank p-values:
# I used Excel to align the clinical data with the expression data first
dat <- read.csv('TCGA_EXP.csv', header = TRUE, row.names = 1, sep = ",")      # 20530 genes
rt  <- read.csv('TCGA_COAD_OS.csv', header = TRUE, row.names = 1, sep = ",")  # contains only the rows of OS time and status of patients
rt  <- t(rt)
os_years  <- rt[, 2]
os_status <- rt[, 1]
# Here I can only convert 100 genes into a matrix at a time, or it says it is a
# large matrix; and each row, when you subset it, is not numeric.
mat.part1 <- as.matrix(dat[1:100, ])

library(survival)
my.surv <- Surv(os_years, os_status == 1)
log_rank_p <- apply(mat.part1, 1, function(values1) {
  group <- ifelse(values1 > median(values1), 'high', 'low')
  kmfit2 <- survfit(my.surv ~ group)
  # plot(kmfit2)
  data.survdiff <- survdiff(my.surv ~ group)
  p.val <- 1 - pchisq(data.survdiff$chisq, length(data.survdiff$n) - 1)
})
names(log_rank_p[log_rank_p < 0.05])
You are going to get some flak for that. Don't use Excel for bioinformatics analysis; it is OK for reviewing files.
Next time, highlight the text you want to present as code and then use the "101" button in the edit window to format it as such. I have done this for you now.
See if this helps: https://stackoverflow.com/questions/21628718/creating-large-matrix-in-r
Hi genomax, thanks a lot for the reply. Honestly, I'm very new to R and tend to reach for whatever means will get the job done. That is a very helpful suggestion: I will no longer use Excel to pre-process the data downloaded from TCGA and will use R instead.
Thanks also for the moderation of the code. Really appreciated.
What is your exact problem? A 20530 x 373 matrix is a normal-sized matrix in R. Your bottleneck lies somewhere else.
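As a quick sanity check (using the dimensions from the post), a 20530 x 373 numeric matrix is roughly 7.7 million doubles, well within R's limits, and apply() over all of its rows runs without trouble:

```r
# All-zero stand-in matrix with the same dimensions as the expression data
m <- matrix(0, nrow = 20530, ncol = 373)

length(m)                            # 7657690 elements
print(object.size(m), units = "Mb")  # on the order of tens of Mb

# apply() over every one of the 20530 rows works fine at this size
row_medians <- apply(m, 1, median)
length(row_medians)                  # 20530
```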
Hi Santosh,
You are right, I really appreciate it!
The problem is not the size. It is that my expression matrix contains genes with a constant 0 read, so when survdiff() is called there is no "high" vs "low" group.
What does the "dat" matrix/table contain? Does it have column names? How is missing data coded? Are you sure the table only contains numerical data?
Thank you mbuvcm for inspiring me on this one.
You are right, the problem lies somewhere in the matrix. The matrix does indeed contain purely numeric content with no NAs.
There are genes with a constant 0 read, so when survdiff() is called there is no "high" vs "low" group.
Hello world! I found my problem! If you want to screen log-rank p-values using the code above, please pay attention to pre-processing your expression matrix!
The expression matrix should contain no NAs and no gene with a constant 0 read. And of course it should be perfectly aligned with the clinical survival data. With that done, you CAN screen all the log-rank p-values for a matrix of this size in one go.
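The pre-processing step described above can be sketched as a row filter (shown here on a tiny made-up table in place of the real 20530 x 373 TCGA matrix): drop genes with NAs, and drop genes whose median split would produce only one group, which is exactly what happens with constant zeros and what makes survdiff() fail.

```r
# Made-up expression table: geneZero has constant 0 reads across all patients
dat <- data.frame(
  p1 = c(5.1, 0, 2.0), p2 = c(3.3, 0, 2.1),
  p3 = c(8.7, 0, 1.9), p4 = c(1.2, 0, 2.2),
  row.names = c("geneA", "geneZero", "geneC")
)
mat <- as.matrix(dat)

# Keep a gene only if it has no NAs AND its median split yields both groups;
# a constant gene (all zeros) produces a single "low" group and is dropped.
keep <- apply(mat, 1, function(x) {
  if (anyNA(x)) return(FALSE)
  group <- ifelse(x > median(x), 'high', 'low')
  length(unique(group)) == 2
})
mat.filtered <- mat[keep, , drop = FALSE]

rownames(mat.filtered)  # "geneA" "geneC" -- geneZero is removed
```

mat.filtered can then be passed to the apply()/survdiff() loop from the original post for all remaining genes at once.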
Thanks again to those who helped me!