Hi everyone,
I'm currently working in R with TCGA clinical data aligned with gene sequencing data. I want to compute a log-rank p-value for every gene in order to find the genes related to the patients' overall survival (OS). However, when I import the sequencing data into R and call as.matrix() on it, it seems to be too large: RStudio just shows it as a "Large matrix (7657690 elements, 59.6 Mb)", a 20530 x 373 matrix. And when I tried apply() on it, I found the rows were not numeric (if you subset one row from it, you will find it is still classified as a data.frame). Consequently, I can only screen 150 genes out of 20530 at a time. 150 x 373 works, but it is far from enough.
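The row-subsetting behaviour described above can be reproduced on a tiny made-up table (a minimal sketch, not the TCGA data): a single row of a data.frame is still a data.frame, while a row of a converted matrix is a plain numeric vector.

```r
# Made-up two-gene, two-sample table standing in for the expression data
df <- data.frame(s1 = c(1.2, 0.5), s2 = c(3.4, 0.0),
                 row.names = c("geneA", "geneB"))

class(df[1, ])   # "data.frame" -- one row of a data.frame stays a data.frame

mat <- as.matrix(df)   # convert the whole table once, up front
class(mat[1, ])  # "numeric" -- a matrix row is a plain numeric vector
is.numeric(mat)  # TRUE
```

Converting the whole table with as.matrix() before apply() avoids the per-row data.frame problem entirely.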
I hope I'm making myself clear. :-P Is there any method to handle such a large matrix?
I would be very grateful if anyone could help.
Below is my code for computing the log-rank p-values:
# I used Excel to align the clinical data with the expression data first
dat <- read.csv('TCGA_EXP.csv', header = TRUE, row.names = 1, sep = ",")      # 20530 genes
rt  <- read.csv('TCGA_COAD_OS.csv', header = TRUE, row.names = 1, sep = ",")  # contains only the rows of OS time and status of patients
rt  <- t(rt)
os_years  <- rt[, 2]
os_status <- rt[, 1]
# Here I can only convert 100 genes into a matrix at a time, or it says it is a
# large matrix; and each row, when you subset it, is not numeric.
mat.part1 <- as.matrix(dat[1:100, ])

library(survival)
my.surv <- Surv(os_years, os_status == 1)
log_rank_p <- apply(mat.part1, 1, function(values1) {
  group <- ifelse(values1 > median(values1), 'high', 'low')
  kmfit2 <- survfit(my.surv ~ group)
  # plot(kmfit2)
  data.survdiff <- survdiff(my.surv ~ group)
  p.val <- 1 - pchisq(data.survdiff$chisq, length(data.survdiff$n) - 1)
})
names(log_rank_p[log_rank_p < 0.05])
You are going to get some flak for that. Don't use Excel for bioinformatics analysis; it is OK for reviewing files.
Next time, highlight the text you want to present as code and then use the "101" button in the edit window to format it as such. I have done this for you now.
See if this helps: https://stackoverflow.com/questions/21628718/creating-large-matrix-in-r
Hi genomax, thanks a lot for the reply. Honestly, I'm very new to R and tend to reach for whatever means will get the job done. That is a very helpful suggestion: I will no longer use Excel to pre-process the data downloaded from TCGA and will use R instead.
Thanks also for the moderation of the code. Really appreciated.
What is your exact problem? A 20530 x 373 matrix is a normal-sized matrix in R. Your bottleneck lies somewhere else.
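As a quick sanity check (using the dimensions from the post), a 20530 x 373 numeric matrix is roughly 7.7 million doubles, well within R's limits, and apply() over all of its rows runs without trouble:

```r
# All-zero stand-in matrix with the same dimensions as the expression data
m <- matrix(0, nrow = 20530, ncol = 373)

length(m)                            # 7657690 elements
print(object.size(m), units = "Mb")  # on the order of tens of Mb

# apply() over every one of the 20530 rows works fine at this size
row_medians <- apply(m, 1, median)
length(row_medians)                  # 20530
```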
Hi Santosh,
You are right, I really appreciate it!
The problem is not the size. It is that my expression matrix contains genes with a constant 0 read, so when survdiff() is called there is no "high" vs "low" group.
What does the "dat" matrix/table contain? Does it have column names? How is missing data coded? Are you sure the table only contains numerical data?
Thank you mbuvcm for inspiring me on this one.
You are right, the problem lies somewhere in the matrix. The matrix does indeed contain purely numeric content with no NAs.
There are genes with a constant 0 read, so when survdiff() is called there is no "high" vs "low" group.
Hello world! I found my problem! If you want to screen log-rank p-values using the code above, please pay attention to pre-processing your expression matrix!
The expression matrix should contain no NAs and no gene with a constant 0 read. And of course it should be perfectly aligned with the clinical survival data. With that done, you CAN screen all the log-rank p-values for a matrix of this size in one go.
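The pre-processing step described above can be sketched as a row filter (shown here on a tiny made-up table in place of the real 20530 x 373 TCGA matrix): drop genes with NAs, and drop genes whose median split would produce only one group, which is exactly what happens with constant zeros and what makes survdiff() fail.

```r
# Made-up expression table: geneZero has constant 0 reads across all patients
dat <- data.frame(
  p1 = c(5.1, 0, 2.0), p2 = c(3.3, 0, 2.1),
  p3 = c(8.7, 0, 1.9), p4 = c(1.2, 0, 2.2),
  row.names = c("geneA", "geneZero", "geneC")
)
mat <- as.matrix(dat)

# Keep a gene only if it has no NAs AND its median split yields both groups;
# a constant gene (all zeros) produces a single "low" group and is dropped.
keep <- apply(mat, 1, function(x) {
  if (anyNA(x)) return(FALSE)
  group <- ifelse(x > median(x), 'high', 'low')
  length(unique(group)) == 2
})
mat.filtered <- mat[keep, , drop = FALSE]

rownames(mat.filtered)  # "geneA" "geneC" -- geneZero is removed
```

mat.filtered can then be passed to the apply()/survdiff() loop from the original post for all remaining genes at once.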
Thanks again to those who helped me!