Question

R: Error In Pvclust Function While Clustering

1

11.9 years ago

Diana ▴ 930

Hi all,

I'm trying to cluster RNA-seq data using the pvclust function from the pvclust package, but it gives me this error: cannot allocate vector of length 1623767616. I'm wondering if this is because I have 40296 genes and it's too much data?

My code is this:

library(pvclust)
test2 <- read.csv("RNAseq_to_cluster.csv", sep = ",")
test3 <- test2[, 2:4]  # columns contain samples
row.names(test3) <- test2$gene
matrix <- data.matrix(test3)
transpose <- t(matrix)  # genes in columns, because pvclust clusters columns
pv <- pvclust(transpose, method.dist = "correlation", method.hclust = "average", nboot = 1000)

Error in cor(x, method = "pearson", use = use.cor) : 
  cannot allocate vector of length 1623767616

EDIT: first few lines of the input file:

gene    sample1    sample2    sample3
Mar-01    4.19504    3.9006    4.15683
Mar-02    3.0554    3.4261    3.76675
Sep-02    77.1536    65.1284    76.4927
Mar-03    1.01555    1.28626    0.461987

Please help.

Thanks!

r clustering • 5.5k views

ADD COMMENT • link updated 11.9 years ago by Damian Kao 16k • written 11.9 years ago by Diana ▴ 930

0

Yeah, there isn't enough memory to allocate a vector of that size. But I don't see why it would need a vector that large for what you are doing. Can you post the first few lines of the CSV input file?

ADD REPLY • link 11.9 years ago by Damian Kao 16k

0

I've posted the first few lines of the input file.

ADD REPLY • link 11.9 years ago by Diana ▴ 930

0

Try repeating with fewer genes to see whether it runs. I assume you have reached the R memory limit of 4 GB. Check this post and this post for possible workarounds.
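One way to try a smaller gene set is to keep only the most variable genes before clustering. The sketch below is a minimal illustration on synthetic data; the helper name top_variable_genes, the variance criterion, and the cutoff of 50 genes are assumptions made for this example and do not come from the thread.

```r
# Hypothetical helper: keep the n_keep genes with the highest variance
# across samples (rows = genes, columns = samples).
top_variable_genes <- function(expr, n_keep = 2000) {
  v <- apply(expr, 1, var)
  keep <- order(v, decreasing = TRUE)[seq_len(min(n_keep, nrow(expr)))]
  expr[keep, , drop = FALSE]
}

# Synthetic stand-in for the real table: 500 genes x 3 samples.
set.seed(1)
expr <- matrix(rexp(1500), nrow = 500,
               dimnames = list(paste0("gene", 1:500), paste0("sample", 1:3)))

small <- top_variable_genes(expr, n_keep = 50)
print(dim(small))

# The reduced matrix could then be passed to pvclust with genes in columns:
# pv <- pvclust(t(small), method.dist = "correlation",
#               method.hclust = "average", nboot = 1000)
```

Whether variance is a sensible filter depends on the data; any other criterion that shrinks the gene list would reduce the memory needed in the same way.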

ADD REPLY • link 11.9 years ago by Sukhi Singh 11k

0

Statistically, it's not a great idea to blow up a 40k × 3 dataset into a 40k × 40k correlation matrix.
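The length in the error message is exactly the number of cells in a gene-by-gene correlation matrix for 40296 genes, which a little arithmetic confirms. The memory estimate assumes R's usual 8 bytes per double-precision value and ignores any extra copies made along the way.

```r
n_genes <- 40296

# Cells in a genes x genes correlation matrix
n_cells <- n_genes^2
print(n_cells == 1623767616)   # TRUE: matches the length in the error

# Approximate size of one such matrix, at 8 bytes per double
gib <- n_cells * 8 / 1024^3
print(round(gib, 1))           # 12.1 (GiB)
```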

ADD REPLY • link 11.9 years ago by Ben ★ 2.0k

score 0 · Answer 1 · 2013-02-06

0

11.9 years ago

Damian Kao 16k

I don't think you need to do much to your input data to run the pvclust function. The transposition of the data matrix might be the problem. Instead of computing pairwise correlations for just 3 sets of data (sample1, sample2, sample3), the transposition might be telling pvclust to compute them for 40,000 sets of data (genes).

Try just this:

data <- as.matrix(read.csv("RNAseq_to_cluster.csv", sep = ",", header = TRUE, row.names = 1))
# No transpose: pvclust then clusters the 3 sample columns
pv <- pvclust(data, method.dist = "correlation", method.hclust = "average", nboot = 1000)
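The effect of the transpose on the size of the correlation matrix can be checked on a toy matrix. The error trace in the question shows pvclust calling cor() on its input, and cor() correlates columns, so the orientation of the matrix decides whether the result is samples × samples or genes × genes. The 5 × 3 matrix below is made-up data for illustration only.

```r
# Toy data: 5 genes (rows) x 3 samples (columns)
set.seed(42)
expr <- matrix(rnorm(15), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:3)))

# cor() works column-wise: untransposed gives samples x samples
print(dim(cor(expr)))      # 3 3

# Transposed gives genes x genes, which grows with the square of the gene count
print(dim(cor(t(expr))))   # 5 5
```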

ADD COMMENT • link 11.9 years ago by Damian Kao 16k

0

pvclust clusters columns, which is why I was using the transpose function; otherwise it just clusters the samples, whereas I want to cluster the genes according to their expression profiles in the 3 samples.

ADD REPLY • link 11.9 years ago by Diana ▴ 930

0

I see. I skimmed through the pvclust description and thought you just wanted to cluster by sample. Perhaps the package just wasn't designed to cluster that many columns? Are you specifically interested in the p-values pvclust generates? If not, there are plenty of generic hierarchical clustering tools out there that will handle large numbers of genes and run faster. Clustering using Python's SciPy is pretty fast. You might want to look at this also: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
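If the bootstrap p-values are not needed, plain hierarchical clustering of genes can also be sketched in base R. The example below uses synthetic data and a distance of 1 minus the Pearson correlation with average linkage, to mirror the settings in the question; to my understanding this is also how pvclust defines its "correlation" distance, but treat that as an assumption. The gene count of 60 and the choice of 4 groups are arbitrary.

```r
# Synthetic stand-in: 60 genes (rows) x 3 samples (columns)
set.seed(7)
expr <- matrix(rexp(60 * 3), nrow = 60,
               dimnames = list(paste0("gene", 1:60), paste0("sample", 1:3)))

# Correlation distance between genes: cor() works on columns, hence t()
gene_dist <- as.dist(1 - cor(t(expr)))

# Average-linkage hierarchical clustering, no bootstrap p-values
hc <- hclust(gene_dist, method = "average")

# Cut the tree into 4 groups (arbitrary choice for the example)
groups <- cutree(hc, k = 4)
print(table(groups))
```

Note that this still builds a gene-by-gene matrix, so for all 40296 genes the same memory concern applies in R; it is most useful on a reduced gene set.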

ADD REPLY • link 11.9 years ago by Damian Kao 16k