vector memory exhausted running dist() on a single ADT dataset
5.2 years ago
cook.675 ▴ 230

This is a cross-post from the Satija github forum; I thought I may get more eyes on this forum so I'm also posting here:

Session info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

I'm running a MacBook with 8GB of RAM. I am following this vignette for CITE-seq; however, I am only loading and working with the ADT data. I am starting the vignette from the "Cluster directly on protein levels" section.

Everything is fine until I get to the following command: adt.dist <- dist(t(adt.data)) which returns the following: Error: vector memory exhausted (limit reached?)

I have tried setting my R_MAX_VSIZE variable anywhere from 8GB to 700GB, as suggested on Stack Overflow when I was trying to troubleshoot this. I also checked that this value is correct when I load up R, using Sys.getenv("R_MAX_VSIZE")

In order to maximize my chances of success, just prior to the troublesome line of code being executed I have cleared all unused objects from the work space and I also ran garbage collection.

When I do this and run mem_used(), it returns a value of 387 MB. adt.data is the only variable in the workspace prior to running dist(), and object.size(adt.data) returns a value of 212MB

I can't think of anything else to try. It doesn't feel like my machine is incapable of running this, it doesn't seem that big. Is there another solution to this problem? Please let me know if you'd like any additional information. Thanks so much!

Edit: I just tried running it on a friend's machine and the same error came up, only it said:

Error: cannot allocate vector of size 2025.0 Gb

Well I guess that's the problem.... I don't have 2 Tb.... is there a way to shrink this or run an alternate type of PCA in order to do the clustering with the ADT data alone?

Edit 2: I just tried changing R_MAX_VSIZE to 2200Gb and rerunning. The program accepted it and I let it cook for awhile and came back an hour later and got the following message:

R session Aborted. R encountered a fatal error. The session was terminated
5.2 years ago

This error means that R can't find enough contiguous memory. You may have enough RAM in total, but if it's fragmented you may only have access to small contiguous chunks. To mitigate the issue from within a program, try to allocate objects in order of decreasing size (i.e. big matrices first) and lifetime, so that smaller objects can fit in the footprint left when larger ones are destroyed, though this may not always be practical when it's not your own code. I've occasionally had success running gc(), but it may have worked just by chance. Sometimes the solution is to reboot, if other (often long-running) processes are holding on to memory.


Yeah, I've tried everything mentioned: rebooting, closing every application, garbage collection, clearing all unused and unnecessary variables, increasing the max allowable memory....

I tried it on a desktop we have with 32GB of RAM on Windows 10 and I get the same error; I can't overcome it no matter what. I'm not sure what the next step is. I guess submitting it to our campus computing cluster, but I was really trying to avoid that.

The line right before the error is adt.data <- GetAssayData(Adt, slot = "data") and I tried adding "as.sparse" right before the function call, but that didn't help at all


What is the dimension of the distance matrix you're trying to compute? Is it possible that you're not computing on the expected data (e.g. computing dist between rows vs between columns or wrong data in the data frame)? Also check that when starting from the middle of the vignette you haven't missed any data preprocessing steps.


"What is the dimension of the distance matrix you're trying to compute?"

I'm not sure exactly. I think I'm still trying to understand the dist() function to figure this out. The dimensions of the matrix passed to dist() are 25 x 737280. What does the "t" do here? dist(t(x)) ?

"Is it possible that you're not computing on the expected data"

Yes I will look into this some more presently......

"Also check that when starting from the middle of the vignette you haven't missed any data preprocessing steps."

Double and triple checked it should be alright


t() takes the transpose so t(x) is in your case a 737280 x 25 matrix. dist(x) computes the distance between the rows of x so dist(t(x)) in your case computes a 737280 x 737280 distance matrix which would take ~4TB of RAM if stored as a dense matrix or ~2TB as a dist object.
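As a rough illustration of the arithmetic above, here is a minimal sketch in base R. The toy matrix x is a stand-in for adt.data, and dist_gib() is a hypothetical helper written for this example (not part of R or Seurat); it assumes 8 bytes per stored distance and ignores object overhead.

```r
# Toy stand-in for adt.data: 3 proteins (rows) x 5 cells (columns).
x <- matrix(rnorm(15), nrow = 3, ncol = 5)

dim(t(x))          # 5 3: after transposing, cells are the rows
d <- dist(t(x))    # pairwise distances between the 5 cells
attr(d, "Size")    # number of rows compared: 5
length(d)          # stored values: 5 * 4 / 2 = 10

# Hypothetical helper: approximate GiB needed for a dist object over n rows.
# A dist object stores only the lower triangle, n * (n - 1) / 2 doubles.
dist_gib <- function(n) {
  n <- as.numeric(n)              # avoid integer overflow for large n
  n * (n - 1) / 2 * 8 / 2^30
}

dist_gib(737280)                  # just under 2025 GiB, as in the error message

# The full dense n x n matrix needs about twice that.
dense_gib <- 737280^2 * 8 / 2^30  # 4050 GiB
```

Calling dist_gib(ncol(adt.data)) before dist(t(adt.data)) would give a quick estimate of whether the result can fit in memory at all.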


I see, thanks for that. Then yes, that would be the correct data, since the 25 columns are just identifiers and we would want to calculate dist between rows. I've been busy and haven't had a chance to run this on our computing cluster to see if it will work, but I will report back.
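One possible way to "shrink" the problem, sketched below, is to compute distances on a random subset of cells so the dist object stays small. This is only an assumption about what might be acceptable for the analysis, not something the vignette or this thread confirms; adt.sim is simulated data standing in for adt.data, and the sizes (20000 cells, 2000 kept) are arbitrary.

```r
set.seed(42)

# Simulated stand-in for adt.data: 25 proteins (rows) x 20000 cells (columns).
# These dimensions are made up for illustration.
adt.sim <- matrix(rnorm(25 * 20000), nrow = 25)

# Keep a random subset of cells (columns); 2000 is an arbitrary choice.
n.keep <- 2000
keep <- sample(ncol(adt.sim), n.keep)
adt.sub <- adt.sim[, keep]

# Distances between the kept cells only.
adt.dist <- dist(t(adt.sub))

attr(adt.dist, "Size")        # 2000 cells compared
length(adt.dist) * 8 / 2^20   # memory for the stored distances, in MiB
```

Because memory grows with the square of the number of cells, keeping a tenth of the cells cuts the distance object to roughly a hundredth of the size.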
