I have several hundred scRNA-seq count matrices, each from a different sample. For my other dataset containing a few dozen samples, I simply merged everything together into one Seurat object, but that won’t work here as far as I can tell. When I try to merge them in the typical way, I get the error:
Error in .cbind2Csp(x, y) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 89
This basically means I ran out of RAM. I even ran it on the cluster with 1 TB of RAM, and I got the same error. This is unsurprising, as memory requirements increase exponentially in relation to matrix size. It seems that this happens when the merge() function attempts to convert all the sparse matrices to dense matrices. Here is my code:
seurat_object_list <- "path/to/matrices" |> list.dirs(recursive = FALSE) |> lapply(\(matrix_folder) {
  matrix_folder |> Seurat::Read10X() |> Seurat::CreateSeuratObject()
})
seurat <- merge(seurat_object_list[[1]], seurat_object_list[2 : length(seurat_object_list)])
The Seurat object list is created with no issue, but it always fails on the merge. I tested this code with a smaller subset of the matrices, and it works just fine. It seems that I just have too many cells to process in this manner.
At this point, it looks like I have a few options.
- I could hack together some way to force the matrices into one Seurat object. But I don't know if that would even be useful, since I would likely keep bumping my head on the same memory constraints whenever I try to do anything with the combined object, such as running SCTransform() or FindMarkers().
- I could divide the samples into groups that are small enough to merge, and process them in the normal way. But this imposes limitations on the types of comparisons/analyses I could perform.
- I could keep the matrices separate and analyze them iteratively. This would solve the memory problems, but I haven't found any info online about anyone analyzing count matrices this way. I imagine that I would lose a lot of out-of-the-box functionality that comes with the standard single-cell R libraries. For example, I'm not sure how I would normalize all the gene counts against each other as separate objects. This option seems like it might require a lot of re-inventing the wheel and a lot of code.
I like option 3 the best, but I'm not sure where to begin. I am still pretty new to bioinformatics, so I would love to hear some input from those who are more experienced. How do you deal with memory constraints when dealing with a large number of cells/samples? Are there any standard ways to deal with this problem? Is there some tutorial/doc online that I failed to find? Any advice you could offer is greatly appreciated.
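As a starting point for option 3, here is a minimal sketch of accumulating per-gene statistics one sample at a time, so that only a single count matrix is in memory at any moment. The helper `accumulate_gene_stats()` and its `load_matrix` argument are hypothetical names made up for illustration, and the sketch assumes every matrix is genes x cells with the same gene rownames. It is not a replacement for a full normalization workflow.

```r
# Hypothetical helper: walk over samples one by one and keep only running
# per-gene totals, never the merged matrix.
accumulate_gene_stats <- function(load_matrix, sample_ids) {
  gene_sum <- NULL
  n_cells <- 0
  for (id in sample_ids) {
    # genes x cells sparse matrix, e.g. the result of Seurat::Read10X()
    counts <- load_matrix(id)
    s <- Matrix::rowSums(counts)
    # Align by gene name (assumes identical gene sets across samples).
    gene_sum <- if (is.null(gene_sum)) s else gene_sum + s[names(gene_sum)]
    n_cells <- n_cells + ncol(counts)
    rm(counts)  # drop the matrix before loading the next sample
  }
  list(gene_mean = gene_sum / n_cells, n_cells = n_cells)
}

# Toy stand-in for matrices on disk: two samples, two genes.
genes <- c("g1", "g2")
toy <- list(
  a = Matrix::sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(4, 2),
                           dims = c(2, 2), dimnames = list(genes, NULL)),
  b = Matrix::sparseMatrix(i = 1, j = 1, x = 2,
                           dims = c(2, 1), dimnames = list(genes, NULL))
)
stats <- accumulate_gene_stats(function(id) toy[[id]], names(toy))
print(stats$gene_mean)
```

With real data, `load_matrix` could be something like `\(dir) Seurat::Read10X(dir)` applied to each folder. The same pattern extends to sums of squares for variances, but anything that needs all cells at once (clustering, joint embeddings) would still need a different strategy.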
Not necessarily. It could mean that there is a hard-coded limit in the .c or .h files that can't fit your merged matrix. That would be the first thing I'd look into.
Separately, it may help if you explicitly declare your dense matrices as being populated by (long-)integers. If they are treated as matrices of double type, that would increase their sizes several times.
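One likely candidate for such a limit, as far as I know: the sparse classes in the Matrix package (such as the dgCMatrix objects that Seurat::Read10X() returns) store their indices as 32-bit integers, so a single matrix cannot hold more than 2^31 - 1 non-zero entries, no matter how much RAM is available. The sketch below counts the non-zero entries across a list of sparse matrices and compares the total with that cap. `total_nonzeros()` is a hypothetical helper, and the toy matrices stand in for real count matrices.

```r
# Hypothetical helper: total number of non-zero entries across a list of
# sparse matrices. Summed as doubles so the total itself cannot overflow.
total_nonzeros <- function(matrices) {
  sum(vapply(matrices, function(m) as.numeric(Matrix::nnzero(m)), numeric(1)))
}

# Toy stand-ins for per-sample count matrices (genes x cells).
toy <- list(
  Matrix::sparseMatrix(i = c(1, 3), j = c(1, 2), x = c(5, 2), dims = c(3, 2)),
  Matrix::sparseMatrix(i = 2, j = 1, x = 7, dims = c(3, 1))
)

n_stored <- total_nonzeros(toy)

# TRUE means the merged matrix would still fit within the 32-bit index cap.
fits_in_one_matrix <- n_stored <= .Machine$integer.max
print(c(n_stored = n_stored, cap = .Machine$integer.max))
print(fits_in_one_matrix)
```

If the total for the real matrices is above the cap, merging into one ordinary sparse matrix would be expected to fail regardless of the machine's memory.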
Cross-posted: https://bioinformatics.stackexchange.com/questions/19025/how-to-manage-memory-contraints-when-analyzing-a-large-number-of-gene-count-matr
If you add up all columns together, what dimensions would we be talking about? So how many rows and columns theoretically if everything was merged?
ATpoint
36k genes by 2 million cells.
Yes, I cross-posted. Just trying all my avenues. Is that against the rules?
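For a sense of scale, the 36k genes by 2 million cells figure can be turned into rough numbers. This is back-of-the-envelope arithmetic only. It assumes 8 bytes per value for a dense matrix of doubles and the 32-bit index cap of the Matrix package's sparse classes.

```r
n_genes <- 36000
n_cells <- 2e6

# Total number of entries in the merged genes x cells matrix.
n_entries <- n_genes * n_cells

# Size in GB if the matrix were ever stored densely as doubles
# (assumption: 8 bytes per value, 1 GB = 1e9 bytes).
dense_gb <- n_entries * 8 / 1e9

# Average number of non-zero entries per cell at which a single sparse
# matrix would reach the 32-bit cap on stored entries.
max_nonzero_per_cell <- .Machine$integer.max / n_cells

cat(n_entries, "entries;", dense_gb, "GB as dense doubles;",
    floor(max_nonzero_per_cell), "non-zero entries per cell at the cap\n")
```

That comes to 7.2e10 entries, 576 GB as dense doubles, and a budget of a little over 1,000 detected genes per cell on average before one sparse matrix can no longer index its entries.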
Yeah, that’s not going to fit into memory. See for example http://bioconductor.org/books/3.14/OSCA.advanced/dealing-with-big-data.html
No, cross-posting is not forbidden, but it’s appreciated if you indicate it by adding a link, so that users don’t duplicate effort.
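The linked chapter covers file-backed matrices, among other things. A minimal sketch of that idea, assuming the Bioconductor package HDF5Array is installed: each sample is written to an HDF5 file and then combined as a delayed (on-disk) matrix rather than an in-memory one. The toy matrices, the dataset names, and the temporary file are placeholders, and I have not tried this at the 2-million-cell scale.

```r
library(HDF5Array)  # Bioconductor; attaches DelayedArray as a dependency

# Toy stand-ins for two per-sample count matrices (genes x cells).
genes <- c("g1", "g2", "g3")
m1 <- Matrix::sparseMatrix(i = c(1, 3), j = c(1, 2), x = c(5, 2),
                           dims = c(3, 2), dimnames = list(genes, NULL))
m2 <- Matrix::sparseMatrix(i = 2, j = 1, x = 7,
                           dims = c(3, 1), dimnames = list(genes, NULL))

# Write each sample to its own dataset in an HDF5 file. The returned objects
# point at the data on disk instead of holding it in RAM.
h5_file <- tempfile(fileext = ".h5")
d1 <- writeHDF5Array(m1, filepath = h5_file, name = "sample1")
d2 <- writeHDF5Array(m2, filepath = h5_file, name = "sample2")

# cbind() on these objects is delayed: it records the operation without
# building the combined matrix in memory.
combined <- cbind(d1, d2)
print(dim(combined))

# Summaries are computed block by block from disk.
print(colSums(combined))
```

Whether downstream Seurat functions accept such objects is a separate question. The Bioconductor tools described in the linked book are the ones designed around this representation.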