Huge matrix memory issue
3
0
Entering edit mode
7.7 years ago
dr_bantz ▴ 110

Hi everyone,

I have a WGS dataset consisting of ~2000 samples. I want to look at the co-occurrence of mutations (ie. how often a given pair of genes is mutated in the same strain). The way I'm doing this requires the creation of a 30,000 X 30,000 matrix to represent all pairwise comparisons of genes in the genome. I tried this in R but it just stalls.

Any suggestions? I'm most proficient and R and Python but I've been looking for an excuse to learn C/C++/Java for ages so this might be a good opportunity. Also, I have access to a compute cluster (which includes, amongst other things, some 96GB nodes).

Thanks in advance for your help!

genome • 1.8k views
ADD COMMENT
0
Entering edit mode

What is it you're trying to do ? Do you want to compute this matrix or do you have it already and want to process it in some ways ? In the first instance, you could parallelize the computation of the matrix e.g. by row and reassemble the pieces you need in a full matrix at the end. If most of the elements are 0, you could try using a sparse matrix format (see for example the Matrix package in R). However, a 30 000 x 30 000 matrix should require "only" ~7 GB of RAM (assuming 8 bytes per cell).

ADD REPLY
0
Entering edit mode

Each i,jth cell would contain the number of times gene i and gene j are mutated in the same train, so this would be have to be computed. The plan is to then make a second matrix with the Poisson probability for the value i,j based on the occurrence of mutation in gene i and j. Your parallelization idea might work nicely.

ADD REPLY
0
Entering edit mode

R isn't anywhere near memory efficient. It's going to need more RAM than we can estimate. I suggest this task be computed as a list of genes, so each is a separate entity that does not need to be kept in memory. Call an optimized program like bedtools to calculate each row for you in one step.

ADD REPLY
0
Entering edit mode

At least on my Mac, R 3.3.2 appears to deal with matrices in an efficient way:

m <- matrix(numeric(2.3),nrow = 30000, ncol=30000) 
object.size(m)
7200000200 bytes

which is ~7 GB.

ADD REPLY
0
Entering edit mode

instead of a matrix can't you use a table with 3 columns LEFT TOP VALUE and then use the common linux tools ?

ADD REPLY
0
Entering edit mode
7.7 years ago
Girolamo ▴ 140

I assume you have a sparse matrix you could represent the matrix as a combination of three arrays http://btechsmartclass.com/DS/U1_T14.html another possibility is to use shelve https://docs.python.org/2/library/shelve.html

ADD COMMENT
0
Entering edit mode
7.7 years ago

Shane Neph and I came up with a tool called byte-store to store encoded pairwise score data on disk in one byte per comparison, as well as offer random access querying, calculating byte offsets from sort-bed-sorted BED coordinates of interest. The approach in byte-store is useful for generate-once-lookup-often operations, like Monte Carlo sampling, etc.

The byte-store tool includes Pearson r and Spearman's rho correlation scoring functions, which map scores between -1 and 1 to a single byte value. If you want to learn some C, you could write a function pointer that implements a custom pairwise scoring function for your dataset, and you might remap the correlation score range to your dataset's particular score range through an easy rescale function (you specify your min-max and remap a score to this range). Your dataset would be a 900MB file (30k * 30k bytes).

ADD COMMENT
0
Entering edit mode
7.7 years ago
dr_bantz ▴ 110

Thanks for all the suggestions! In the end I went for a row-by-row appraoch, as suggested by Jean-Karim Heriche karl.stamm. Using this approach as it is would take 10 days to run on the cluster. However, the approach has some helpful advantages, one of which being that I am now able to embarrassingly parallelize the process by assigning subsets of rows to different processors on the cluster. Another trick I used was to remove the first gene of each row after the row has been analysed. This prevents reciprocal gene-gene combinations from being counted twice and effectively converts the matrix into an upper-triangular matrix. Because the rows lower down will be shorter, they will take less time to process, so I can assign more and more rows to a given processor as I iterate through the rows. By employing the above, I got the run time down to ~7 hours.

ADD COMMENT

Login before adding your answer.

Traffic: 1891 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6