I have a WGS dataset consisting of ~2000 samples. I want to look at the co-occurrence of mutations (ie. how often a given pair of genes is mutated in the same strain). The way I'm doing this requires the creation of a 30,000 X 30,000 matrix to represent all pairwise comparisons of genes in the genome. I tried this in R but it just stalls.
Any suggestions? I'm most proficient and R and Python but I've been looking for an excuse to learn C/C++/Java for ages so this might be a good opportunity. Also, I have access to a compute cluster (which includes, amongst other things, some 96GB nodes).
What is it you're trying to do ? Do you want to compute this matrix or do you have it already and want to process it in some ways ?
In the first instance, you could parallelize the computation of the matrix e.g. by row and reassemble the pieces you need in a full matrix at the end.
If most of the elements are 0, you could try using a sparse matrix format (see for example the Matrix package in R).
However, a 30 000 x 30 000 matrix should require "only" ~7 GB of RAM (assuming 8 bytes per cell).
Each i,jth cell would contain the number of times gene i and gene j are mutated in the same train, so this would be have to be computed. The plan is to then make a second matrix with the Poisson probability for the value i,j based on the occurrence of mutation in gene i and j. Your parallelization idea might work nicely.
R isn't anywhere near memory efficient. It's going to need more RAM than we can estimate.
I suggest this task be computed as a list of genes, so each is a separate entity that does not need to be kept in memory. Call an optimized program like bedtools to calculate each row for you in one step.
Shane Neph and I came up with a tool called byte-store to store encoded pairwise score data on disk in one byte per comparison, as well as offer random access querying, calculating byte offsets from sort-bed-sorted BED coordinates of interest. The approach in byte-store is useful for generate-once-lookup-often operations, like Monte Carlo sampling, etc.
The byte-store tool includes Pearson r and Spearman's rho correlation scoring functions, which map scores between -1 and 1 to a single byte value. If you want to learn some C, you could write a function pointer that implements a custom pairwise scoring function for your dataset, and you might remap the correlation score range to your dataset's particular score range through an easy rescale function (you specify your min-max and remap a score to this range). Your dataset would be a 900MB file (30k * 30k bytes).
Thanks for all the suggestions! In the end I went for a row-by-row appraoch, as suggested by Jean-Karim Heriche karl.stamm. Using this approach as it is would take 10 days to run on the cluster. However, the approach has some helpful advantages, one of which being that I am now able to embarrassingly parallelize the process by assigning subsets of rows to different processors on the cluster. Another trick I used was to remove the first gene of each row after the row has been analysed. This prevents reciprocal gene-gene combinations from being counted twice and effectively converts the matrix into an upper-triangular matrix. Because the rows lower down will be shorter, they will take less time to process, so I can assign more and more rows to a given processor as I iterate through the rows. By employing the above, I got the run time down to ~7 hours.
What is it you're trying to do ? Do you want to compute this matrix or do you have it already and want to process it in some ways ? In the first instance, you could parallelize the computation of the matrix e.g. by row and reassemble the pieces you need in a full matrix at the end. If most of the elements are 0, you could try using a sparse matrix format (see for example the Matrix package in R). However, a 30 000 x 30 000 matrix should require "only" ~7 GB of RAM (assuming 8 bytes per cell).
Each i,jth cell would contain the number of times gene i and gene j are mutated in the same train, so this would be have to be computed. The plan is to then make a second matrix with the Poisson probability for the value i,j based on the occurrence of mutation in gene i and j. Your parallelization idea might work nicely.
R isn't anywhere near memory efficient. It's going to need more RAM than we can estimate. I suggest this task be computed as a list of genes, so each is a separate entity that does not need to be kept in memory. Call an optimized program like bedtools to calculate each row for you in one step.
At least on my Mac, R 3.3.2 appears to deal with matrices in an efficient way:
which is ~7 GB.
instead of a matrix can't you use a table with 3 columns
LEFT TOP VALUE
and then use the common linux tools ?