Hi all,
working in R, I need to merge 24 big arrays (on average 2.5 million points each, stored as RData) in order to compute overall statistics such as the mean, median and percentiles. Loading everything into memory is not feasible, so I was wondering if you could suggest a strategy for this problem. I read about the ff package but I cannot find usage examples that fit my problem.
If memory limitation is the issue for your 24 x 2.5 million points, you may want to take a look at the HDF5Array package. The HDF5Matrix object in particular is advertised to support standard matrix operations like rowSums and colSums over large on-disk matrices.
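Something along these lines might work (an untested sketch; it assumes each RData file holds a single numeric vector of the same length, and the directory, file and dataset names are made up). The mean is computed block-wise on disk; exact medians/percentiles would still need either the full values or a separate streaming pass.

library(HDF5Array)

# Hypothetical layout: 24 .RData files in "rdata_dir", each holding one
# numeric vector of the same length. Write each vector into an on-disk HDF5
# dataset, then bind the lazy HDF5Matrix objects into one DelayedMatrix.
files  <- list.files("rdata_dir", pattern = "\\.RData$", full.names = TRUE)
h5file <- tempfile(fileext = ".h5")

for (i in seq_along(files)) {
  env <- new.env()
  load(files[i], envir = env)                    # one array in memory at a time
  x <- as.numeric(get(ls(env)[1], envir = env))
  writeHDF5Array(matrix(x, ncol = 1),
                 filepath = h5file, name = paste0("arr", i))
  rm(x, env); gc()
}

mats <- lapply(seq_along(files),
               function(i) HDF5Array(h5file, paste0("arr", i)))
big  <- do.call(cbind, mats)                     # on-disk DelayedMatrix, 24 columns

mean(big)                                        # block-processed, never fully in RAM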
There's the bigmemory package. You could also read the data in chunks (as in the mainframe days). There are algorithms to compute running statistics.
Alternatively, do the work in a memory-efficient language like C.
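For the chunked route, here is a rough sketch of what I mean (it assumes each RData file holds a single numeric vector, and the directory/file names are made up). The mean is exact; the median and percentiles come from a pooled random subsample, so they are approximate.

files <- list.files("rdata_dir", pattern = "\\.RData$", full.names = TRUE)

total_sum <- 0
total_n   <- 0
keep_per_file <- 5e4        # small subsample per file for approximate percentiles
subsample <- numeric(0)

for (f in files) {
  env <- new.env()
  load(f, envir = env)                        # one array in memory at a time
  x <- as.numeric(get(ls(env)[1], envir = env))

  total_sum <- total_sum + sum(x)             # running accumulators for the mean
  total_n   <- total_n   + length(x)

  # pooled random subsample; roughly unbiased if the files are of similar size
  subsample <- c(subsample, sample(x, min(keep_per_file, length(x))))

  rm(x, env); gc()
}

overall_mean   <- total_sum / total_n
approx_median  <- median(subsample)
approx_pctiles <- quantile(subsample, c(0.05, 0.25, 0.75, 0.95))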
Hi Nicola,
could you please make the connection to bioinformatics explicit? I can guess there are a million reasons why a bioinformatician would need these statistics, but we sort of have this requirement on this site. Otherwise you might find the solution on Stack Overflow already.