Which R Packages, If Any, Are Best For Parallel Computing?
14.7 years ago
Ian Simpson ▴ 960

I have started running R jobs on a high-performance compute cluster, but inevitably find that I am just running array jobs, which speed up the experiment but ultimately do not take advantage of true parallelism.

I would like to start writing some parallel code in R where parallelized functions are available, and wondered what people's experiences of this are and which packages you would recommend?

r parallel • 22k views

Thanks everyone for the answers. In the end I'm going for REvolution R as the best answer, mainly because it is a big step forward for learning and programming in parallel with R. It installs the doMC and multicore packages that Chris mentioned and achieves some pretty impressive speed-ups on operations like matrix multiplication. Watch out for it hammering the CPUs, though.

14.7 years ago
User 59 13k

Have you looked at REvolution R?

sudo apt-get install revolution-r

will get you up and running in Ubuntu in no time.

"REvolution R runs many computationally-intensive programs faster, especially on multiprocessor systems. REvolution R is built with high-performance compilers and linked with computational libraries that take advantage of multiple processors simultaneously to reduce the time to complete many common mathematical operations. You do not need to modify your code to benefit from these optimizations."


Thanks Daniel, we don't have this on our HPC nodes, just the canonical R release; that could be a cost thing. I'm not sure whether it would be free if the University put it on a service, but that's something they could find out. I will definitely ask them about it and try it out on my local cluster in the meantime.


The non-Enterprise version that comes from the Ubuntu repositories is most definitely free, but I don't know how far their 'commercial' offering actually extends beyond this, other than support.

See the CRAN Task View: High-Performance and Parallel Computing with R, which lists the available parallel-computing packages.
0
Entering edit mode

Thanks Neil, I was just about to add that view link from CRAN.


How well does R scale these days (not counting the Revolution Enterprise version)?


As R is an interpreted language, it basically will not beat an implementation in something like C or C++.

It also has memory issues (it holds a lot in memory). Neither of these problems is insurmountable, and the availability of lots of powerful statistical functionality makes R attractive.

If a piece of the code I'm working with is problematic, I implement it in C and then access it from R. If memory is an issue, I use something like 'ff' or read data in from a database as I need it. The alternative, I guess, is to use proprietary software like SAS or Matlab.
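
To illustrate the C-from-R route, here is a minimal sketch using R's .C interface; the file square.c and the routine vec_square are made-up examples rather than anything from an existing package.

    ## hypothetical C file square.c, compiled with:  R CMD SHLIB square.c
    ##   void vec_square(double *x, int *n) {
    ##       for (int i = 0; i < *n; i++) x[i] = x[i] * x[i];
    ##   }

    dyn.load("square.so")    # square.dll on Windows

    vec_square <- function(x) {
      # .C copies its arguments to C, runs the routine, and returns the modified copies
      .C("vec_square", x = as.double(x), n = as.integer(length(x)))$x
    }

    vec_square(1:5)    # 1 4 9 16 25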

14.7 years ago

Maybe this question is best suited for stackoverflow.

Have a look at this discussion and follow this search.


Thanks Giovanni, I hadn't looked there. I guess I was specifically looking for experiences from Bioinformaticians, but these links look very useful.

14.7 years ago

Fundamentally, the most important consideration in parallel computing is whether inter-process communication is required.

Many problems require no such communication and can therefore simply be parallelized by splitting the input data into chunks and running multiple instances of the program in question. I don't personally consider this "parallel" computing, but others do. Many of the solutions you'll find for R are handy convenience functions for starting up new R processes and then collecting the results of their runs.

True parallel computing revolves around the ability to quickly exchange data between different parallel processes. This is necessary when one process needs results computed in another process. Most of the time it requires specialized libraries or computing models, and it is not something I would recommend undertaking as a side project.

There are also libraries written to take advantage of multiple cores; this is called implicit parallelism. In this case, while the original program may be single-threaded, some internal functions are able to run over multiple cores.

Your primary course of action is to identify whether the problem you wish to parallelize can be partitioned just by its input data, and/or whether the functionality you need is available via implicit parallelism. If so, you have many straightforward solutions; if not, the solution will be a lot more complicated.
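
As a rough sketch of the data-partitioning case, with a toy scoring step standing in for real work (in practice each chunk would go to its own R process or array-job task):

    # split an embarrassingly parallel problem by its input data
    genes  <- paste("gene", 1:10000, sep = "")                # stand-in input
    chunks <- split(genes, cut(seq_along(genes), 4, labels = FALSE))

    # no chunk needs anything from the others, so each call below could just as
    # well be a separate R process on a separate node
    per_chunk <- lapply(chunks, function(chunk) nchar(chunk))

    results <- unlist(per_chunk, use.names = FALSE)           # collate the pieces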


Thanks for that. I am currently solving the problem by the former, i.e. splitting, batching, then collating and performing summary analyses. In the mid-to-long term, however, I need the latter. There are already some emerging, truly parallelized functions that I can take advantage of that will deliver significant speed-ups. My question is specifically trying to ascertain what, if anything, people are using in terms of packages and R with HPC resources. There are a few suggestions here that I am following up. Even simple functions in parallel form, such as 'sort', can be extremely useful.


Yes indeed, adding parallelism to the bottlenecks will give you the maximum benefit.

14.7 years ago
Bertrand ▴ 30

If you know a little Python or are interested in learning it, you can use the mpi4py and rpy packages. The first provides simple access to the MPI library for parallel computing, and the second allows you to use R from within your Python program. With both you can do a lot ...

14.6 years ago
Zach Stednick ▴ 660

I have mostly been using the snow package for running jobs in parallel. It's lightweight, easy to set up quickly (and to teach to other people), and able to handle a variety of analysis jobs.
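
For anyone who hasn't tried it, a minimal snow sketch looks roughly like this (a socket cluster on one machine; the worker count and toy function are placeholders):

    library(snow)

    cl <- makeCluster(4, type = "SOCK")               # start 4 worker R processes

    # parallel drop-in for lapply: elements are farmed out across the workers
    res <- parLapply(cl, 1:100, function(i) sqrt(i))

    stopCluster(cl)                                   # always shut the workers down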

14.6 years ago
Cboettig ▴ 80

After some experimenting, I've found the snowfall package the fastest and most effective way to implement parallel processing jobs through R. It provides an easier-to-use interface to the popular snow package. The R-sig-hpc mailing list for high-performance computing is another great resource for troubleshooting parallel computing applications in R. I typically write the most computationally intensive steps in C and then write an R wrapper, to give me the best of both worlds: compiled speed and the interactive environment of a scripting language. I will often handle the parallelization at the R level though, as it is easier and more flexible than parallelizing the C directly (via OpenMP in my case).
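
A minimal snowfall sketch for flavour (the CPU count and the slow_fit function are placeholders):

    library(snowfall)

    sfInit(parallel = TRUE, cpus = 4)                  # spawn the workers (wraps snow)

    slow_fit <- function(i) { Sys.sleep(0.1); i^2 }    # stand-in for a heavy step
    sfExport("slow_fit")                               # make it visible on the workers

    res <- sfLapply(1:40, slow_fit)                    # parallel replacement for lapply

    sfStop()                                           # shut the cluster down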

14.7 years ago

I'm dealing with some of the same problems right now: trying to adapt an R package to multi-core machines. I've had some luck using the multicore/doMC and foreach packages. They essentially take a for loop and parcel out the iterations to multiple cores. This is essentially splitting the input data rather than the more implicit parallelism, but it seems to work fairly well. This approach doesn't solve the problem of splitting jobs among multiple cluster nodes, though.
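
Roughly, the pattern looks like this (the core count and loop body are placeholders):

    library(doMC)
    library(foreach)

    registerDoMC(cores = 4)                    # let %dopar% use 4 cores via multicore

    # each iteration runs in a forked worker; .combine stitches the results back together
    res <- foreach(i = 1:100, .combine = c) %dopar% {
        sqrt(i)
    }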

I also looked at R/parallel, but had major problems getting it to work: lots of cryptic error messages and failures in simple cases that looked just like the vignettes. I can't recommend it.


Thanks Chris, that's extremely helpful; I'll have a look at multicore/doMC. Currently I am using one script to generate a batch array and another to pick up the pieces afterwards!

14.6 years ago
Stew ★ 1.4k

I tried multicore today and was up and running on multiple processors in minutes. I think it is best suited to utilizing unused cores on a desktop machine when working locally, though I am now also using it on an LSF-managed compute cluster by requesting multiple cores for my R jobs (with the -n option).

It is almost as simple as doing a find-and-replace of lapply with mclapply, almost. That is, if your code uses lapply, of course. multicore also has parallel and collect functions, which make it easy to split arbitrary function calls across cores, not just via lapply.
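
Roughly, the two usage patterns mentioned above look like this (the toy function and core count are placeholders):

    library(multicore)

    f <- function(i) { Sys.sleep(0.1); i^2 }        # stand-in for a slow step

    # drop-in replacement for lapply(1:40, f)
    res <- mclapply(1:40, f, mc.cores = 4)

    # lower-level interface: fork jobs explicitly, then gather the results
    jobs <- lapply(1:4, function(k) parallel(f(k)))
    vals <- collect(jobs)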


Good to hear your experiences with multicore, Stew, thanks.
