Thanks everyone. The Docker suggestion is excellent; let me look into it.
Regarding the benefit of providing the code and making it accessible in one place:
(i) many of these programs are not available on GitHub;
(ii) I like to mix and match some of these code bases, or develop them further. Currently the only way to do that is to start adding to one of the existing code bases, or to start yet another (N-th) assembler and write a paper claiming that it is better than all the others. I do not find either of those options very productive.
Docker is indeed useful for integrated systems (e.g. the old Apache+MySQL+PHP combo), for complex pipelines, or for Perl/Python scripts with many dependencies. However, for small C/C++ projects, Docker is overkill. Most of the tools in the list can be compiled into a few standalone executables, and in that case distributing statically linked precompiled binaries is the easiest option for end users; Docker adds little. On the contrary, because many "managed" computing clusters (in fact all three clusters I have access to) don't run a Linux kernel recent enough for Docker, using Docker actually hurts usability.
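To make "statically linked" concrete, here is a minimal sketch for a hypothetical single-file tool (the file name and flags are illustrative assumptions; real projects will differ, and some libraries simply refuse to be linked statically):

    // tinytool.cpp: a hypothetical single-file tool, used only to make the point concrete.
    //
    // A fully static Linux build might look like this (flags are an assumption; adjust per project):
    //
    //   g++ -O2 -static -o tinytool tinytool.cpp
    //
    // "ldd tinytool" should then report "not a dynamic executable", and the binary can be
    // copied to another Linux machine of the same architecture and run without compiling
    // anything. If a full -static build is not possible, -static-libgcc -static-libstdc++
    // at least removes the compiler runtime as a dynamic dependency.
    #include <cstdio>

    int main() {
        std::puts("tinytool: no shared libraries needed at run time");
        return 0;
    }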
I sometimes put precompiled binaries here for personal use. I believe a repository of precompiled biotools would benefit many; I don't have the time to commit to that, though.
written 9.8 years ago by lh3
I have been through the code of most of the mentioned programs. Some are indeed quite small (both source code and executables), while others are unnecessarily big. On top of that, some of the big ones use Boost in many different ways, which makes my life very difficult when I have to collect and install them individually from various sources. Also, in some cases, the developers stopped working on the code altogether after publishing it. I will post the binaries too, as you suggested.
If the binary is not dynamically linked against the Boost libraries, end users would not even know whether the program uses Boost. The binary is still easy to use; that is the merit of distributing precompiled binaries. IMHO, any C/C++ program that requires Boost, CMake, C++11 or other non-standard dependencies should offer binaries.
Yes, statically linked binaries are ideal. One issue I've run into, though, is that this is not always possible. For example, Sailfish and Salmon both make use of Intel's excellent Threading Building Blocks (TBB) library. It's a fairly mature library in which Intel provides high-quality, efficient implementations of many parallel and concurrent data structures and algorithms (and some of those wheels are painful to re-invent). However, TBB cannot (at least currently) be statically linked. I know because it was the last remaining non-static library I linked against, and I tried for quite a while to figure out how to build a static version before reading that this is currently not possible / not supported.
Another issue is that while most binaries can be statically linked on Linux, that seems to have been made intentionally difficult (and in some cases not possible) on OSX (then again, building on OSX can be a whole other world of pain for many different reasons).
edit: I'm only arguing here about the statically linked part --- I completely agree that "big" programs (and even easy-to-compile ones) should offer binaries for ease of use.
written 9.8 years ago by Rob
You are right: some libraries disallow static linking, and Intel's Cilk is another example. Solaris also ships only libc.so, as I recall. A binary statically linked by gcc may not work on a different Linux, either. The argument is that a library sometimes depends on the kernel, or on components that are not or cannot be statically linked, so static linking can cause some weird bugs; this is more likely to happen with low-level system libraries.
Rob, there needs to be some balance between elegance of code and ease of use. It is often hard to get biologists to learn and use new programs; if you add compilation difficulties or other hurdles, even that small pool of early users is gone.
I would recommend writing Sailfish without TBB, if possible. The algorithm is fairly straightforward (use a perfect hash to store k-mers, then do counting and expectation-maximization). Could the code use simpler libraries so that it can be compiled as easily as BWA, HMMER or DALIGNER?
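To sketch what I mean by "counting and expectation-maximization" (a toy example with made-up names and numbers, not Sailfish's actual model or code; it simply splits each shared k-mer count among the transcripts that contain it):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Toy EM for transcript abundance from k-mer counts (illustrative only).
    //   kmer_count[i]  : how many times k-mer i was seen in the reads
    //   kmer_owners[i] : indices of the transcripts containing k-mer i
    //   tx_len[t]      : (effective) length of transcript t
    std::vector<double> em_abundance(const std::vector<double>& kmer_count,
                                     const std::vector<std::vector<int>>& kmer_owners,
                                     const std::vector<double>& tx_len,
                                     int n_iter = 100) {
        const std::size_t T = tx_len.size();
        std::vector<double> theta(T, 1.0 / T);               // current abundance estimates
        for (int it = 0; it < n_iter; ++it) {
            std::vector<double> alloc(T, 0.0);               // expected counts per transcript
            for (std::size_t i = 0; i < kmer_count.size(); ++i) {
                double denom = 0.0;
                for (int t : kmer_owners[i]) denom += theta[t];
                if (denom <= 0.0) continue;
                for (int t : kmer_owners[i])                 // E-step: split the count by current theta
                    alloc[t] += kmer_count[i] * theta[t] / denom;
            }
            double norm = 0.0;
            for (std::size_t t = 0; t < T; ++t) { theta[t] = alloc[t] / tx_len[t]; norm += theta[t]; }
            for (std::size_t t = 0; t < T; ++t) theta[t] /= norm;  // M-step: length-normalize, renormalize
        }
        return theta;
    }

    int main() {
        // Two transcripts sharing one k-mer, plus one unique k-mer each (hypothetical numbers).
        std::vector<double> counts = {10, 30, 60};
        std::vector<std::vector<int>> owners = {{0, 1}, {0}, {1}};
        std::vector<double> len = {1000, 1000};
        std::vector<double> theta = em_abundance(counts, owners, len);
        std::printf("theta = %.3f %.3f\n", theta[0], theta[1]);
        return 0;
    }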
I don't disagree here --- this is the reason, for example, that I distribute binaries for Sailfish and Salmon, and the reason I've gone to considerable effort to set up a system where every successful commit of Salmon (i.e. every commit that builds and tests successfully on Travis-CI) is automatically packaged into a binary and uploaded to GitHub. It's also the reason I'm working to remove the dependency on the Shark machine-learning library, whose inclusion has caused the most headaches in building the software.
However, there is a third criterion that is actually quite important and is neither elegance of coding nor ease of use: efficiency of implementation. What I mean is that it takes significant engineering effort to come up with, e.g., a concurrent map or a concurrent MPMC queue that approaches the efficiency of the ones provided by TBB. In such cases, relying on existing, well-engineered libraries both saves implementation time and yields potentially (much) more efficient programs.
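As a simplified illustration of leaning on TBB rather than re-inventing that machinery (a minimal sketch that assumes TBB is installed; the k-mer encoding is omitted and the inputs are made up):

    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>
    #include <tbb/concurrent_hash_map.h>

    // Several threads bump counts in a shared k-mer table. TBB's concurrent_hash_map
    // handles the fine-grained locking, so we don't have to engineer (and tune) it ourselves.
    using KmerTable = tbb::concurrent_hash_map<std::uint64_t, std::uint64_t>;

    void count_chunk(KmerTable& table, const std::vector<std::uint64_t>& kmers) {
        for (std::uint64_t k : kmers) {
            KmerTable::accessor a;     // holds a write lock on this key while 'a' is alive
            table.insert(a, k);        // inserts {k, 0} if the key is absent
            a->second += 1;
        }
    }

    int main() {
        // Pretend these are 2-bit-encoded k-mers from different read chunks.
        std::vector<std::vector<std::uint64_t>> chunks = {{1, 2, 3, 1}, {2, 2, 4}, {1, 4, 4, 4}};
        KmerTable table;
        std::vector<std::thread> workers;
        for (const auto& c : chunks)
            workers.emplace_back(count_chunk, std::ref(table), std::cref(c));
        for (auto& w : workers) w.join();
        for (const auto& kv : table)
            std::printf("kmer %llu -> %llu\n",
                        (unsigned long long)kv.first, (unsigned long long)kv.second);
        return 0;
    }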
All of that being said, I am strongly in favor of reducing those dependencies as much as possible. For example, I've found a similar, stand-alone MPMC queue that performs as well as (and sometimes better than) the one provided by TBB, and I'm actively working to reduce any reliance on "tricky" libraries. Somewhat interestingly, I think this will be easier to do (in the short term) for Salmon than for Sailfish, so I may separate them at some point to reduce Salmon's build complexity. Anyway, thanks for the advice --- by and large, I overwhelmingly agree.
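To give a flavor of such a drop-in (a sketch only; I'm using the single-header moodycamel::ConcurrentQueue here purely as an illustration of a stand-alone MPMC queue, not necessarily the exact one in question):

    #include <cstdio>
    #include <thread>
    #include <vector>
    #include "concurrentqueue.h"   // single-header MPMC queue; no separate library to link

    // Two producers push work items, two consumers drain them. Because the queue is a
    // single header, it doesn't get in the way of static linking the way a shared TBB
    // library does.
    int main() {
        moodycamel::ConcurrentQueue<int> queue;
        std::vector<std::thread> threads;

        for (int p = 0; p < 2; ++p)                 // producers
            threads.emplace_back([&queue, p] {
                for (int i = 0; i < 1000; ++i) queue.enqueue(p * 1000 + i);
            });

        std::vector<long long> consumed(2, 0);
        for (int c = 0; c < 2; ++c)                 // consumers (try_dequeue never blocks)
            threads.emplace_back([&queue, &consumed, c] {
                int item;
                for (int i = 0; i < 1000; ++i)
                    if (queue.try_dequeue(item)) consumed[c] += 1;
            });

        for (auto& t : threads) t.join();
        std::printf("consumed %lld items (the rest may still be queued)\n",
                    consumed[0] + consumed[1]);
        return 0;
    }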
written 9.8 years ago by Rob
Docker is a really good tool for making the process of building a given software environment reproducible. It lets you start from a known-good environment (the base image), so you don't have to deal with an unpredictable mix of operating system, installed packages, and software versions on every host where you want to install your tools. A few examples:
Arvados uses Docker to provide portable runtime environments. Bioinformaticians can build and test a Docker image containing the necessary tools on their workstation, then upload the Docker image to the Arvados environment to run analysis on a cluster. This allows users to use new tools (or different versions of tools or libraries) without having to ask (and wait for) the cluster administrators to install the necessary software.
The Common Workflow Language also supports Docker for specifying how to run tools at a higher level in a portable way.
bcbio-nextgen uses CloudBioLinux and Docker to build an entire environment for running genomics pipelines from scratch.
I would recommend checking out the Bio-Linux repository, as most of the tools you've listed (all?) are already packaged and maintained there (dependencies included).
You can install them individually via the PPA with apt-get, or install the whole OS, which is an Ubuntu system with the packages pre-installed.
written 9.8 years ago by Daniel
"most of those you've listed (all?) are already packaged"
I will go for the third possibility: 'very few' :) Bio-Linux seems to have avoided most assembly-related programs.
I think that including a curated list of programs, and possibly a script/recipe to download them, would be better than actually including the source code. Also, something along the lines of Docker, as Jeremy said, may make it easier for users to actually get a working install.
Some issues to think about are how to keep all the code up to date and how to avoid license conflicts (since you licensed this project). I'm not sure what advantages there are to providing all the code, when most of it is on GitHub already, compared to the possible issues. I'm not trying to be discouraging; these are just some things to consider that might simplify the process of maintaining and expanding the list.
written 9.8 years ago by SES
I imagine the benefit of providing the code directly (rather than using git submodules or just having the user download the code) is version control. Having said that, I too am on the "use Docker" bandwagon for this problem.
That is a good point; not all of the packages are on GitHub. While that would make things easier for users, my concern would be keeping all the code up to date. I don't think it would be trivial, because it would involve some combination of git/svn and manual downloads. What if the included version falls behind and people run into bugs that have already been fixed? That could annoy the developers. I could also imagine the opposite happening: people using this repo as a place to report issues and bugs with the code.
written 9.8 years ago by SES