Do You Trust Bio-xxx Projects?
9
15
14.0 years ago
toni ★ 2.2k

Hi all,

This is a simple, general question, something of a survey. At the start of a bioinformatics development project, we all need to choose a language first. If modules already exist (from the BioJava, BioPython, BioPerl projects...), it is good to know that we are not reinventing the wheel and that our own code can rely on them.

I don't want to diminish the huge amount of work that has been done over many years, but I always have a weird feeling before going further with those modules. It's like: "Do I write my scripts from scratch (and maybe lose a lot of time) OR do I take the risk of facing hundreds of warnings or bugs (and get the impression of being lost in BioPerl stuff instead of really developing)?"

So, two brief questions:

  1. Do you trust Bio-xxx projects? What is your approach in general?
  2. Consider BioJava, BioPython and BioPerl: are they equally advanced, or does BioPerl (or another) have far more capabilities at present (a bigger community behind it)?

Thanks for your advice.

T.

biopython bioperl biojava subjective • 11k views
35
14.0 years ago

I view this as a tradeoff. I would not say that I do not trust the Bio.* projects. In my opinion there are several factors that should be considered before uncritically using these modules in your code:

  1. How much work can the modules save me? If most of what I want to do can be done by a few calls to the right functions, it is a no brainer that one should do so. However, if Bio.* can parse a FASTA file for me, but I need to implement everything else myself, I might as well write a FASTA parser myself and avoid the external dependencies.

  2. Is this a run-and-forget script or do I expect to reuse it? If I just need to run the script once, I would happily use any module that can save me a few lines of code. However, if I am writing software that I expect to reuse for a long time, external dependencies are more of a liability, since they may be subject to changes that could break my code in the future. Yes, you can call me paranoid if you like ;-)

  3. Do I plan to distribute the software to others? Again, the more people I plan to distribute my software to, the more of a problem external modules are. Merely having to install a big Bio.* project to run my code may cause me to lose users. The users are likely to have various outdated versions of Bio.* installed, some of which may have bugs in functions that I rely on, making testing much more of a burden than if I had no external dependencies.

  4. How crucial is speed? With all due respect to the programmers, speed does not appear to be their primary focus. If I am coding something where speed matters, I have found that I rarely can rely on such big projects. Often the higher level of object oriented abstraction has a price in terms of execution speed, and I thus do not rely on Bio.* projects if speed is of the essence.
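
To give a sense of scale for point 1, here is a minimal sketch of a dependency-free FASTA reader in Python. It is an illustration only, not the code of any Bio.* project: the function name `read_fasta` is hypothetical, and it assumes plain, well-formed FASTA (a `>` header line followed by one or more sequence lines) without the edge cases a community library would handle.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the last record in the input.
    if header is not None:
        yield header, "".join(chunks)


# Works on any iterable of lines, e.g. an open file handle.
example = [">seq1 demo\n", "ACGT\n", "ACGT\n", "\n", ">seq2\n", "TTGA\n"]
records = list(read_fasta(example))
```

Because it is a generator, only one record is held in memory at a time. Whether twenty lines like these are preferable to an external dependency is exactly the tradeoff described in the list above.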

As you can see, I do not distrust the Bio.* projects per se. I am just hesitant to believe that they can eliminate most of my code, that the functions are stable over time, that they are bug-free in all releases, and that they are fast enough for all purposes.

Regarding the different projects, I have mostly positive experiences with BioPerl, a few unpleasant experiences with BioPython (especially regarding speed, or lack thereof), and no experience with BioJava or BioRuby.

0

+1 for mentioning speed.

0

+1 very good answer, you read my mind:-)

0

I totally agree with you, and I would add this: BioPerl seems to be widely used and maintained by the community... but what about BioC++? The latest sources I found were last updated in 2006! How can I rely on such a library?

0

+1 for "speed" from my side too.

0

+1 for Speed AND I raise you Memory. Reading EVERY homology search hit into an object is an effective way of pulling down a machine.
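
One way to picture the memory point is a streaming reader that keeps only what is needed instead of one object per hit. The sketch below is a hedged illustration in Python, not taken from any Bio* library. It assumes BLAST tabular output (the 12-column `-outfmt 6` layout: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the names `Hit`, `iter_hits` and `best_hit_per_query` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """A small, illustrative subset of the tabular columns."""
    query: str
    subject: str
    identity: float
    evalue: float
    bitscore: float


def iter_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Lazily yield one Hit per tab-separated line (assumed 12 columns)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        f = line.rstrip("\n").split("\t")
        yield Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11]))


def best_hit_per_query(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep only the highest-bitscore hit per query while streaming."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


sample = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n",
    "q1\ts2\t80.0\t150\t30\t0\t1\t150\t10\t159\t1e-30\t120\n",
    "q2\ts3\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t180\n",
]
best = best_hit_per_query(iter_hits(sample))
```

Memory use here grows with the number of queries rather than the number of hits, since each line is discarded once it has been compared.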

0

Good point regarding memory - I should have mentioned that too :-)

9
14.0 years ago
lh3 33k

I trust the correctness of the implementation, but I do not always trust the coding quality. I almost never use Bio* projects nowadays, but when I did use them a bit, I tended to read the source code first. If a module is written by a good programmer like Lincoln, it is usually of high quality, clear and efficient, and I would use it. However, sometimes I run into messy code that is overcomplicated and inefficient. In that case, I would rather use my own. IMHO, BioPerl focuses too much on features and not enough on efficiency, which I always care about. The same might be true of many collaborative open source projects.

EDIT: while I was writing this post, a senior scientist in my room was telling a young postdoc: "you should build your own library now". I totally agree: if writing programs is a significant part of your daily work, you should reinvent the wheel.

EDIT2: I see the discussion above about "reinventing the wheel", so I want to comment here, though it is a little off-topic. Firstly, by reimplementation we usually mean reimplementing simple things within a couple of hundred lines of code, not something worth a publication. Writing a simple module only takes a small amount of time for one project. Secondly, we frequently use "wasting time" as an excuse for not implementing something ourselves, even things as simple as a fasta/q parser, but this is not necessarily true. We may improve our programming skills and save time in the long run. Libraries are like knowledge, which is relatively easy to pick up quickly, but skills usually take much longer to build. Thirdly, when we do it ourselves, we become clearer about potential pitfalls, which helps research.

7
14.0 years ago

I would claim the majority (maybe 75%?) of BioPerl and the other Bio* projects are parsers for the output of compiled executables (binary programs). The other major part is componentry designed as part of some larger project like gbrowse. There are actually very few Bio* modules that implement algorithms, for example.

http://www.bioperl.org/w/index.php?title=Category:Modules

The word "trust" in those parsing cases implies, for example, "do I trust BioPerl to parse a Blast file correctly?" In most cases I would trust it better than I would trust myself to write a Blast parser simply because more weird boundary conditions are encountered by the user community than I would encounter in testing. Part of the problem is people writing programs whose output is not easily parsed, but that is another issue.

The backlash against Bio* is that the framework can often be more complex than the actual data being parsed, but using the framework pays dividends if, for example, you switch aligners.
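
The "switch aligners" benefit can be sketched as a thin common record type that every parser produces, so that downstream code never sees the tool-specific format. This is a toy Python illustration of the idea, not how BioPerl is implemented: both input formats (`toy_tab`, `toy_kv`) and all names below are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple


class Alignment(NamedTuple):
    """Common record that all parsers emit (hypothetical)."""
    query: str
    subject: str
    score: float


def parse_toy_tab(lines: Iterable[str]) -> Iterator[Alignment]:
    """Parse a made-up format: query<TAB>subject<TAB>score."""
    for line in lines:
        if line.strip():
            query, subject, score = line.rstrip("\n").split("\t")
            yield Alignment(query, subject, float(score))


def parse_toy_kv(lines: Iterable[str]) -> Iterator[Alignment]:
    """Parse a made-up format: query=... subject=... score=..."""
    for line in lines:
        if line.strip():
            fields = dict(item.split("=", 1) for item in line.split())
            yield Alignment(fields["query"], fields["subject"],
                            float(fields["score"]))


# Registry mapping a format name to its parser.
PARSERS: Dict[str, Callable[[Iterable[str]], Iterator[Alignment]]] = {
    "toy_tab": parse_toy_tab,
    "toy_kv": parse_toy_kv,
}


def subjects_above(fmt: str, lines: Iterable[str], min_score: float) -> List[str]:
    """Downstream analysis: depends only on Alignment, not on the format."""
    return [a.subject for a in PARSERS[fmt](lines) if a.score >= min_score]


tab_output = ["q1\ts1\t50\n", "q1\ts2\t10\n"]
kv_output = ["query=q1 subject=s1 score=50\n", "query=q1 subject=s2 score=10\n"]
from_tab = subjects_above("toy_tab", tab_output, 20)
from_kv = subjects_above("toy_kv", kv_output, 20)
```

Switching tools then means changing one format name. The cost is the extra layer, which is the complexity the backlash is about.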

6
14.0 years ago

'Bioinformatics' is a very general term that covers a wide range of different fields; it is therefore very difficult to develop a single library that meets the needs of all bioinformaticians at once, and this is what the Bio* projects currently fail to see.

All the Bio* projects, like BioPerl and BioPython, are organized as a single module with a wide range of different functions, which you have to install as a block in order to use. For example, if you need the PhyloXML parser from BioPython, you also need to install the BioPython libraries for parsing Fasta files, querying the NCBI databases, etc.

This is a very different situation compared to Bioconductor, where you have many independent libraries, including some redundant and some abandoned, but where you can install only the library you need, and where you can choose between different solutions to the same problem. From this point of view, I don't trust them when I think that they have not found a solution to this matter yet.

From another point of view, I like the Bio* projects because I recognize the effort they have made to bring the bioinformatics community closer together. Thanks to BioPython, I learned that rather than developing my own custom Fasta parser I should use a library that is tested carefully before being released and that is used by other people, making my results more comparable. For that reason, I am a great supporter of the Bio* projects. Moreover, these projects are also a place to meet other programmers working in the same field and to discuss the best implementation of each algorithm, which would not be so easy without them.

In any case, regarding your question, my answer is: I will trust your results more if you use the Bio* libraries than if you develop the library yourself. The Bio* libraries are tested very carefully before being released, and are used by many people in different environments and on different projects. Moreover, if an error is found in a Bio* library, it will be easier to find out which experiments used the buggy version and to repeat the analysis with the corrected implementation. So, if I were the reviewer of your paper, I would trust the Bio* libraries more than your custom implementation.

1

Good point about trusting the results from those who use Bio* results over those using legacy scripts. I've used BioPerl and BioRuby and both had steadfast requirements of having unit testing so the results generated would be the same across many machines. One of the nice things I've seen in BioRuby is the ability to install single modules and their dependencies (i.e. only sge or only blast) keeping your overall codebase a little more lean. I'm sure the same could be/is done in Perl/Python.

5
14.0 years ago

Sure they are trustworthy! Not reinventing the wheel should be the mantra of any programmer, and of bioinformaticians in particular. I tend to forget about these great resources when writing minor programs, but when it comes to communicating with databases, parsing data or integrating different bioinformatics modules they are definitely of great help.

I can only talk about BioPerl, since I haven't used any of the others (maybe because when I started doing bioinformatics back in 2001 Perl was the language favoured by all major bioinformatics projects, and also because it was the best way - almost the only one - of dealing transparently with EBI data), and all I can say is positive. Sure, you are relying on others' work, which may not be perfect, but the fact that a large community is using (and curating) it makes me very comfortable relying on it.

From my point of view, your question should be "why wouldn't you trust Bio-XXX projects?", since there are more pros than cons to using these resources. Of course, this always depends on the size and aim of your project, as mentioned above.

3

"not reinventing the wheel should be the mantra of any programmer". I don't agree :-) I DO like reinventing the wheel! It makes me think about how an algorithm should be implemented, and I come to appreciate how my initial design was wrong and how their model was right :-)

3

@Pierre: it is fine to improve our programming skills, but only up to a certain point. Imagine if everybody were to rewrite every single parser and every library for each different project: it would be a real nightmare to compare the results from different publications. Moreover, you should consider that the main job of a bioinformatician is not to become a good programmer, but rather to do research: if you are paid by a public fellowship, you can't afford to spend months of your time reimplementing something that already exists; it wouldn't be fair to those who pay your salary.

0

Well, your point is also clear. You shouldn't use a pre-built module right away if you want to learn how to do it yourself. And of course you should evaluate whether a given module does what it is supposed to do, the way it should be done. Sure, someone may come up with a better way of communicating with NCBI data if he tries hard, but I'm sure that the majority of people wanting to communicate with NCBI would prefer to trust the comm API and focus mainly on the data itself. But I have to agree with you: once I had the skills, I always tended to build things from scratch for a better understanding.

0

We always use lack of time as an excuse for reusing other libraries, but the fact is that reimplementing the part we need does not take that much time. Most people only use a tiny bit of BioPerl at any one time. Reimplementing that bit would not take long. The more we practice, the more proficient we become and the less time we will spend on all future programs. Programming skills are like abilities, while libraries are like knowledge. It is not hard to pick up knowledge, but it takes much longer to build up abilities.

0

On the other hand, I buy the argument that many people who have a great scientific mind may not be born to be a programmer -- people are different. Bio* projects are a great gift to them.

0

@Giovanni: good point. Of course, I won't re-invent blast or any other complex algorithm (hum.. in fact, I tried...). But I will think twice before using an external library (dependency++) to invoke a web service when I can generate the client code on the fly with wsimport, etc. Furthermore, I believe that we, as developers/scientists, have to spend some time acquiring new technical/scientific knowledge.

5
14.0 years ago

I definitely trust Bio* projects. If you are considering building your own software from scratch rather than relying on a community effort such as BioPerl, you must consider how you as a group will match up to the community in terms of

  • creating re-usable modules/classes
  • creating helpful documentation (this is critical even if you will only be using the code in-house)
  • thorough testing

In each of these ways, it's hard to imagine how a small group effort will do a better job than the community effort--especially with the last bullet point. If you have a large community using the same code, it's much more likely that problems will be identified and fixed quickly.

In addition to the projects you listed, there is one more you may want to consider. GenomeTools is very clean and efficient, and provides developer tools for writing software in C, Python, Ruby, and Lua. Coincidentally, they are a small research group from the Center for Bioinformatics at the University of Hamburg, but they do take a community-based approach to software development.

4
14.0 years ago
Nico Adams ▴ 460

The only other thing I would like to add to the above comments about speed and re-use is simply that the code needs to be open. Scientists should always be sceptical, never trust anything, and verify where possible. With large Bio* projects you can often have at least some confidence, which comes from the community of users that exists around the code and that, through continued and active use, performs a kind of testing and validation. But you always want to be in a position to ascertain why code behaves in a certain way. It may be obvious, but there is a lot of commercial software out there where you cannot do that and have to trust the vendor blindly. While often convenient, this is, of course, ultimately unscientific. I clearly remember some chemoinformatics code, sold by a large commercial vendor, which had a bug in one of the algorithms used to compute a particular descriptor, and which wasn't fixed for almost 10 years. The community knew about it, but couldn't do anything about it. And it only came to light when someone compared the results of several implementations of the algorithm.

I think open is a key point for developing trust.

2
14.0 years ago
Neilfws 49k

Some excellent answers have addressed the pros/cons of using Bio* libraries and the issue of trust.

In terms of how "advanced" the projects are: Bioperl is by far the largest (in terms of both modules and community). This is in part due to its age; it's been around since the early 1990s, when Perl was considered "the language" for bioinformatics and so gathered a large, early following.

BioPython and BioRuby are probably similar in terms of community and code base size - BioRuby may be gaining the edge more recently. They each have issues. BioPython has struggled to create a cohesive community; it seems to me that Python programmers prefer to "roll their own". BioRuby has suffered from poor documentation and a general lack of Ruby awareness (it was barely known outside of Japan for many years, until Rails came on the scene).

Other Bio* projects are much smaller and it seems to me, not widely used.

1

Bio* projects are those under the umbrella of the Open Bioinformatics Foundation - http://www.open-bio.org/wiki/Projects - of which Bioconductor is not one.

0

Well, I'm not sure whether Bioconductor counts as a Bio* project. If so, it is probably the best organized and documented of them all.

1
14.0 years ago

Haven't you been reading my blog?? (Well, no worries, millions of people haven't either ;) The proper question here is, "Where are the BioXXX validation reports?"

0

I think the bioruby codebase is quite extensively tested (with unit tests).
