Question

BioJava: Why Are Indexes One-Based and Inclusive?

0

13.0 years ago

Nickengland ▴ 130

In BioJava (and, I assume, other Bio* projects), indexes for sequence positions in the standard API are 1-based and inclusive, rather than the more common 0-based with an exclusive upper bound. This means that if you are extracting a subsequence up to a certain number of leading bases that match something else, say, you have to be careful: you can't call .subSequence(0,-1), which throws an exception rather than returning an empty sequence (as .substring(0,0) would).

Is there a particular reason for this design choice? It seems confusing, as it runs counter to the indexing conventions of most programming languages.
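
To make the difference concrete, here is a minimal sketch in plain Java. It does not call BioJava at all: `subSeqOneBased` is a hypothetical helper that only mimics 1-based inclusive semantics on top of `String.substring`, and the class name, method names, and choice of exception are assumptions made for illustration.

```java
final class PrefixDemo {

    // Hypothetical helper (not a BioJava method): 1-based, inclusive coordinates.
    // In this sketch an inclusive range always holds at least one base,
    // so there is no valid way to ask for an empty subsequence.
    static String subSeqOneBased(String seq, int start, int end) {
        if (start < 1 || end < start || end > seq.length()) {
            throw new IndexOutOfBoundsException(
                "invalid 1-based inclusive range: " + start + ".." + end);
        }
        // Convert to Java's 0-based, end-exclusive convention.
        return seq.substring(start - 1, end);
    }

    // Number of leading characters that a and b have in common.
    static int sharedPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // With 1-based inclusive coordinates the zero-length case needs a guard.
    static String leadingMatchOneBased(String seq, String other) {
        int n = sharedPrefixLength(seq, other);
        return n == 0 ? "" : subSeqOneBased(seq, 1, n);
    }

    public static void main(String[] args) {
        String seq = "ACGTAC";
        // 0-based half-open: a zero-length prefix is simply substring(0, 0).
        System.out.println("[" + seq.substring(0, sharedPrefixLength(seq, "TTT")) + "]");
        // 1-based inclusive: works, but only because of the explicit guard.
        System.out.println(leadingMatchOneBased(seq, "ACGGG"));
        System.out.println("[" + leadingMatchOneBased(seq, "TTT") + "]");
    }
}
```

With the half-open convention the length of the matching prefix can be passed straight through as the end index; with the inclusive convention the zero-length case has to be handled separately.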

biojava • 2.4k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 13.0 years ago by Nickengland ▴ 130

0

I don't think there really is a reason they chose 1-based indexing. Perhaps it is because gene sequence coordinates are conventionally 1-based and inclusive, and they wanted the API to reflect that.

ADD REPLY • link 13.0 years ago by Damian Kao 16k

0

Really? Biopython sequences are definitely zero-based, NOT one-based. This is to follow Python conventions.

ADD REPLY • link 10.5 years ago by Peter 6.0k

Ram · Answer 1 · 2012-07-26

1

13.0 years ago

Istvan Albert 102k

The choice of indexing is the source of quite a bit of contention. From a programming perspective, 0-based indexing with an open (exclusive) end is far preferable. From a user-interface perspective, and when communicating with life scientists, 1-based inclusive indexing is absolutely required.

Being off by one is an endemic problem in bioinformatics and has probably already caused tens if not hundreds of millions of dollars in wasted resources (incorrect results, etc.).

Historically, numerically oriented languages such as Fortran and R are one based. General-purpose programming languages such as C, Java, Perl, and Python are zero based. Personally, I believe that Bio* projects should use one based indexing as it is the lesser of two evils, and it seems the BioJava developers think the same way.

ADD COMMENT • link 13.0 years ago by Istvan Albert 102k

1

I believe that Bio* projects should use one based indexing as it is the lesser of two evils

Really? I'd say that since the bio* projects are for developers, 0-based indexing makes more sense. Then it's just a matter of converting to 1-based when showing to end-users.
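
A minimal sketch of that conversion in Java: coordinates are stored 0-based with an exclusive end and translated to 1-based inclusive only at the display boundary. The `Span` class and its method names are hypothetical, made up for illustration, and are not part of any Bio* library.

```java
final class Span {
    final int start; // 0-based, inclusive
    final int end;   // 0-based, exclusive

    Span(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad span: " + start + "," + end);
        }
        this.start = start;
        this.end = end;
    }

    // Build a span from user-facing 1-based inclusive coordinates.
    static Span fromOneBasedInclusive(int first, int last) {
        // Only the start shifts; the inclusive end equals the exclusive end.
        return new Span(first - 1, last);
    }

    int length() {
        return end - start; // no "+ 1" needed with a half-open span
    }

    // Render for end users as 1-based inclusive, e.g. "3..5".
    String toDisplay() {
        if (length() == 0) {
            return "(empty)";
        }
        return (start + 1) + ".." + end;
    }

    public static void main(String[] args) {
        Span s = Span.fromOneBasedInclusive(3, 5);
        System.out.println(s.start + "," + s.end + " length " + s.length());
        System.out.println(s.toDisplay());
        // The internal form plugs directly into String.substring.
        System.out.println("ACGTACGT".substring(s.start, s.end));
    }
}
```

Keeping the translation in two small methods means the off-by-one arithmetic lives in one place instead of being repeated at every call site.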

ADD REPLY • link 13.0 years ago by brentp 24k

0

I'd say they should follow the string convention of the language concerned. In the case of Biopython, sequences and feature coordinates on them are zero-based as in Python strings/lists/arrays.

ADD REPLY • link 10.5 years ago by Peter 6.0k

0

That makes sense.

It is just that training life scientists to use programming concepts, while littering the way with time bombs like this, makes the task a lot more difficult than it needs to be. Having a zero based system is like starting with a wall instead of a mellow slope. I occasionally get tripped up when switching languages, and it is one of the most devious kinds of error, as it often remains hidden.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Istvan Albert 102k