Biojava, Reason Indexes Are Inclusive One Based?
12.4 years ago
Nickengland ▴ 130

In BioJava (and, I assume, other Bio* projects), indexes for sequence positions in the standard API are 1-based and inclusive, rather than the more common 0-based with an exclusive upper bound. This means that if you are extracting a subsequence of, say, the leading bases that match something else, you have to be careful: you can't call .subSequence(0,-1), which throws an exception rather than returning an empty sequence (as .substring(0,0) would).

Is there a particular reason for this design choice? It seems confusing, as it is at odds with the indexing conventions of most programming languages.
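To make the problem in the question concrete, here is a minimal sketch using plain Java strings. It does not call BioJava: `subSeqOneBased` is a hypothetical stand-in for a 1-based inclusive subsequence method, and its range check is an assumption modelled on the behaviour described above, not the library's actual code.

```java
final class OneBasedDemo {
    // Hypothetical 1-based inclusive subsequence (NOT BioJava's API).
    // Assumed to reject empty or inverted ranges, as described in the question.
    static String subSeqOneBased(String seq, int start, int end) {
        if (start < 1 || end < start || end > seq.length()) {
            throw new IllegalArgumentException(
                "bad 1-based range: " + start + ".." + end);
        }
        // 1-based inclusive [start, end] maps to 0-based half-open [start-1, end)
        return seq.substring(start - 1, end);
    }

    // Number of leading characters on which a and b agree.
    static int leadingMatches(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length() && a.charAt(n) == b.charAt(n)) {
            n++;
        }
        return n;
    }

    // With 1-based inclusive coordinates, the "no match" case needs a guard,
    // because the range 1..0 is rejected by subSeqOneBased.
    static String matchingPrefix(String a, String b) {
        int n = leadingMatches(a, b);
        return n == 0 ? "" : subSeqOneBased(a, 1, n);
    }
}
```

With 0-based half-open coordinates the guard disappears: `a.substring(0, n)` returns an empty string when `n` is 0.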

biojava • 2.2k views

I don't think there is any deep reason behind the choice of 1-based indexing. Perhaps it is because positions in gene sequences are conventionally numbered 1-based and inclusive, and the developers wanted the API to reflect that.


Really? Biopython sequences are definitely zero-based, NOT one-based. This is to follow Python conventions.

12.4 years ago

The choice of indexing is the source of quite a bit of contention. From a programming perspective, 0-based indexing with an open (exclusive) end is far preferable. From a user-interface perspective, and when communicating with life scientists, 1-based inclusive indexing is absolutely required.

Being off by one is an endemic problem in bioinformatics and has probably already caused tens if not hundreds of millions of dollars in wasted resources (incorrect results, etc.).

Historically, numerically oriented languages such as Fortran and R are one-based, while general-purpose languages such as C, Java, Perl, and Python are zero-based. Personally, I believe that Bio* projects should use one-based indexing as it is the lesser of two evils; it seems the BioJava developers think the same way.


"I believe that Bio* projects should use one based indexing as it is the lesser of two evils"

Really? I'd say that since the Bio* projects are for developers, 0-based indexing makes more sense. Then it's just a matter of converting to 1-based when showing coordinates to end users.
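As a rough illustration of that conversion, the sketch below keeps coordinates 0-based and half-open internally and switches to 1-based inclusive only at the display boundary. The class and method names are made up for this example and the `start..end` display format is an assumption; none of it comes from a Bio* library.

```java
final class CoordConvert {
    // 0-based half-open [start0, endExclusive) -> 1-based inclusive text for display.
    // Only the start shifts by one; the exclusive end equals the inclusive 1-based end.
    static String toDisplay(int start0, int endExclusive) {
        return (start0 + 1) + ".." + endExclusive;
    }

    // 1-based inclusive [start1, end1] -> 0-based half-open {start0, endExclusive}.
    static int[] fromOneBased(int start1, int end1) {
        return new int[] { start1 - 1, end1 };
    }

    // Length in each convention: the "+ 1" is where off-by-one errors creep in.
    static int lengthHalfOpen(int start0, int endExclusive) {
        return endExclusive - start0;
    }

    static int lengthOneBased(int start1, int end1) {
        return end1 - start1 + 1;
    }
}
```

For example, the first four bases are `[0, 4)` internally and `1..4` on screen, and both conventions report a length of 4.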


I'd say they should follow the string convention of the language concerned. In the case of Biopython, sequences and feature coordinates on them are zero-based as in Python strings/lists/arrays.


That makes sense.

It is just that training life scientists to use programming concepts, while littering the way with time bombs like this, makes the task a lot harder than it needs to be. Having a zero-based system is like starting with a wall instead of a mellow slope. I occasionally get tripped up when switching languages, and it is one of the most devious kinds of error because it often remains hidden.
