What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems
13.7 years ago

One of the most common gotchas I encounter introducing students to bioinformatics is the off-by-one coordinate shift problem(s) that arise when switching between one-based (e.g. BLAST) and zero-based (e.g. UCSC) genome coordinate systems.

I have yet to find a clear exposition of the differences between these two major coordinate systems (and their minor variants), and have tried to discuss the differences in a past blog post, but I don't feel confident I've covered all the bases on this issue.

The fact that this is not an obvious problem to all has come up in a recent BioStar post and its comments, and I was hoping that we could use this forum to discuss the relative merits of both systems.

coordinates genome • 25k views

https://twitter.com/#!/dasmoth/status/42189749825449985

"If it doesn't have off-by-one errors, it isn't bioinformatics."

13.7 years ago

0-based, half-open systems allow cheap length calculations: for an interval that starts at n and ends at m, the length is m-n, instead of (m-n)+1 in a 1-based, closed system. Also, 0-based is convenient for programming, since most widely used programming languages use 0-based arrays. Another example is calculating overlap. To calculate the degree of overlap between two 0-based, half-open intervals, you can use the following:

a = [start1, end1)
b = [start2, end2)
overlap(a,b) = min(end1,end2) - max(start1,start2)

whereas with a one-based system it is:

a = [start1, end1]
b = [start2, end2]
overlap(a,b) = min(end1,end2) - max(start1,start2) + 1

The beauty of the above approach with 0-based is that if two intervals do not overlap, the recipe returns a negative value whose absolute value is the distance between the two features (and zero if the two features abut).
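For anyone who wants to try the two recipes side by side, here is a minimal Python sketch. The function names are my own and only for illustration; they do not come from any particular library.

```python
def length_half_open(start, end):
    """Length of a 0-based, half-open interval [start, end)."""
    return end - start


def length_closed(start, end):
    """Length of a 1-based, closed interval [start, end]."""
    return end - start + 1


def overlap_half_open(start1, end1, start2, end2):
    """Overlap of two 0-based, half-open intervals.

    Positive: number of shared bases. Zero: the intervals abut.
    Negative: minus the number of bases separating them.
    """
    return min(end1, end2) - max(start1, start2)


def overlap_closed(start1, end1, start2, end2):
    """The same quantity for two 1-based, closed intervals (note the +1)."""
    return min(end1, end2) - max(start1, start2) + 1


# The same two features in both conventions: the 1st-5th bases and the
# 4th-10th bases share two bases (the 4th and 5th).
print(overlap_half_open(0, 5, 3, 10))  # 2
print(overlap_closed(1, 5, 4, 10))     # 2

# The 1st-5th bases and the 9th-10th bases are separated by three bases.
print(overlap_half_open(0, 5, 8, 10))  # -3
```

The half-open versions need no +1 or -1 corrections at all, which is the point being made above.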

So, for programming, I much prefer 0-based, as it prevents tons of extra (ugly and more expensive) -1 and +1 operations in one's code.

The counterargument is that our brains are trained to think in 1-based, closed systems. I suspect the designers of various formats such as BED (0-based), BAM (0-based), VCF (1-based), and GFF (1-based) made conscious decisions regarding the coordinate system based on the intent of the format. For example, BED is a fundamental format in the UCSC browser and much of the underlying code depends on it. Thus, the coordinate system is 0-based for speed and code cleanliness. Similarly, BAM requires efficiency. In contrast, perhaps the designers of VCF and GFF were more concerned with "readability" of the format?

And in a zero-based system, the start/end/length calculations still work for sequence features that pass across the origin of circular sequences.
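A rough Python sketch of what this can look like, assuming 0-based, half-open coordinates and assuming that a feature whose end is not greater than its start wraps past the origin. The function names and the 10-base genome are made up for illustration.

```python
def circular_length(start, end, genome_size):
    """Length of a 0-based, half-open feature on a circular sequence.

    Assumption: the feature is shorter than the whole sequence, so a
    length of zero never stands for a full-circle feature.
    """
    return (end - start) % genome_size


def circular_positions(start, end, genome_size):
    """0-based positions covered by the feature, in order."""
    length = circular_length(start, end, genome_size)
    return [(start + i) % genome_size for i in range(length)]


# A feature that does not cross the origin of a 10-base circular genome.
print(circular_length(2, 6, 10))     # 4
# A feature that starts at position 8 and wraps around to end before 2.
print(circular_length(8, 2, 10))     # 4
print(circular_positions(8, 2, 10))  # [8, 9, 0, 1]
```

The same modular arithmetic also works if the wrapped feature is stored as start 8, end 12, since (12 - 8) % 10 is still 4.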

SAM is 1-based. BAM is 0-based.

I prefer to write intervals in the 0-based system as [start1,end1). It is a different interpretation, but it is also a source of confusion. The 1-based, closed coordinate has no such ambiguity.

Great answer. I don't buy the "our brains are trained to think in 1-based, closed systems" argument, though. That may be true, but I don't think it's relevant. In my experience, it's rare to have features that start at the origin. That means that human beings hardly ever have to count from the origin in bioinformatics; it's always software that's doing it. So we should choose coordinate systems that make it easier for software.

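Just to add that the positions you see in SAM, including SAM text printed from a BAM file by samtools, are one-based, not zero-based; only the binary BAM encoding stores them zero-based. See http://samtools.sourceforge.net/SAM1.pdf

I reached this page when googling to find out whether BAM was zero-based and came away with the wrong impression.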

The problem with two widely used formats, BED and GFF, is that one can simply copy columns 4 and 5 of a GFF file into a BED file, and the start coordinates will then be shifted by one base pair. An easy mistake to make!
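A minimal sketch of the conversion done correctly, assuming a tab-separated GFF line with the standard nine columns and a four-column BED output. Using the GFF feature type as the BED name is just my choice for the example, and gff_line_to_bed is a hypothetical helper, not a function from any existing tool.

```python
def gff_line_to_bed(line):
    """Convert one GFF feature line to a BED4 line.

    GFF start/end (columns 4 and 5) are 1-based and closed; BED
    chromStart/chromEnd are 0-based and half-open, so only the start
    needs to be shifted.
    """
    fields = line.rstrip("\n").split("\t")
    seqid = fields[0]
    feature_type = fields[2]
    gff_start = int(fields[3])
    gff_end = int(fields[4])

    bed_start = gff_start - 1  # 1-based inclusive -> 0-based inclusive
    bed_end = gff_end          # 1-based inclusive == 0-based exclusive

    return "\t".join([seqid, str(bed_start), str(bed_end), feature_type])


gff = "chr1\texample\tgene\t11\t20\t.\t+\t.\tID=gene1"
print(gff_line_to_bed(gff))  # chr1, 10, 20, gene (tab-separated)
```

The feature is 10 bases long either way: 20 - 11 + 1 in GFF and 20 - 10 in BED.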

Also: Keith James is totally right. But it's not just for circular genomes: FlyBase has some features that are in negative coordinates (I don't know for sure why; I believe they're chromosome bands that have been mapped to locations before the sequenced region).

Also: the "interbase" interpretation of zero-based, half-open intervals makes it easier to describe indels.
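To make that concrete, here is a small Python sketch in which every edit is described by a 0-based, half-open interval plus replacement text. An insertion is then just a zero-length interval sitting between two bases. The apply_edit helper is hypothetical and only meant as an illustration.

```python
def apply_edit(sequence, start, end, replacement=""):
    """Replace the 0-based, half-open interval [start, end) of sequence.

    start == end      -> pure insertion at the interbase position `start`
    replacement == "" -> pure deletion of the bases in [start, end)
    """
    return sequence[:start] + replacement + sequence[end:]


seq = "ACGTACGT"

# Insertion of TT between the 4th and 5th bases: zero-length interval [4, 4).
print(apply_edit(seq, 4, 4, "TT"))  # ACGTTTACGT

# Deletion of the 3rd and 4th bases: interval [2, 4), empty replacement.
print(apply_edit(seq, 2, 4))        # ACACGT

# Substitution of the 1st base: interval [0, 1).
print(apply_edit(seq, 0, 1, "G"))   # GCGTACGT
```

In a 1-based, closed system there is no empty interval, so an insertion has to be described relative to a neighbouring base instead.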

The BEDTools user manual (section 1.3.4) says that BED starts are zero-based and BED ends are one-based. How does this differ from the basic zero-based system?
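As far as I can tell the two descriptions are the same thing said two ways: the one-based position of the last base is numerically equal to the zero-based, exclusive end. A small sketch with made-up function names:

```python
def bed_to_one_based_closed(chrom_start, chrom_end):
    """BED (0-based start, end as stored) -> 1-based, closed interval."""
    return chrom_start + 1, chrom_end


def one_based_closed_to_bed(start, end):
    """1-based, closed interval -> BED chromStart, chromEnd."""
    return start - 1, end


# The first 100 bases of a chromosome in BED.
chrom_start, chrom_end = 0, 100

last_base_zero_based = chrom_end - 1            # index of the last base
last_base_one_based = last_base_zero_based + 1  # its one-based position

print(last_base_one_based == chrom_end)                 # True
print(bed_to_one_based_closed(chrom_start, chrom_end))  # (1, 100)
print(one_based_closed_to_bed(1, 100))                  # (0, 100)
```

So "zero-based start, one-based end" and "zero-based, half-open" give identical numbers; only the wording differs.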

Not all widely used programming languages are 0-based. Two that are very commonly used in bioinformatics, that are 1-based, are XSLT and XQuery. (You could count these as only one, since they are both based on XPath, which is where the 1-based arrays are defined.) This list on Wikipedia has a few others.

Thanks for the ref Chris, edited accordingly.

13.1 years ago
Friend ▴ 80

Edsger Dijkstra has something to say about that: http://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF

That was a pleasure to read. Thank you.

12.6 years ago
Pascal ▴ 160

Here is a very good explanation of the different coordinate conventions.

Also very good is an overview of conventions used by file formats and data bases (from the same blog).

13.7 years ago

I would say instead that what you really intended to ask is "what are the disadvantages of index-based coordinate systems". To me, zero or one is just a choice, and neither is more intuitive than the other. Moreover, the off-by-one problem is something you will have with any starting point, and it is not uncommonly caused by the last index as well as the first.

More intuitive are solutions like:

foreach (nucleotide : dnaSequence) {
  ...
}

But how do I easily access the 5th to 10th bases?

That's cheating... now your question involved indices... :)
