Question

What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems

29

13.8 years ago

Casey Bergman 18k

One of the most common gotchas I encounter introducing students to bioinformatics is the off-by-one coordinate shift problem(s) that arise when switching between one-based (e.g. BLAST) and zero-based (e.g. UCSC) genome coordinate systems.

I have yet to find a clear exposition of the differences between these two major coordinate systems (and their minor variants), and have tried to discuss the differences in a past blog post, but I don't feel confident I've covered all the bases on this issue.

The fact that this is not an obvious problem to all has come up in a recent BioStar post and its comments, and I was hoping that we could use this forum to discuss the relative merits of both systems.

coordinates genome • 25k views

ADD COMMENT • link updated 10.1 years ago by Biostar 20 • written 13.8 years ago by Casey Bergman 18k

6

https://twitter.com/#!/dasmoth/status/42189749825449985

"If it doesn't have off-by-one errors, it isn't bioinformatics."

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 13.8 years ago by Pierre Lindenbaum 164k

Ram · Answer 1 · 2011-03-13

44

13.8 years ago

Aaronquinlan 12k

0-based, half open systems allow cheap length calculations. That is, m-n instead of (m-n)+1 in a 1-based, closed system. Also, 0-based is convenient for programming; most widely-used programming languages use 0-based arrays. Another example is calculating overlap. To calculate the degree of overlap between two 0-based, half-open intervals, you can use the following:

a = [start1, end1)
b = [start2, end2)
overlap(a,b) = min(end1,end2) - max(start1,start2)

whereas with a one-based system it is:

a = [start1, end1]
b = [start2, end2]
overlap(a,b) = min(end1,end2) - max(start1,start2) + 1

The beauty of the above approach with 0-based is that if two intervals do not overlap, then the recipe will return a negative value whose absolute value is the distance between the two features.
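The two recipes above can be sketched in Python as follows. This is only an illustration; the function names `overlap_half_open` and `overlap_closed` are hypothetical and do not come from any library.

```python
def overlap_half_open(start1, end1, start2, end2):
    """Overlap of two 0-based, half-open intervals [start, end)."""
    return min(end1, end2) - max(start1, start2)


def overlap_closed(start1, end1, start2, end2):
    """Overlap of two 1-based, closed intervals [start, end]."""
    return min(end1, end2) - max(start1, start2) + 1


# The same pair of features written in each convention.
print(overlap_half_open(0, 10, 5, 15))  # shared bases are indices 5..9
print(overlap_closed(1, 10, 6, 15))     # the same bases, positions 6..10

# Disjoint half-open intervals: the result is negative, and its absolute
# value is the number of bases lying between the two features.
print(overlap_half_open(0, 10, 13, 20))
```

Both calls on the overlapping pair report five shared bases; the disjoint pair gives -3, since bases 10, 11 and 12 separate the two features.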

So, for programming, I much prefer 0-based, as it prevents tons of extra (ugly and more expensive) -1 and +1 operations in one's code.

The counterargument is that our brains are trained to think in 1-based, closed systems. I suspect the designers of various formats such as BED (0-based), BAM (0-based), VCF (1-based), and GFF (1-based) made conscious decisions regarding the coordinate system based on the intent of the format. For example, BED is a fundamental format in the UCSC browser and much of the underlying code depends on it. Thus, the coordinate system is 0-based for speed and code cleanliness. Similarly, BAM requires efficiency. In contrast, perhaps the designers of VCF and GFF were more concerned with "readability" of the format?

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.8 years ago by Aaronquinlan 12k

4

And in a zero-based system, the start/end/length calculations still work for sequence features that pass across the origin of circular sequences.
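One possible reading of this, sketched in Python: assuming a circular sequence of `genome_length` bases and 0-based, half-open coordinates, a feature that crosses the origin has `end < start`, and its length is still `end - start` once taken modulo the genome length. The function name `circular_length` is hypothetical.

```python
def circular_length(start, end, genome_length):
    """Length of a 0-based, half-open feature [start, end) on a circular sequence.

    A feature that crosses the origin has end < start; the modulo wraps it.
    Assumes the feature is shorter than the full circle.
    """
    return (end - start) % genome_length


print(circular_length(10, 30, 100))  # ordinary feature, indices 10..29
print(circular_length(90, 10, 100))  # crosses the origin: indices 90..99 and 0..9
```

Both features are 20 bases long, and no +1 or -1 correction is needed in either case.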

ADD REPLY • link 13.8 years ago by biobot 0.0.77.a.1099 6.2k

2

SAM is 1-based. BAM is 0-based.

ADD REPLY • link 12.8 years ago by Aaronquinlan 12k

1

I prefer to use [start1,end1) in the 0-based system. It is a different interpretation, but also a source of confusion. The 1-based coordinate has no such ambiguity.

ADD REPLY • link 13.8 years ago by lh3 33k

1

Great answer. I don't buy the "our brains are trained to think in 1-based, closed systems" argument, though. That may be true, but I don't think it's relevant. In my experience, it's rare to have features that start at the origin. That means that human beings hardly ever have to count from the origin in bioinformatics; it's always software that's doing it. So we should choose coordinate systems that make it easier for software.

ADD REPLY • link 13.8 years ago by Mitch Skinner ▴ 660

1

Just to add that SAM/BAM is one-based, not zero-based. See http://samtools.sourceforge.net/SAM1.pdf

I reached this page when googling to find out whether BAM was zero based and got the wrong answer.

ADD REPLY • link 12.8 years ago by Fidel ★ 2.0k

0

The problem with the widely used BED and GFF formats is that one can simply copy columns 4 and 5 of a GFF file to generate a BED file, and the start coordinates will then be shifted by one base pair. An easy mistake to make!
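A minimal sketch of the conversion that avoids this shift, assuming GFF intervals are 1-based and closed and BED intervals are 0-based and half-open. The helper name `gff_to_bed_interval` and the example GFF line are made up for illustration.

```python
def gff_to_bed_interval(gff_start, gff_end):
    """Convert a 1-based, closed GFF interval to a 0-based, half-open BED interval.

    Only the start changes; the end keeps the same numeric value.
    """
    return gff_start - 1, gff_end


# A hypothetical tab-separated GFF line; columns 4 and 5 hold start and end.
gff_line = "chr1\texample\tgene\t101\t200\t.\t+\t.\tID=gene1"
fields = gff_line.split("\t")
bed_start, bed_end = gff_to_bed_interval(int(fields[3]), int(fields[4]))

# BED needs chromosome, start and end as its first three columns.
print("\t".join([fields[0], str(bed_start), str(bed_end)]))
```

The feature covers 100 bases in both files: 200 - 101 + 1 in GFF terms, and 200 - 100 in BED terms.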

ADD REPLY • link 13.8 years ago by Alper Yilmaz ▴ 100

0

Also: Keith James is totally right. But it's not just for circular genomes: FlyBase has some features with negative coordinates (I don't know for sure why; I believe they're chromosome bands that have been mapped to locations before the sequenced region).

ADD REPLY • link 13.8 years ago by Mitch Skinner ▴ 660

0

Also: the "interbase" interpretation of zero-based, half-open intervals makes it easier to describe indels.
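A small Python sketch of the interbase view, assuming 0-based coordinates that name the gaps between bases rather than the bases themselves. The helper names are hypothetical. An insertion becomes a zero-length interval [position, position), which a closed interval cannot express directly.

```python
def apply_insertion(seq, position, inserted):
    """Insert at interbase `position`, i.e. the empty interval [position, position)."""
    return seq[:position] + inserted + seq[position:]


def apply_deletion(seq, start, end):
    """Delete the bases in the half-open interval [start, end)."""
    return seq[:start] + seq[end:]


reference = "ACGTACGT"
print(apply_insertion(reference, 4, "TT"))  # insert between the 4th and 5th bases
print(apply_deletion(reference, 2, 5))      # remove the three bases G, T, A
```

The same two numbers describe both events: a deletion has `end > start`, and an insertion has `end == start`.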

ADD REPLY • link 13.8 years ago by Mitch Skinner ▴ 660

0

The BEDTools user manual (section 1.3.4) mentions that BED starts are zero-based and BED ends are one-based. How does this differ from the basic zero-based system?
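As far as I can tell, the two descriptions name the same interval: the exclusive end of a 0-based, half-open interval has the same value as the 1-based position of its last base. A short Python check, using a made-up sequence and interval:

```python
sequence = "ACGTACGTAC"
start, end = 2, 6  # a hypothetical BED interval

# Read as 0-based, half-open: Python slicing uses exactly this convention.
covered = sequence[start:end]

# Read as "0-based start, 1-based end": convert the start to 1-based,
# keep the end as is, and take the closed interval of 1-based positions.
first_position = start + 1
last_position = end
covered_one_based = "".join(
    sequence[pos - 1] for pos in range(first_position, last_position + 1)
)

print(covered, covered_one_based)
print(len(covered) == end - start)
```

Both readings select the same four bases, so "zero-based start, one-based end" is just another way of saying zero-based, half-open.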

ADD REPLY • link updated 5.3 years ago by Ram 44k • written 13.2 years ago by Rahul ▴ 40

0

Not all widely used programming languages are 0-based. Two 1-based languages that are very commonly used in bioinformatics are XSLT and XQuery. (You could count these as only one, since they are both based on XPath, which is where the 1-based arrays are defined.) This list on Wikipedia has a few others.

ADD REPLY • link 12.7 years ago by Chris Maloney ▴ 360

0

Thanks for the ref Chris, edited accordingly.

ADD REPLY • link 12.7 years ago by Aaronquinlan 12k

score 8 · Answer 2 · 2011-10-23

8

13.2 years ago

Friend ▴ 80

Edsger Dijkstra has something to say about that: http://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF

ADD COMMENT • link 13.2 years ago by Friend ▴ 80

2

That was a pleasure to read. Thank you.

ADD REPLY • link 12.7 years ago by Aaronquinlan 12k

Ram · Answer 3 · 2012-04-03

3

12.7 years ago

Pascal ▴ 160

Here is a very good explanation of the different coordinate conventions.

Also very good is an overview of the conventions used by file formats and databases (from the same blog).

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 12.7 years ago by Pascal ▴ 160

Ram · Answer 4 · 2011-03-13

1

13.8 years ago

Egon Willighagen 5.4k

I would say instead that what you really intended to ask is "what are the disadvantages of index-based coordinate systems". To me, zero or one is just a choice, and neither is more intuitive than the other. Moreover, the off-by-one problem is something you will have with any starting point, and it is not uncommonly caused by the last index as well as the first.