Most software (C, Java...) uses an int32 (signed or unsigned) to store the length of a chromosome. That isn't enough when a chromosome is longer than INT_MAX or UINT_MAX.
Not yet urgent for the SAM/BAM specification, but while moving from int32 to uint32 could be backwards compatible, it will be unpopular with Java programmers. See the samtools-devel discussion, September 2011 thread "Reference length limit differs between SAM & BAM spec?"
The point of my question was: do we already know of real chromosomes that need more than uint32?
I don't know of a website that has all the chromosome sizes, but I believe it is possible that tree genomes (if they ever get sequenced) might have bigger chromosomes than wheat.
As an aside, in the exceptional case where someone has to deal with a chromosome longer than INT_MAX, splitting the chromosome in two (or three, or four!) would be a viable solution.
I think it would be a much cleaner solution to use a Long/long int in such a case. Splitting could easily go wrong: for example, you have to split at locations that are not overlapped by any region, and keep track of them. Then, to reconstruct the original chromosome positions, you still need to use Long.
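To illustrate the last point, here is a minimal sketch of the bookkeeping that splitting would need. The names (PIECE_LEN, split_pos_t, to_split, from_split) and the choice of INT32_MAX as the piece size are assumptions made up for this example, not anything from an existing tool. Note that the reconstructed position has to be a 64-bit value anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical piece size: every piece except possibly the last
 * holds this many bases. */
#define PIECE_LEN ((uint64_t)INT32_MAX)

/* A position expressed as (piece index, 0-based offset in that piece). */
typedef struct {
    uint32_t piece;
    uint32_t offset;
} split_pos_t;

/* Map a 0-based position on the full chromosome to its piece and offset. */
static split_pos_t to_split(uint64_t pos)
{
    split_pos_t s;
    s.piece  = (uint32_t)(pos / PIECE_LEN);
    s.offset = (uint32_t)(pos % PIECE_LEN);
    return s;
}

/* Rebuild the original position; the result only fits in 64 bits. */
static uint64_t from_split(split_pos_t s)
{
    assert(s.offset < PIECE_LEN);
    return (uint64_t)s.piece * PIECE_LEN + s.offset;
}
```

Calling to_split(5000000000ULL) gives piece 2, and from_split() on that result returns the original 5000000000. This sketch ignores the harder part mentioned above, namely choosing split points that no region overlaps.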
I just remembered reading an article on the largest genome size. It claims the flower Paris japonica has 150 billion bases over 40 chromosomes, so I would imagine that the largest chromosome is bigger than UINT_MAX. Alas, a quick search did not turn up any sequence information. Most likely the size of the DNA was measured by other means.
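A quick back-of-the-envelope check of those numbers, as a sketch: it takes the 150 billion bases and 40 chromosomes quoted above at face value and assumes, unrealistically, that all chromosomes are the same size. The helper names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mean chromosome length, assuming equally sized chromosomes. */
static uint64_t mean_chrom_len(uint64_t genome_bases, uint32_t n_chroms)
{
    assert(n_chroms > 0);
    return genome_bases / n_chroms;
}

/* Does a length fit in a signed / unsigned 32-bit integer? */
static bool fits_int32(uint64_t len)  { return len <= (uint64_t)INT32_MAX; }
static bool fits_uint32(uint64_t len) { return len <= (uint64_t)UINT32_MAX; }
```

With mean_chrom_len(150000000000ULL, 40) the mean comes out at 3,750,000,000 bases: already too large for a signed int32, but still within uint32. The largest chromosome would need to be roughly 15% longer than the mean to pass UINT_MAX (4,294,967,295).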
Second question: Amoeba dubia, which if I recall correctly is also known as Polychaos dubium, has a genome that is 670 gigabases. While I know nothing about how this is divided among chromosomes, it is reasonable to suspect that a given chromosome would exceed an int32 or even a uint32.
T. Cavalier-Smith, Introduction: the evolutionary significance of genome size, T. Cavalier-Smith, Editor, The Evolution of Genome Size, John Wiley & Sons (1985), pp. 1–36.
C.T. Friz, The biochemical composition of the free-living Amoebae Chaos chaos, Amoeba dubia and Amoeba proteus. Comp. Biochem. Physiol., 26 (1968), pp. 81–90.
(Note that these are genome sizes, not chromosome sizes, but a large genome size and a low chromosome count might mean the chromosomes are large, assuming most of the genome isn't mitochondrial DNA or something.)
Just tested this with R and IRanges and found that the restriction of maxint is in effect there too.
IRanges seemingly stores its coordinates in integer variables; therefore, genomic coordinates exceeding maxint cannot be used. That would allow for a maximal chromosome size of ~2.1 Gbase, right?
> bign = .Machine$integer.max
> IRanges(start=1, end=bign)
IRanges of length 1
start end width
[1] 1 NA 2147483647
Warning message:
In start(x) + width(x): NAs produced by integer overflow
> IRanges(start=1, end=bign+1)
Error in solveUserSEW0(start = start, end = end, width = width):
solving row 1: range cannot be determined from the supplied arguments (too many NAs)
In addition: Warning message:
In .normargSEW0(end, "end"): NAs introduced by coercion
>
In a C/C++ program, the best practice would be to use a typedef (uint32_t comes from <stdint.h>):
typedef uint32_t chrom_length_t;
Then you'll just have to change uint32_t to uint64_t or whatever later. You'll never need more than 64kB of memory.
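The typedef idea can be taken one small step further, sketched here with the 64-bit width already switched on. The PRI_CHROM_LEN macro and the format_chrom_length helper are hypothetical names invented for this example. The point is that the printf format has to change together with the type, so it helps to define both in one place.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One place to change the width of every chromosome length. */
typedef uint64_t chrom_length_t;

/* Keep the printf conversion in sync with the typedef above. */
#define PRI_CHROM_LEN PRIu64

/* Hypothetical helper: write len as decimal text into buf.
 * Returns the number of characters snprintf wanted to write. */
static int format_chrom_length(char *buf, size_t size, chrom_length_t len)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%" PRI_CHROM_LEN, len);
}
```

Code that hard-codes "%d" or "%u" in its format strings would silently truncate or misprint lengths after the typedef is widened, which is why the macro sits next to the type.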