This question was inspired by Peter Cock's tweet:
Ahem. Anyone know what longest known chromosome is? Is it *wheat* chromosome 3B at 995 Mbp (almost 1Gbp, a billion)? http://t.co/rvIWX1NC
— Peter Cock (@pjacock) September 25, 2011
Most software (C, Java...) uses an int32 (signed or unsigned) to store the length of a chromosome. That isn't enough when the length of a chromosome is greater than INT_MAX or UINT_MAX:
# define INT_MAX 2147483647
# define UINT_MAX 4294967295U
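To make those limits concrete, here is a minimal C sketch that checks whether a given chromosome length still fits in a signed or unsigned 32-bit integer. The helper names fits_in_int32 and fits_in_uint32 are made up for illustration. INT32_MAX and UINT32_MAX come from <stdint.h> and match the INT_MAX/UINT_MAX values above on platforms where int is 32 bits wide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a length in bp fits in a signed 32-bit int. */
static bool fits_in_int32(uint64_t length_bp) {
    return length_bp <= (uint64_t)INT32_MAX;
}

/* Hypothetical helper: true if a length in bp fits in an unsigned 32-bit int. */
static bool fits_in_uint32(uint64_t length_bp) {
    return length_bp <= (uint64_t)UINT32_MAX;
}
```

With these, the 995 Mbp wheat chromosome from the tweet (995000000 bp) passes both checks, while anything from 2147483648 bp upwards fails the signed one.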
So my question is:
- Is there any resource where one can find the lengths of chromosomes? Something like BioNumbers.
- What's the length of the longest chromosome?
Not yet urgent for the SAM/BAM specification, but while moving from int32 to uint32 could be backwards compatible, it will be unpopular with the Java programmers. See the samtools-devel discussion, September 2011 thread "Reference length limit differs between SAM & BAM spec?"
The point of my question was: do we already know of real chromosomes for which we need more than uint32?
I don't know of a website that has all the chromosome sizes, but I believe it is possible that tree genomes (if they ever get sequenced) might have bigger chromosomes than wheat.
As an aside, in the exceptional case where someone has to deal with a chromosome longer than INT_MAX, splitting the chromosome in two (or three, or four!) would be a viable solution.
I think it would be a much cleaner solution to use a Long/long int in such a case. Splitting could easily go wrong: e.g. you have to split at locations which are not overlapped by any region and keep track of them. Then, to reconstruct the original chromosome positions, you still need to use Long.
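To illustrate that last point, here is a rough C sketch of mapping a position inside one piece of a split chromosome back to the original coordinate. The struct chrom_piece_t and the function to_original_pos are hypothetical names, and 0-based positions are an assumption. Each piece fits in 32 bits, but the offset and the reconstructed position still need a 64-bit type.

```c
#include <stdint.h>

/* Hypothetical record for one piece of a split chromosome. */
typedef struct {
    uint64_t offset;  /* 0-based start of the piece on the original chromosome */
    uint32_t length;  /* piece length, chosen small enough for 32 bits */
} chrom_piece_t;

/* Convert a 0-based position within a piece to the original coordinate.
 * The sum is computed in 64 bits, so it cannot wrap at UINT32_MAX. */
static uint64_t to_original_pos(const chrom_piece_t *piece, uint32_t local_pos) {
    return piece->offset + (uint64_t)local_pos;
}
```

For a piece starting at 4000000000 with a local position of 999999999, the original coordinate is 4999999999, which no longer fits in a uint32.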
In a C/C++ program, the best practice would be to use a typedef (with the fixed-width types from <stdint.h>):
typedef uint32_t chrom_length_t;
You'll just have to change uint32_t to uint64_t or whatever later. You'll never need more than 64kB of memory.
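A slightly fuller sketch of the typedef idea, assuming C99 and <inttypes.h>. The catch with changing the width later is that printf format strings have to change too, so it helps to define the format macro next to the typedef. PRI_CHROM_LENGTH and format_chrom_length are names invented for this example, and picking uint64_t here is only an assumption.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Single place to change the width later. */
typedef uint64_t chrom_length_t;
/* Matching printf format; must be kept in sync with the typedef above. */
#define PRI_CHROM_LENGTH PRIu64

/* Write a length as decimal text into buf.
 * Returns what snprintf returns: the number of characters needed. */
static int format_chrom_length(char *buf, size_t size, chrom_length_t len) {
    return snprintf(buf, size, "%" PRI_CHROM_LENGTH, len);
}
```

Switching back to 32 bits would then mean editing just two lines: the typedef and the macro (to PRIu32).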