The Longest Chromosome > Sizeof(Int32)
7
17
13.2 years ago

This question was inspired by Peter Cock's tweet.

Most software (C, Java, ...) uses a 32-bit integer (signed or unsigned) to store the length of a chromosome. That is not enough when the length of a chromosome is greater than INT_MAX or UINT_MAX:

#  define INT_MAX    2147483647
#  define UINT_MAX    4294967295U
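
To make the two limits concrete, here is a minimal C sketch that tests whether a length fits in a signed or an unsigned 32-bit integer. The helper names `fits_int32` and `fits_uint32` are made up for illustration, and the length is assumed to be held in a `uint64_t`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the length fits in a signed 32-bit int. */
bool fits_int32(uint64_t length)
{
    return length <= (uint64_t)INT32_MAX;   /* 2147483647 */
}

/* Hypothetical helper: true if the length fits in an unsigned 32-bit int. */
bool fits_uint32(uint64_t length)
{
    return length <= (uint64_t)UINT32_MAX;  /* 4294967295 */
}
```

A length that fails `fits_uint32` cannot be stored in either 32-bit type without truncation.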

So my question is:

  • Is there any resource where one can find the lengths of chromosomes, something like BioNumbers?
  • What is the length of the longest chromosome?
chromosome database length • 12k views
1

This is not yet urgent for the SAM/BAM specification, but while moving from int32 to uint32 could be backwards compatible, it would be unpopular with Java programmers (Java has no unsigned 32-bit primitive type). See the samtools-devel discussion, September 2011 thread "Reference length limit differs between SAM & BAM spec?"

The point of my question was: do we already know of real chromosomes that need more than uint32?

0

I don't know of a website that lists all chromosome sizes, but I believe it is possible that tree genomes (if they ever get sequenced) might have bigger chromosomes than wheat.

0

As an aside, in the exceptional case where someone has to deal with a chromosome longer than INT_MAX, splitting the chromosome in two (or three, or four!) would be a viable solution.

0

I think it would be a much cleaner solution to use a Long/long int in such a case. Splitting could easily go wrong: for example, you have to split at locations that are not overlapped by any region, and keep track of them. Then, to reconstruct the original chromosome positions, you still need to use Long.
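
The bookkeeping described here can be sketched in C. This is only an illustration under assumptions: the chunk size `CHUNK_SIZE`, the struct `split_pos_t`, and both function names are hypothetical, positions are 0-based, and the problem of features spanning a split point is ignored.

```c
#include <stdint.h>

/* Hypothetical chunk size, chosen so every offset stays below INT32_MAX. */
#define CHUNK_SIZE INT64_C(2000000000)

typedef struct {
    int32_t chunk;   /* 0-based index of the piece */
    int32_t offset;  /* 0-based position inside that piece */
} split_pos_t;

/* Map a 0-based whole-chromosome position onto (chunk, offset). */
split_pos_t split_position(int64_t pos)
{
    split_pos_t p;
    p.chunk  = (int32_t)(pos / CHUNK_SIZE);
    p.offset = (int32_t)(pos % CHUNK_SIZE);
    return p;
}

/* Rebuild the original coordinate; the result needs 64 bits. */
int64_t join_position(split_pos_t p)
{
    return (int64_t)p.chunk * CHUNK_SIZE + (int64_t)p.offset;
}
```

Each piece fits in 32 bits, but `join_position` has to return an `int64_t`, which is the point made above: the 64-bit type is needed anyway.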

0

In a C/C++ program, the best practice would be to use a typedef such as typedef uint32_t chrom_length_t; (with the fixed-width types from <stdint.h>). You then only have to change uint32_t to uint64_t, or whatever is needed, later.
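
A minimal sketch of that typedef approach, already switched to 64 bits. One caveat: any printf-style format specifier has to change together with the type, so the sketch keeps it next to the typedef. The macro `CHROM_LENGTH_FMT` and the function `format_chrom_length` are hypothetical names used only for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Single place to change if lengths outgrow the current width. */
typedef uint64_t chrom_length_t;
#define CHROM_LENGTH_FMT PRIu64   /* must match the typedef above */

/* Write the length as decimal text into buf.
   Returns snprintf's result (characters needed, excluding the NUL). */
int format_chrom_length(char *buf, size_t size, chrom_length_t length)
{
    return snprintf(buf, size, "%" CHROM_LENGTH_FMT, length);
}
```

Code that copies a `chrom_length_t` into a plain `int` somewhere else would still truncate silently, so the typedef only helps if it is used consistently.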

0

You'll never need more than 64kB of memory.

6
13.2 years ago

I just remembered reading an article on the largest genome size. It claims that the flower Paris japonica has 150 billion bases over 40 chromosomes.

I would imagine that the largest chromosome is bigger than UINT_MAX. Alas, a quick search did not turn up any sequence information. Most likely the size of the DNA was measured by other means.
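
As a rough check of that guess, the figures quoted above (150 billion bases, 40 chromosomes) give a mean of 3.75 billion bases per chromosome. That is above INT_MAX but still below UINT_MAX, so whether the largest chromosome exceeds UINT_MAX depends on how unevenly the genome is divided, which is not known here. The C sketch below just does the division; the function name `mean_chrom_length` is hypothetical.

```c
#include <stdint.h>

/* Mean chromosome length in bases (integer division).
   Returns 0 when the chromosome count is 0. */
uint64_t mean_chrom_length(uint64_t genome_bases, uint32_t n_chromosomes)
{
    if (n_chromosomes == 0) {
        return 0;
    }
    return genome_bases / n_chromosomes;
}
```

With the numbers from the post, `mean_chrom_length(150000000000, 40)` is 3750000000.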

4
13.2 years ago

On the second question: Amoeba dubia, which if I recall correctly is also known as Polychaos dubium, has a genome of 670 gigabases. While I know nothing about how this is divided among chromosomes, it is reasonable to suspect that a given chromosome would exceed the range of an int32 or even a uint32.

Refs:

  1. http://www.sciencedirect.com/science/article/pii/S0169534703003239
  2. T. Cavalier-Smith, Introduction: the evolutionary significance of genome size, T. Cavalier-Smith, Editor, The Evolution of Genome Size, John Wiley & Sons (1985), pp. 1–36.
  3. C.T. Friz, The biochemical composition of the free-living Amoebae Chaos chaos, Amoeba dubia and Amoeba proteus. Comp. Biochem. Physiol., 26 (1968), pp. 81–90.
4
13.2 years ago
Ying W ★ 4.3k

This old poster gives the largest genome size as 130 billion base pairs, for lungfish.

Biochemical measurements suggest that certain Amoeba might have larger genomes.

Also see references for this article and this one.

(Note that these are genome sizes, not chromosome sizes, but a large genome and a low chromosome count might mean the chromosomes are large, assuming most of the genome isn't mitochondrial DNA or something.)

2
10.4 years ago
Peter 6.0k

With the publication of "A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome", the initial release of the wheat 3B chromosome is 774 Mbp (to be exact, 774434471 bp in the file https://urgi.versailles.inra.fr/download/wheat/3B/ta3bPseudomolecule.genom.fa.gz).

3
13.2 years ago
Michael 55k

I just tested this with R and IRanges and found that the integer-maximum restriction applies there too.

IRanges seemingly stores its coordinates in integer variables, so genomic coordinates exceeding the maximum integer cannot be used. That would allow a maximum chromosome size of ~2.1 Gbase, right?

> bign = .Machine$integer.max
> IRanges(start=1, end=bign)
IRanges of length 1
    start end      width
[1]     1  NA 2147483647
Warning message:
In start(x) + width(x) : NAs produced by integer overflow
> IRanges(start=1, end=bign+1)
Error in solveUserSEW0(start = start, end = end, width = width) : 
  solving row 1: range cannot be determined from the supplied arguments (too many NAs)
In addition: Warning message:
In .normargSEW0(end, "end") : NAs introduced by coercion
>
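
The first warning in the transcript comes from `start(x) + width(x)`, that is 1 + 2147483647, which no longer fits in a 32-bit integer. The C sketch below shows the same arithmetic with an explicit range check. It is not how IRanges is implemented; the function name `checked_add_int32` is hypothetical, and the width is assumed to be non-negative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute start + width without signed overflow.
   Returns false (and leaves *sum untouched) if width is negative
   or the sum would not fit in an int32_t. */
bool checked_add_int32(int32_t start, int32_t width, int32_t *sum)
{
    if (width < 0 || start > INT32_MAX - width) {
        return false;
    }
    *sum = start + width;
    return true;
}
```

In C the unchecked signed addition would be undefined behaviour rather than an NA, so a check like this has to happen before the addition.
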
2
6.9 years ago

The recently sequenced tulip genome would appear to have chromosomes larger than INT_MAX, and possibly UINT_MAX. https://www.hortipoint.nl/floribusiness/dutch-consortium-unravels-first-tulip-genome/

“The tulip genome makes the human genome look tiny: the entire human genome fits into one tulip chromosome...”

2
5.8 years ago
michael.ante ★ 3.9k

Hi,

I'd like to share a bit of insight from working with such large genomes. I recently analysed wheat RNA-seq data and hit the limits of several tools.

The STAR aligner index generation required more than 126 GB of RAM - even with the sparse settings.

The BAM indexing required using the CSI index.
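
For context, the classic BAI index can only address reference sequences up to 2^29 bases (512 Mbp), which is why wheat chromosomes such as 3B (774434471 bp, quoted earlier in this thread) need CSI. A minimal C sketch of that check follows; the macro `BAI_MAX_REF_LEN` and the function `needs_csi_index` are hypothetical names, and the exact boundary handling is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest reference length a BAI index can address: 2^29 bases. */
#define BAI_MAX_REF_LEN (UINT64_C(1) << 29)   /* 536870912 */

/* Hypothetical helper: true if a reference of this length is too
   long for BAI, so a CSI index would be needed instead. */
bool needs_csi_index(uint64_t ref_length)
{
    return ref_length > BAI_MAX_REF_LEN;
}
```

Note that this limit is far below INT_MAX, so index formats can become the bottleneck before 32-bit integers do.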

RSeQC's internal index was not able to cope with the large annotation.

I guess that a lot of tools are designed for model species' genome sizes.

Cheers,

Michael

[EDIT] removed errors
