Question

The Longest Chromosome > Sizeof(Int32)

17

13.5 years ago

Pierre Lindenbaum 165k

This question was inspired by Peter Cock's tweet:

Ahem. Anyone know what longest known chromosome is? Is it *wheat* chromosome 3B at 995 Mbp (almost 1Gbp, a billion)? http://t.co/rvIWX1NC
— Peter Cock (@pjacock) September 25, 2011

Most software (C, Java...) uses an int32 (signed or unsigned) to store the length of a chromosome. That isn't enough when a chromosome is longer than INT_MAX or UINT_MAX:

#  define INT_MAX    2147483647
#  define UINT_MAX    4294967295U
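
To make the two limits concrete, here is a minimal Python sketch that checks whether a length can be packed into a signed or unsigned 32-bit field. The helper names are hypothetical, and the 995 Mbp figure is simply the one quoted in the tweet above.

```python
import struct

INT32_MAX = 2**31 - 1    # 2147483647, the INT_MAX value shown above
UINT32_MAX = 2**32 - 1   # 4294967295, the UINT_MAX value shown above


def fits_int32(length):
    """Return True if length can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", length)
        return True
    except struct.error:
        return False


def fits_uint32(length):
    """Return True if length can be packed as an unsigned 32-bit integer."""
    try:
        struct.pack("<I", length)
        return True
    except struct.error:
        return False


# Wheat 3B at roughly 995 Mbp (figure from the tweet) still fits a signed int32.
print(fits_int32(995_000_000))
# One base past INT32_MAX needs an unsigned field; one past UINT32_MAX needs 64 bits.
print(fits_int32(INT32_MAX + 1), fits_uint32(INT32_MAX + 1), fits_uint32(UINT32_MAX + 1))
```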

So my question is:

Is there any resource where one can find chromosome lengths, something like BioNumbers?
What is the length of the longest known chromosome?

chromosome database length • 13k views

ADD COMMENT • link updated 6.1 years ago by michael.ante ★ 4.0k • written 13.5 years ago by Pierre Lindenbaum 165k

1

Not yet urgent for the SAM/BAM specification, but while moving from int32 to uint32 could be backwards compatible, it will be unpopular with the Java programmers. See the samtools-devel discussion, September 2011 thread "Reference length limit differs between SAM & BAM spec?"

The point of my question was: do we already know of real chromosomes where we need more than uint32?

ADD REPLY • link 13.4 years ago by Peter 6.0k

0

I don't know of a website that has all the chromosome sizes, but I believe it is possible that tree genomes (if they ever get sequenced) might have bigger chromosomes than wheat.

ADD REPLY • link 13.5 years ago by Ying W ★ 4.3k

0

As an aside, in the exceptional case where someone has to deal with a chromosome longer than INT_MAX, splitting the chromosome in two (or three, or four!) would be a viable solution.

ADD REPLY • link 13.5 years ago by Eric Fournier ★ 1.4k

0

I think it would be a much cleaner solution to use a Long/long int in such a case. Splitting can easily go wrong: you have to split at locations that are not overlapped by any region, and you have to keep track of the split points. Then, to reconstruct the original chromosome positions, you still need to use Long.
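
A rough Python sketch of the bookkeeping that splitting implies, under assumptions not taken from the thread: 0-based positions, half-open intervals, and a hypothetical fixed chunk size of 2^30 bases. Python integers do not overflow, so this only illustrates the arithmetic; in C or Java the reconstructed global position would need a 64-bit type, which is the point made above.

```python
CHUNK_SIZE = 2**30  # hypothetical chunk length, chosen to stay below INT_MAX


def to_chunk(position, chunk_size=CHUNK_SIZE):
    """Map a 0-based chromosome position to (chunk index, offset in chunk)."""
    return divmod(position, chunk_size)


def to_global(chunk_index, offset, chunk_size=CHUNK_SIZE):
    """Rebuild the original position; this value can exceed INT_MAX."""
    return chunk_index * chunk_size + offset


def spans_boundary(start, end, chunk_size=CHUNK_SIZE):
    """True if the half-open interval [start, end) crosses a chunk boundary."""
    return start // chunk_size != (end - 1) // chunk_size


chunk, offset = to_chunk(3_000_000_000)
print(chunk, offset)              # chunk index and an offset that fits in int32
print(to_global(chunk, offset))   # back to the original position
```

Any feature for which spans_boundary returns True would have to be cut in two or the split point moved, which is the extra complication described above.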

ADD REPLY • link 13.5 years ago by Michael 55k

0

In a C/C++ program, the best practice would be to use a typedef, e.g. typedef uint32 chrom_length_t. Later you'll just have to change uint32 to uint64 (or whatever) in that one place.

ADD REPLY • link updated 6.1 years ago by Ram 45k • written 13.5 years ago by Pierre Lindenbaum 165k

0

You'll never need more than 64kB of memory.

ADD REPLY • link 7.2 years ago by Matt Shirley 10k

3

13.5 years ago

Michael 55k

I just tested this with R and IRanges and found that the maxint restriction is in effect there too.

IRanges seemingly stores its coordinates in integer variables, so genomic coordinates exceeding maxint cannot be used. That would allow for a maximal chromosome size of ~2.1 Gbase, right?

> bign = .Machine$integer.max
> IRanges(start=1, end=bign)
IRanges of length 1
    start end      width
[1]     1  NA 2147483647
Warning message:
In start(x) + width(x) : NAs produced by integer overflow
> IRanges(start=1, end=bign+1)
Error in solveUserSEW0(start = start, end = end, width = width) : 
  solving row 1: range cannot be determined from the supplied arguments (too many NAs)
In addition: Warning message:
In .normargSEW0(end, "end") : NAs introduced by coercion
>

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 13.5 years ago by Michael 55k

2

7.2 years ago

g.m.carstairs ▴ 20

The recently sequenced tulip genome would appear to have chromosomes larger than INT_MAX, and possibly UINT_MAX. https://www.hortipoint.nl/floribusiness/dutch-consortium-unravels-first-tulip-genome/

“The tulip genome makes the human genome look tiny: the entire human genome fits into one tulip chromosome...”

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 7.2 years ago by g.m.carstairs ▴ 20

2

6.1 years ago

michael.ante ★ 4.0k

Hi,

I'd like to give you a bit of insight into working with such large genomes. I recently analysed wheat RNA-Seq data and hit the limits of several tools.

The STAR aligner index generation required more than 126 GB of RAM - even with the sparse settings.

The BAM indexing required using the CSI index.

RSeQC's internal index was not able to cope with the large annotation.
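
As background on the CSI point, here is a hedged Python sketch of why a long wheat chromosome does not fit a classic BAI index. It assumes the usual description of the binning scheme, in which an index with parameters min_shift and depth covers coordinates up to 2^(min_shift + 3*depth), with BAI fixed at min_shift=14 and depth=5. The function names are hypothetical, and the 774434471 bp length is the wheat 3B pseudomolecule figure quoted elsewhere in this thread.

```python
def max_indexable_length(min_shift, depth):
    """Largest reference length a binning index with these parameters covers."""
    return 2 ** (min_shift + 3 * depth)


# BAI uses fixed parameters, giving a 2**29 (512 Mbp) ceiling per reference.
BAI_LIMIT = max_indexable_length(min_shift=14, depth=5)


def needs_csi(chrom_length):
    """True if a reference is too long for the fixed BAI binning scheme."""
    return chrom_length > BAI_LIMIT


def min_csi_depth(chrom_length, min_shift=14):
    """Smallest depth whose binning scheme covers chrom_length."""
    depth = 0
    while max_indexable_length(min_shift, depth) < chrom_length:
        depth += 1
    return depth


wheat_3b = 774_434_471  # bp, figure quoted in this thread
print(BAI_LIMIT, needs_csi(wheat_3b), min_csi_depth(wheat_3b))
```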

I guess that a lot of tools are designed for model species' genome sizes.

Cheers,

Michael

[EDIT] removed errors

ADD COMMENT • link 6.1 years ago by michael.ante ★ 4.0k

Ram · Accepted Answer · 2011-09-28

6

13.5 years ago

Istvan Albert 102k

I just remembered reading an article on the largest genome size. It claims that the flower Paris japonica has 150 billion bases over 40 chromosomes.

I would imagine that the largest chromosome is bigger than UINT_MAX. Alas, a quick search did not turn up any sequence information. Most likely the size of the DNA was measured by other means.
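
A back-of-the-envelope check in Python, taking the quoted figures at face value and assuming the 150 billion bases really are spread over 40 chromosomes (neither is verified here). The longest chromosome can be no shorter than the mean, so under these assumptions it must exceed INT_MAX; whether it also exceeds UINT_MAX depends on how unevenly the chromosomes are sized.

```python
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1


def mean_chromosome_length(genome_size, chromosome_count):
    """Average chromosome length, a lower bound for the longest chromosome."""
    return genome_size / chromosome_count


# Figures as quoted in the post: 150 billion bases, 40 chromosomes.
mean_len = mean_chromosome_length(150e9, 40)

print(mean_len)                # 3.75 billion bases on average
print(mean_len > INT32_MAX)    # the mean alone already overflows a signed int32
print(mean_len > UINT32_MAX)   # the mean still fits an unsigned int32
```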

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 13.5 years ago by Istvan Albert 102k

Ram · Accepted Answer · 2011-09-28

Second question: Amoeba dubia, which if I recall correctly is also known as Polychaos dubium, has a genome that is 670 gigabases. While I know nothing about how this is divided among chromosomes, it is reasonable to suspect that a given chromosome would exceed an int32 or even a uint32.

Refs:

http://www.sciencedirect.com/science/article/pii/S0169534703003239
T. Cavalier-Smith, Introduction: the evolutionary significance of genome size, T. Cavalier-Smith, Editor, The Evolution of Genome Size, John Wiley & Sons (1985), pp. 1–36.
C.T. Friz, The biochemical composition of the free-living Amoebae Chaos chaos, Amoeba dubia and Amoeba proteus. Comp. Biochem. Physiol., 26 (1968), pp. 81–90.

Ram · Accepted Answer · 2011-09-29

This old poster gives the highest genome size as 130 billion base pairs, for lungfish.

Biochemical measurements suggest that certain Amoeba might have larger genomes.

Also see references for this article and this one.

(Note that these are genome sizes, not chromosome sizes. Still, a large genome combined with a low chromosome count might mean the chromosomes are large, assuming most of the genome isn't mitochondrial DNA or something.)

Ram · Accepted Answer · 2014-07-18

2

10.7 years ago

Peter 6.0k

With the publication of "A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome", the initial release of the wheat 3B chromosome is 774 Mbp (or to be exact, 774434471 bp in the file https://urgi.versailles.inra.fr/download/wheat/3B/ta3bPseudomolecule.genom.fa.gz).

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 10.7 years ago by Peter 6.0k