Question

Reference genome size for genome size estimation.

6 months ago

Shakunthala Natarajan • 0

I am comparing the performance of different k-mer-based genome size estimation tools. To standardise the comparison, I need a reference genome size for every organism, and I am not sure which one to use. For instance, the expected size of Arabidopsis thaliana is ~135 Mbp, but one of the largest reported assembly sizes for A. thaliana is about 148 Mbp. Should the largest reported assembly size be taken as the standard for evaluating these size estimators, or should the expected genome size be used? Or is there a better number to use? I ask because some tools seem to underestimate the genome size when the largest reported assembly is taken as the comparison standard. It would be great if someone could help me with this. Thank you!

genome-size

score 1 · Accepted Answer · 2024-10-13

Normally you use flow cytometry-based C-values as the gold standard against which to compare your bioinformatics-based estimates.

Kew's C-values database https://cvalues.science.kew.org/search lists 0.16 pg for ecotype Columbia (Col). At 978 Mbp per pg, that is 0.16 * 978 Mbp = 156.48 Mbp. It's even in the paper's title: 'Comparisons with Caenorhabditis (~100 Mb) and Drosophila (~175 Mb) Using Flow Cytometry Show Genome Size in Arabidopsis to be ~157 Mb and thus ~25 % Larger than the Arabidopsis Genome Initiative Estimate of ~125 Mb'. That's good enough for any paper or report, and it's also a chance to briefly discuss genome size variation: Kew's database goes up to 0.44 pg, so there's substantial variation.
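To make the conversion explicit, here is a minimal Python sketch of the pg-to-Mbp arithmetic. The factor 978 Mbp per pg and the two C-values (0.16 pg and 0.44 pg) are the ones quoted above; the names `PG_TO_MBP` and `c_value_to_mbp` are only illustrative.

```python
# Conversion factor used in the answer: 1 pg of DNA corresponds to ~978 Mbp.
PG_TO_MBP = 978.0


def c_value_to_mbp(picograms: float) -> float:
    """Convert a C-value in picograms to a genome size in Mbp."""
    return picograms * PG_TO_MBP


# C-values quoted in the answer (Kew database).
col_mbp = c_value_to_mbp(0.16)    # ecotype Columbia (Col)
upper_mbp = c_value_to_mbp(0.44)  # largest value mentioned in the answer

print(f"Col: {col_mbp:.2f} Mbp")
print(f"Upper value: {upper_mbp:.2f} Mbp")
```

The same function can be applied to the C-value of any other organism in your benchmark, so that every reference size is derived in the same way.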

The original A. thaliana genome paper lists estimates from 80 Mbp to 150 Mbp: https://www.pnas.org/doi/pdf/10.1073/pnas.92.24.10831
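For the benchmarking itself, one possible way to score a size figure against the flow cytometry reference is a signed percent deviation. The sketch below applies it to the two numbers from the question (135 Mbp and 148 Mbp) with the Col C-value (0.16 pg at 978 Mbp per pg) as the reference. The function name `percent_deviation` and the choice of metric are assumptions, not something the tools or the database prescribe.

```python
def percent_deviation(estimate_mbp: float, reference_mbp: float) -> float:
    """Signed deviation of an estimate from the reference, in percent.

    Negative values mean the estimate is below the reference.
    """
    return (estimate_mbp - reference_mbp) / reference_mbp * 100.0


# Reference: flow cytometry C-value for Col, converted to Mbp (0.16 pg * 978 Mbp/pg).
reference_mbp = 0.16 * 978.0

# Figures taken from the question; replace with each tool's k-mer-based estimate.
figures_mbp = {
    "expected size": 135.0,
    "largest reported assembly": 148.0,
}

deviations = {
    label: percent_deviation(size, reference_mbp)
    for label, size in figures_mbp.items()
}

for label, dev in deviations.items():
    print(f"{label}: {dev:+.1f} % relative to {reference_mbp:.2f} Mbp")
```

Both figures from the question come out below the flow cytometry value, which shows that the choice of reference changes how much a tool appears to underestimate.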