MACS has the following as part of its help text:
-g GSIZE, --gsize=GSIZE
Effective genome size. It can be 1.0e+9 or 1000000000,
or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
(1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
fruitfly (1.2e8), Default:hs
Specifically, note that the human genome size is given as 2.7 billion bases. However, my understanding is that the human genome size is more like 3 billion bases. When I add up all the chromosome lengths from the UCSC fetchChromSizes
utility, I get about 3.1 billion:
$ fetchChromSizes hg19 2>/dev/null | perl -lanE 'BEGIN { say my $sum = 0; } $sum += int $F[1]; END { say "Total: $sum" }'
Total: 3137161264
Even just considering the "main" chromosomes and skipping the "chrUn" and the "_random" ones, the total is still over 3 billion.
So where does MACS's 2.7 billion figure come from? Is this genome size adjusted for repetitiveness or mappability or something?
I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (the primary assembly only) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.
I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (chromosomes plus unplaced/unlocalized contigs) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.