Why Does Macs Use A Genome Size Of 2.7 Billion Instead Of 3 Billion For Human?
2
9
Entering edit mode
12.7 years ago
Ryan Thompson ★ 3.6k

MACS has the following as part of its help text:

  -g GSIZE, --gsize=GSIZE
                        Effective genome size. It can be 1.0e+9 or 1000000000,
                        or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
                        (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
                        fruitfly (1.2e8), Default:hs

Specifically, note that the human genome size is given as 2.7 billion bases. However, my understanding is that the human genome size is more like 3 billion bases. When I add up all the chromosome lengths from the UCSC fetchChromSizes utility, I get about 3.1 billion:

$ fetchChromSizes hg19 2>/dev/null | perl -lanE 'BEGIN { say my $sum = 0; } $sum += int $F[1]; END { say "Total: $sum" }'
Total: 3137161264

Even just considering the "main" chromosomes and skipping the "chrUn" and the "_random" ones, the total is still over 3 billion.

So where does MACS's 2.7 billion figure come from? Is this genome size adjusted for repetitiveness or mappability or something?

macs genome • 9.5k views
ADD COMMENT
0
Entering edit mode

I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (the primary assembly only) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.

ADD REPLY
0
Entering edit mode

I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (chromosomes plus unplaced/unlocalized contigs) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.

ADD REPLY
11
Entering edit mode
12.7 years ago

Is this genome size adjusted for repetitiveness or mappability or something?

Yes.

See the MACS manual:

It's the mappable genome size or effective genome size which is defined as the genome size which can be sequenced. Because of the repetitive features on the chromsomes, the actual mappable genome size will be smaller than the original size, about 90% or 70% of the genome size. The default hs — 2.7e9 is recommended for UCSC human hg18 assembly. Here are all precompiled parameters for effective genome size

http://liulab.dfci.harvard.edu/MACS/00README.html

ADD COMMENT
0
Entering edit mode

If I took all the long runs of N's out of the human genome sequences and counted the number of remaining nucleotides, would I get about 2.7e9? Or is it more complicated than that? Does repetitiveness or redundancy come into play (for example, do you only the X/Y pseudoautosomal region once instead of twice)?

ADD REPLY
0
Entering edit mode

I have not looked at that myself - just took their word for it.

ADD REPLY
0
Entering edit mode

The PeakSeq publication has a table with different 'effective' genome sizes. See http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html

ADD REPLY
5
Entering edit mode
12.7 years ago
Fidel ★ 2.0k

MACS uses the 'effective' genome size to compute the probability of detecting a read, considering that only mappable reads were taken into account. Thus, it assumes that you left out from your mapping all reads that map to more than one position.

This parameter is also specific for the read length as the fraction of the genome that is unmappable, decreases for longer reads.

The PeakSeq publication has a table where they present different genome sizes together with the 'effective' or mappable fraction. See http://www.nature.com/nbt/journal/v27/n1/fig_tab/nbt.1518_T1.html This table contains the genome mappability of different common species considering reads of 30 nt. The table caption includes the mappable fraction of the human genome for different read lengths. Notably, the effective genome size of 2.7e9 corresponds to a read length of 60nt.

ADD COMMENT
2
Entering edit mode

I am not sure how the table is derived, but that is quite different from the number in my head. Excluding gaps, hg19 (chromosomes plus unplaced/unlocalized contigs) has only 2.86Gbp sequences, so 2.7e9 amounts to 94% of human genome, instead of 88.3% in the table legend. Furthermore, with 60bp single-end reads, we can hardly access 2.7Gb sequences. With 100bp paired-end, we can only confidently map reads to 94-95% of the genome.

ADD REPLY
0
Entering edit mode

Thanks for that. This is quite informative, especially because it shows that values to use for a given read size.

ADD REPLY

Login before adding your answer.

Traffic: 2804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6