The rat genome is rarely mentioned among macs2 users, so I had to look around and do some of the work myself. I want to share the results and also ask a related question.
I used the rat genome rn6 downloaded from the UCSC bigzip file and kept all contigs that do not have standard chromosome names. The calculation used gem-indexer for base-space data and gem-mappability with kmer sizes of 45, 50, 75, 100, and 150 bases. The effective genome size values are the number of '!' characters in the .mappability files:
rn6.softmask.all_45: 2105347242
rn6.softmask.all_50: 2081721273
rn6.softmask.all_75: 2197995070
rn6.softmask.all_100: 2247394146
rn6.softmask.all_150: 2285452802
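For anyone who wants to repeat the count, here is a minimal Python sketch of the '!' tally described above. It assumes a file layout in which header sections start with "~~", each sequence record starts with a line beginning with a single "~", and the encoded mappability strings follow that line; these markers are my assumption about the .mappability format, so check them against your own files. The function name is hypothetical.

```python
def count_unique_positions(lines):
    """Count '!' characters in the encoded mappability strings.

    Assumed layout (verify against your own file):
      - lines starting with '~~' open a header section (skipped)
      - lines starting with a single '~' name a sequence
      - lines after a sequence name hold the encoded mappability
    """
    total = 0
    in_sequence = False  # True once we are inside a sequence record
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("~~"):
            in_sequence = False   # header section, ignore its contents
        elif line.startswith("~"):
            in_sequence = True    # sequence name line, no data on it
        elif in_sequence:
            total += line.count("!")
    return total
```

Because a file object iterates over its lines, the function can be called as `count_unique_positions(open(path))` on one .mappability file per kmer size. Skipping the header matters because the encoding table itself may contain a '!' character.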
I also ran some tests with a color-space index of chromosome 1 only. The numbers look greater than those from the base-space index of chromosome 1. So here are my questions.
Which number should I use: the one from the color-space index or the one from the base-space index? My BAM files were aligned with a colorspace aligner and the reads are colorspace reads.
What is the consequence of underestimating or overestimating the effective genome size for macs2 output?
The macs2 documentation does not seem to pay much attention to kmer size. What kmer size was used to calculate the values for the genomes 'supported' by macs2?
Thanks.
Agreed - you have to change the number a lot to see any great change in the results. The best way to reassure yourself is to try changing it yourself. I usually go for a figure of 75% of the total genome, moving up or down depending on, for example, known repetitive sequence content.
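As a rough illustration of the 75% rule of thumb, here is a small Python sketch that sums sequence lengths from lines in the two-column "name, length" layout of a UCSC chrom.sizes file and scales the total. The function name and the default fraction are only for illustration; adjust the fraction for your genome's repeat content.

```python
def estimate_effective_size(chrom_sizes_lines, fraction=0.75):
    """Return fraction * total genome length as an integer.

    Each input line is expected to hold a sequence name and its
    length separated by whitespace; blank lines are ignored.
    """
    total = 0
    for line in chrom_sizes_lines:
        fields = line.split()
        if len(fields) >= 2:
            total += int(fields[1])  # second column is the length
    return int(total * fraction)
```

The resulting number is the kind of value that can be passed to macs2 through its `-g` option, and rerunning with a few different fractions is an easy way to see how little the peak calls move.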