Hello! I'm a tech who's starting a Ph.D. in August, so I am very new to bioinformatics. I'm wondering if anyone has some guidance on choosing min. reads per cytosine when calling DMRs.
I am using DMRcaller on WGBS data. I have 3 biological replicates per condition, which I am pooling and analyzing together.
In the documentation/example code, I notice they set the min. reads per cytosine = 4. Is this number arbitrary? Or is there something in the data that informs you of what the minimum number of reads should be? I have the mean/median coverages & coverage distributions for my samples, but I'm not sure whether these can guide the choice.
In other DMR analysis tools, I've seen the default set equal to 10, or 3, etc. and I can't find a clear answer on where these numbers are coming from.
Thanks in advance for any advice!
IMO it is usually pretty arbitrary. Because methylation is computed as the ratio of C reads / total reads, the total reads per cytosine define the resolution and confidence of your methylation values: with 4 reads a site can only take the values 0, 0.25, 0.5, 0.75 or 1, while with 10 reads it moves in steps of 0.1. If you are interested in single CpG sites, you may want more resolution than when calling bigger DMRs. I usually try to set a minimum coverage of at least 10, though you will have to check how many sites you lose at that threshold. Some packages, such as DSS (which I have never used), take read depth into account when computing DMRs, so you don't have to pre-filter. Other packages, such as bsseq, try to compensate for sparse coverage of cytosine sites by smoothing.
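To make the trade-off concrete, here is a rough sketch in plain Python (not DMRcaller code) of the three quantities involved: the step size of the methylation ratio at a given depth, an approximate 95% confidence interval for that ratio, and the fraction of sites that survive a coverage filter. The function names and the example coverage values are made up for illustration, and the Wilson score interval is just one common choice for a binomial proportion, not what any particular DMR tool uses internally.

```python
import math


def methylation_resolution(n_reads):
    """Smallest possible step between methylation levels at this depth."""
    return 1.0 / n_reads


def wilson_interval(meth_reads, total_reads, z=1.96):
    """Approximate 95% Wilson score interval for methylated / total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    p = meth_reads / total_reads
    denom = 1 + z ** 2 / total_reads
    centre = (p + z ** 2 / (2 * total_reads)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total_reads + z ** 2 / (4 * total_reads ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fraction_retained(coverages, min_reads):
    """Fraction of cytosines whose coverage is at least min_reads."""
    return sum(c >= min_reads for c in coverages) / len(coverages)


if __name__ == "__main__":
    # Hypothetical per-cytosine coverages, purely illustrative.
    example_coverages = [2, 4, 8, 12, 15]

    for min_reads in (4, 10):
        # Interval for a site that is 50% methylated at exactly this depth.
        low, high = wilson_interval(min_reads // 2, min_reads)
        print(
            f"min reads {min_reads}: "
            f"step {methylation_resolution(min_reads):.2f}, "
            f"95% CI at 50% methylation [{low:.2f}, {high:.2f}], "
            f"sites kept {fraction_retained(example_coverages, min_reads):.0%}"
        )
```

Running `fraction_retained` on your own coverage distribution for a few candidate thresholds is probably the most useful part: it shows how many cytosines each cutoff costs you, which you can weigh against the narrower interval you get at higher depth.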