I am doing a study of bisulfite sequencing data from a bunch of samples, and I'm interested in looking into epialleles, epipolymorphisms, and measures of DNA methylation disorder.
One thing I keep seeing in papers is that every study seems to limit its analysis to epialleles of 4 CpGs; even when the reads cover more than 4 CpGs, they are trimmed down to 4.
I was wondering why they don't analyze longer epialleles when the data allow it. I understand that getting a true measure of the epiallele composition of a locus with >4 CpGs requires higher sequencing coverage, because the number of possible epialleles is larger. But if you see a locus of 6 CpGs (and therefore 2^6 = 64 possible epialleles) that consistently shows only 2 epialleles (say CCCCCT and CCCCCC), each covered at 30x, isn't it reasonable to say that this locus is tightly regulated, since you only see 2 of the 64 possible epialleles?
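To make the 6-CpG example concrete, here is a minimal Python sketch that counts possible versus observed epialleles and computes epipolymorphism as 1 minus the sum of squared epiallele frequencies. The function name `epiallele_summary` and the read encoding (one `C`/`T` character per CpG, `C` meaning methylated after bisulfite conversion) are assumptions made for illustration, not taken from any particular tool.

```python
from collections import Counter


def epiallele_summary(reads):
    """Summarize epiallele composition for one locus.

    reads: list of equal-length strings, one character per CpG
           ('C' = methylated, 'T' = unmethylated); hypothetical encoding.
    """
    n_cpgs = len(reads[0])
    if any(len(r) != n_cpgs for r in reads):
        raise ValueError("all reads must cover the same number of CpGs")

    counts = Counter(reads)            # reads supporting each epiallele
    total = sum(counts.values())       # total coverage at the locus

    # Epipolymorphism: probability that two reads drawn at random
    # (with replacement) carry different epialleles.
    epipolymorphism = 1.0 - sum((c / total) ** 2 for c in counts.values())

    return {
        "n_cpgs": n_cpgs,
        "possible": 2 ** n_cpgs,       # every CpG is either C or T
        "observed": len(counts),
        "epipolymorphism": epipolymorphism,
    }


# The example from the question: two epialleles, each at 30x.
reads = ["CCCCCT"] * 30 + ["CCCCCC"] * 30
summary = epiallele_summary(reads)
print(summary)
```

For this input the sketch reports 2 observed epialleles out of 64 possible and an epipolymorphism of 0.5, which is the value any locus with two equally frequent epialleles would give regardless of how many CpGs it spans.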
I guess I am just wondering why studies of methylation disorder (methylation heterogeneity) limit their analyses to 4-CpG blocks when they could calculate similar statistics for longer epialleles. Any insights on this topic would be helpful, because I have found literally zero explanation in the papers I've read.
I understand that it is difficult to find regions with >4 contiguous CpGs, but I don't understand why they specifically exclude regions that DO have >4 contiguous CpGs. I think it may have something to do with the statistics having different biases for different epiallele lengths, but I haven't quite figured it out yet. Do you have any ideas on why they intentionally trim longer alleles down to 4 CpGs?
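One way to probe the length-bias suspicion is to look at how the attainable range of a disorder statistic changes with the number of CpGs and with coverage. The sketch below does this for epipolymorphism (1 minus the sum of squared epiallele frequencies) only; the function name `max_epipolymorphism` is hypothetical, and this illustrates one possible source of length dependence rather than the reason any given study gives for trimming.

```python
def max_epipolymorphism(n_cpgs, coverage=None):
    """Largest epipolymorphism attainable for an n-CpG window.

    coverage=None: unlimited reads, so the maximum is reached when all
                   2**n_cpgs epialleles are equally frequent.
    coverage=N:    only N reads are available, so the maximum is reached
                   when the reads are spread as evenly as possible.
    """
    n_patterns = 2 ** n_cpgs
    if coverage is None:
        return 1.0 - 1.0 / n_patterns

    # Spread `coverage` reads over `n_patterns` epialleles as evenly as
    # possible: `extra` epialleles get base + 1 reads, the rest get base.
    base, extra = divmod(coverage, n_patterns)
    sum_sq = extra * (base + 1) ** 2 + (n_patterns - extra) * base ** 2
    return 1.0 - sum_sq / coverage ** 2


for n in (4, 6):
    print(
        n,
        "CpGs: theoretical max =", max_epipolymorphism(n),
        "| max at 30x =", round(max_epipolymorphism(n, coverage=30), 4),
    )
```

Under these assumptions the theoretical ceiling is 0.9375 for 4 CpGs and 0.984375 for 6 CpGs, so values from windows of different lengths are not on the same scale. At 30x a 4-CpG window can come close to its ceiling, while a 6-CpG window cannot exceed 1 - 1/30 because 30 reads can show at most 30 of the 64 possible epialleles.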