I'm trying to segment methylation data into lowly-methylated regions such as DNA methylation valleys (DMVs), and most papers I've read cite Xie et al. (2013) when mentioning DMVs. In that paper, they say these regions were obtained using the same method as Stadler et al. (2011), and in _that_ paper it says "Segmentation was performed using the R package RHmm with a three-state HMM corresponding to fully methylated, low-methylated and unmethylated CpGs". Based on that, I went here:
https://rdrr.io/rforge/RHmm/man/
And applied `HMMFit` with `nStates=3` to my 1-D array of CpG methylation values, genome-wide (perhaps at this point I'm already going the wrong way with this?). From this I get three mean values, but no well-defined array of the "states" of the loci; hence the next step (I think) is to run `viterbi()` on the output of `HMMFit`, but this is always crashing with the error message
Error: protect(): protection stack overflow
This happens even with a relatively small array of 10,000 values and on a quite powerful machine. It's odd, because I would have assumed that this algorithm simply assigns states one locus at a time based on the fitted HMM, so why should that eat up so much memory? In any case, there's just no chance that this could run on whole-genome data.
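To make concrete what I expected the decoding step to do, here is a minimal sketch of Viterbi decoding for a three-state Gaussian HMM. It is plain Python, not RHmm, and it says nothing about how RHmm implements `viterbi()` internally. The means, standard deviations, and transition probabilities are made-up illustrative values, not parameters fitted to my data, and the function names `viterbi` and `segments` are just my own. The only thing stored per position is one back-pointer per state, so memory should grow linearly with the number of CpGs.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, log_start, log_trans, means, sds):
    """Most likely state path for a Gaussian-emission HMM (log space)."""
    if not obs:
        return []
    n_states = len(means)
    # Best log-score of any path ending in each state at the current position.
    score = [log_start[k] + gaussian_logpdf(obs[0], means[k], sds[k])
             for k in range(n_states)]
    backptr = []  # one list of n_states predecessors per later position
    for x in obs[1:]:
        new_score, ptr = [], []
        for k in range(n_states):
            prev = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][k]
                             + gaussian_logpdf(x, means[k], sds[k]))
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def segments(path):
    """Collapse a state path into (first_index, last_index, state) runs."""
    out, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            out.append((start, i - 1, path[start]))
            start = i
    return out


# Hypothetical parameters: state 0 = unmethylated, 1 = low, 2 = fully methylated.
means = [0.05, 0.30, 0.90]
sds = [0.05, 0.10, 0.10]
log_start = [math.log(1 / 3)] * 3
# "Sticky" transitions: 0.98 to stay in a state, 0.01 to move to each other state.
log_trans = [[math.log(0.98 if i == j else 0.01) for j in range(3)]
             for i in range(3)]

# Toy methylation values: methylated, then a short unmethylated stretch, then methylated.
toy_obs = [0.9] * 5 + [0.02] * 5 + [0.85] * 5
toy_path = viterbi(toy_obs, log_start, log_trans, means, sds)
print(segments(toy_path))
```

In this sketch the runs returned by `segments` are index ranges into the input array; mapping them back to genomic coordinates (and handling chromosome boundaries) would be a separate step.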
So that's where I'm stuck: I see no realistic way of annotating DMVs (or UMRs/LMRs) de novo in WGBS data. Is there a newer method? (The papers above are quite old, and yet they continue to be cited.) Is more modern, memory-friendly software available?
Thanks in advance for any help or advice you can offer.