I am trying to identify genomic windows in which the sequence is enriched in CpHpG or CpHpH motifs, where H is any base other than G (i.e. A, T or C).
I have already done this for the di-nucleotide motif CpG by setting a threshold on the Obs/Exp ratio, as in EMBOSS's cpgplot, where:
Observed = Count of CpG di-nucleotides in a window
Expected = (number of C's * number of G's) / window length
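For concreteness, the CpG calculation above can be sketched in Python roughly as follows. The function name cpg_obs_exp is just an illustrative choice, and returning 0.0 when Expected is zero is an assumption made here for convenience, not necessarily what cpgplot does.

```python
def cpg_obs_exp(window: str) -> float:
    """Obs/Exp ratio for CpG in one window, using the formula above."""
    seq = window.upper()
    n = len(seq)
    # "CG" cannot overlap itself, so str.count gives the exact number of CpGs.
    observed = seq.count("CG")
    # Expected = (number of C's * number of G's) / window length
    expected = seq.count("C") * seq.count("G") / n if n else 0.0
    # Assumption: report 0.0 when the window has no C or no G (or is empty).
    return observed / expected if expected else 0.0
```

A window would then be flagged when the returned ratio exceeds the chosen threshold.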
It is easy enough to count the number of CpHpG motifs in a given window, as these cannot overlap. However, I'm not sure how best to calculate the number of CpHpG motifs that would be expected by chance in a sequence of known length, given the proportion of each base.
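The counting step (the Observed part) might look something like the sketch below. The names CHG, CHH and count_motifs are hypothetical, only the given strand is scanned, and ambiguous characters such as N simply never match. A lookahead is used so that CpHpH hits, which unlike CpHpG can overlap one another (e.g. in "CCCC"), are all counted.

```python
import re

# H = A, C or T. The zero-width lookahead reports every start position,
# so overlapping CHH occurrences are each counted once.
CHG = re.compile(r"(?=C[ACT]G)")
CHH = re.compile(r"(?=C[ACT][ACT])")

def count_motifs(window: str) -> tuple[int, int]:
    """Return (number of CHG motifs, number of CHH motifs) on the given strand."""
    seq = window.upper()
    return len(CHG.findall(seq)), len(CHH.findall(seq))
```

For CpHpG the lookahead is not strictly needed, since those motifs cannot overlap, but using the same pattern style for both keeps the two counts comparable.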
Any suggestions appreciated!