Question

Why is a sliding window applied to CpG methylation level calculation?

0

9.0 years ago

hxlei613 ▴ 100

In the article "The DNA methylation landscape of human early embryos", the authors mention a 100-bp-tile-based DNA methylation calling algorithm (they used RRBS to detect 5mC/5hmC).

The algorithm is described like this: first, the genome is binned into consecutive 100-bp tiles. The number of reported Cs, divided by the total number of reported Cs and Ts captured in a 100-bp tile, is interpreted as the averaged DNA methylation level of that tile. The DNA methylation level of each sample is the average over its 100-bp tiles.
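To make the description concrete, here is a minimal Python sketch of the tile calculation next to a plain per-cytosine average. The input layout (tuples of chromosome, 0-based position, C count, T count) and all function names are my own assumptions for illustration; this is not the paper's actual pipeline.

```python
from collections import defaultdict

TILE_SIZE = 100  # bp, as in the description above


def tile_methylation(calls, tile_size=TILE_SIZE):
    """Pool C and T read counts per consecutive tile.

    calls: iterable of (chrom, pos, n_c, n_t), one entry per cytosine,
    with pos 0-based (hypothetical input layout).
    Returns {(chrom, tile_start): C / (C + T)} for tiles with coverage.
    """
    counts = defaultdict(lambda: [0, 0])  # [C reads, C + T reads]
    for chrom, pos, n_c, n_t in calls:
        key = (chrom, (pos // tile_size) * tile_size)
        counts[key][0] += n_c
        counts[key][1] += n_c + n_t
    return {k: c / total for k, (c, total) in counts.items() if total > 0}


def sample_methylation(tile_levels):
    """Sample-level value: unweighted mean over covered tiles."""
    return sum(tile_levels.values()) / len(tile_levels)


def per_site_mean(calls):
    """Alternative from the question: average every cytosine's own level."""
    levels = [c / (c + t) for _, _, c, t in calls if c + t > 0]
    return sum(levels) / len(levels)


calls = [
    ("chr1", 10, 9, 1),   # tile starting at 0, 10 reads
    ("chr1", 40, 1, 0),   # tile starting at 0, a single read
    ("chr1", 150, 2, 8),  # tile starting at 100, 10 reads
]
tiles = tile_methylation(calls)
tile_based = sample_methylation(tiles)
site_based = per_site_mean(calls)
```

In this toy input the cytosine covered by a single read counts as a full 100% data point in the per-site mean, whereas inside its tile it only contributes one read out of eleven. Note also that these tiles are consecutive and non-overlapping, so strictly speaking this is binning rather than a sliding window.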

Why can't we just average the methylation level of every C? What's the advantage of a sliding window?

Thank you :)

methylation • 3.6k views

ADD COMMENT • link updated 9.0 years ago by natasha.sernova ★ 4.0k • written 9.0 years ago by hxlei613 ▴ 100

score 1 · Answer 1 · 2016-08-15

1

9.0 years ago

natasha.sernova ★ 4.0k

It's a tradition.

See this paper:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592415/

“Sliding window is a traditional method for pre-defined regions that are arbitrarily chosen and not taken the actual methylation status of CpGs into consideration.”

or this one:

https://www.bioconductor.org/packages/release/bioc/vignettes/MethTargetedNGS/inst/doc/MethTargetedNGS.pdf

In section 1.3, "Methylation Entropy", a sliding window is also used:

"This function return vector of methylation entropy values using sliding window of 4."

ADD COMMENT • link 9.0 years ago by natasha.sernova ★ 4.0k

0

I found the BSmooth (http://www.ncbi.nlm.nih.gov/pubmed/23034175) paper provides a justification for the use of smoothing:

This has led most WGBS studies to employ a high coverage design since even 30× coverage yields standard errors as large as 0.09. However, various authors have noted that methylation levels are strongly correlated across the genome [24,25]. Furthermore, functionally relevant findings are generally associated with genomic regions rather than single CpGs, either CpG islands [26], CpG island shores [27], genomic blocks [1], or generic 2 kb regions [3].

They then concluded the following:

Using this method [BSmooth] on data with 4× coverage, we achieved precision comparable to deeper coverage without smoothing.

So my guess is that one answer could be that smoothing/windowing allows lower-coverage sequencing while still keeping the standard error of the (averaged/smoothed) DNA methylation level low. This of course comes at the cost of resolution at individual CpGs.
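A rough back-of-the-envelope check of that trade-off, assuming a simple binomial model for the reads at a CpG and the worst case of a true methylation level of 0.5 (both are my assumptions, and the function name is made up). BSmooth itself uses local likelihood smoothing rather than the naive pooling shown here, so this only illustrates the direction of the effect.

```python
import math


def binomial_se(p, n_reads):
    """Standard error of a methylation estimate from n_reads independent
    reads under a binomial model (illustrative assumption)."""
    return math.sqrt(p * (1 - p) / n_reads)


# Single CpG at 30x coverage, worst case p = 0.5: about 0.09,
# in line with the figure quoted from the BSmooth paper.
se_30x = binomial_se(0.5, 30)

# Single CpG at 4x coverage: 0.25, far too noisy on its own.
se_4x = binomial_se(0.5, 4)

# Pooling 10 neighbouring CpGs at 4x each (40 reads), assuming they share
# the same true level: about 0.08, comparable to a single CpG at 30x.
se_window = binomial_se(0.5, 4 * 10)
```

The pooled estimate is only valid to the extent that neighbouring CpGs really have similar methylation levels, which is where the loss of single-CpG resolution comes from.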

ADD REPLY • link 9.0 years ago by Collin ▴ 1000

0

My guess is that they once had a dataset with either low coverage or a lot of noise. The sliding window would allow you to handle that and still assign values to focal regions/points. There's no other good reason that I know of to do this and it's not something I would personally do by default.

ADD REPLY • link 9.0 years ago by Devon Ryan 105k