Interpreting Fractional Methylation Data
11.1 years ago
qliu2011 ▴ 40

If I wanted to binarize fractional methylation data, given WGBS data for all chromosomes along with read coverage, how would I go about it? Should I set a hard cutoff (above 0.6 fractional methylation for "methylated" and below for "unmethylated"), or should the read coverage be taken into account somehow? Obviously the binarization is not perfect, but I need it to run my computer simulation. Thanks in advance for the help.

methylation • 8.9k views
11.1 years ago
B. Arman Aksoy ★ 1.2k

Here is my take on it:

The short answer is: yes, you can define a hard threshold to binarize your methylation data and, as far as I know, the majority of methylation-related papers do this.

The long answer is: yes; you can define a threshold, but you should do this in a way that helps you explain your phenotype of interest, e.g. gene expression. In this sense, it is also important to know whether you want to work with probe-level or gene-level data.

Let's say you are working with probe-level data. Most of the time, people are interested in the effect of methylation on transcript levels. This requires you to identify which probe is most informative for a given gene, and which cut-off for the B-value (beta) best distinguishes the samples that show down-regulation of that gene from the normal ones. This threshold might be different for each gene (depending on the coverage, promoter sensitivity, CG content of that region, etc.). For example, you sometimes see hyper-methylated promoter regions (B ~ 1) for a gene that shows no differential regulation at all. In these cases, would it make sense to threshold the methylation data and call these probes/genes methylated? It depends on what you want to accomplish with your binary data.

I think whatever approach you use will be fine, as the field does not have a standard way to do things; everybody seems to be going their own way nowadays. As long as you are aware of the artifacts you might have in your pipeline, I think the simple binary approach might be the easiest way to go, but it is not necessarily the best in terms of explaining biological mechanisms and phenotypic effects.
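To make the simple binary approach concrete, here is a minimal sketch in Python. The 0.6 cutoff comes from the question; the `min_coverage` value of 10 reads and the function name `binarize_site` are assumptions for illustration, not an established standard. Sites with too few reads are left uncalled rather than forced into a class.

```python
from typing import Optional


def binarize_site(beta: float, coverage: int,
                  threshold: float = 0.6,
                  min_coverage: int = 10) -> Optional[int]:
    """Return 1 (methylated), 0 (unmethylated), or None (too few reads).

    threshold: hard cutoff on fractional methylation (0.6 as in the question).
    min_coverage: hypothetical minimum read depth needed to trust the call.
    """
    if coverage < min_coverage:
        return None  # not enough reads to make a call
    return 1 if beta > threshold else 0


# Toy (fractional methylation, read coverage) pairs, made up for illustration.
sites = [(0.92, 25), (0.10, 40), (0.75, 3), (0.60, 12)]
calls = [binarize_site(beta, cov) for beta, cov in sites]
print(calls)
```

How a simulation should treat the uncalled sites (drop them, or impute from neighbours) is a separate decision that this sketch does not make.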

Oh and you might find the following TCGA guideline useful: https://confluence.broadinstitute.org/display/GDAC/Methylation+Preprocessor


I would like to distinguish the effect of methylation on gene expression of each specific gene. So, yes, probe-level data binarization might be the best way to go. (Binarize the methylation data for each gene differently.) If I were to do this, what ways might be best for picking the threshold for each gene?


In that case, I think you don't need to binarize the methylation data at all. You can simply correlate the B value for each probe with the expression level of the gene of interest. As described in the TCGA guideline above, you should take the most anti-correlated probe, and once you have done this for every gene's expression versus its corresponding methylation probes, you can decide on the effect of methylation by looking at these correlation values.
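A rough sketch of that probe selection for one gene, in Python. It assumes Pearson correlation, which may not be the exact statistic the TCGA guideline uses; the function names and the toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def most_anticorrelated_probe(probe_betas, expression):
    """Return (probe_id, r) for the probe whose beta values are most
    negatively correlated with the gene's expression across samples."""
    scored = {p: pearson(b, expression) for p, b in probe_betas.items()}
    scored = {p: r for p, r in scored.items() if not math.isnan(r)}
    if not scored:
        raise ValueError("no probe with a defined correlation")
    best = min(scored, key=scored.get)
    return best, scored[best]


# Toy data: one gene, four samples, two probes (values are made up).
expression = [8.0, 6.0, 4.0, 2.0]
probe_betas = {
    "probe_A": [0.1, 0.3, 0.5, 0.7],  # beta rises as expression falls
    "probe_B": [0.5, 0.4, 0.6, 0.5],  # only weakly related to expression
}
probe, r = most_anticorrelated_probe(probe_betas, expression)
print(probe, round(r, 3))
```

Repeating this per gene gives one correlation value per gene, which can then be inspected directly instead of binarizing the methylation data.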

11.1 years ago

If you use Bismark, there are alignment tools for calculating percentage methylation. I think there is a minimum cutoff parameter, but I don't think it actually matters because I'm pretty sure there are read counts for methylated and unmethylated nucleotides in the final result.
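If the methylated and unmethylated read counts are available per site, the fraction and the depth can be recomputed directly, so a coverage filter can be applied afterwards. The sketch below assumes a tab-separated layout of chromosome, start, end, percent methylation, methylated count, unmethylated count; that is my understanding of Bismark's coverage output, so check it against the user guide. The `min_depth` value and the function name are hypothetical.

```python
def parse_cov_line(line, min_depth=5):
    """Parse one tab-separated line with the assumed columns:
    chrom, start, end, percent methylation, methylated count, unmethylated count.

    Returns (chrom, position, fraction_methylated, depth), or None when the
    site has fewer than min_depth reads (min_depth is an illustrative choice).
    """
    chrom, start, _end, _pct, n_meth, n_unmeth = line.rstrip("\n").split("\t")
    n_meth, n_unmeth = int(n_meth), int(n_unmeth)
    depth = n_meth + n_unmeth
    if depth < min_depth:
        return None
    return chrom, int(start), n_meth / depth, depth


# Two made-up lines in the assumed format.
lines = [
    "chr1\t10469\t10469\t80.0\t8\t2",
    "chr1\t10471\t10471\t50.0\t1\t1",
]
records = [parse_cov_line(line) for line in lines]
print(records)
```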

http://www.bioinformatics.babraham.ac.uk/projects/bismark/Bismark_User_Guide.pdf

It is not ideal for whole-genome BS-Seq, but my COHCAP algorithm uses methylated and unmethylated thresholds like you described. So, perhaps you can take a look at the paper for analysis ideas (since I think it is a good choice for targeted BS-Seq analysis):

http://nar.oxfordjournals.org/content/41/11/e117.long


Hi, so your paper quotes: "The CpG site analysis is based on the method described in Sproul et al. (44), where sites are defined as methylated if they show a percentage of methylation (beta) greater than a certain value (0.7 for cell line data, 0.3 for patient data) and sites are unmethylated if they have beta values <0.3 (by default)...We extended this algorithm to include a P-value and false-discovery rate [FDR, using the method of Benjamini and Hochberg (45)] value as cutoffs for differential expression. The method of P-value calculation varies based on the number of groups considered for the analysis (one group, two groups, three or more groups; Supplementary Table S2, Supplemental Methods)."

It appears that you are using a hard cutoff of 0.7 for considering a site as methylated. But then you state that if a site has a fractional methylation value of <0.3, it is not methylated. What happens to the values in between 0.3 and 0.7? Thanks for the help!


When working with pretty clear cell line data, there are not a lot of sites with beta values between 0.3 and 0.7. Thus, you could call the intermediate values either ambiguous or heterozygous (one methylated and one unmethylated allele). For the Illumina array data, I occasionally saw an intermediate "heterozygous" peak, but I usually only saw clear peaks > 0.7 and < 0.3. For BS-Seq, the distribution is different (but I think the bimodal peaks I saw were even sharper, making the intermediate methylation values less of an issue).
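As a small illustration of this three-way call, here is a Python sketch using the 0.7 and 0.3 cutoffs discussed above. The function name and the "intermediate" label are my own choices, not COHCAP's actual interface.

```python
def classify_beta(beta, meth_cutoff=0.7, unmeth_cutoff=0.3):
    """Three-state call: methylated above meth_cutoff, unmethylated below
    unmeth_cutoff, and 'intermediate' (ambiguous or possibly heterozygous)
    for anything in between."""
    if beta > meth_cutoff:
        return "methylated"
    if beta < unmeth_cutoff:
        return "unmethylated"
    return "intermediate"


# Made-up beta values for illustration.
labels = [classify_beta(b) for b in (0.85, 0.12, 0.50)]
print(labels)
```

For a strictly binary simulation input, the intermediate sites would still need a rule of their own, for example excluding them.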

Another option is to consider a delta beta cutoff (where I would recommend something like 0.2). However, this doesn't meet your original criteria.
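A delta beta cutoff compares two groups rather than calling a single site. The sketch below assumes delta beta means the difference in mean beta between two groups of samples at one site; the function names and toy values are hypothetical.

```python
from statistics import mean


def delta_beta(group_a, group_b):
    """Difference in mean beta between two groups of samples at one site."""
    return mean(group_a) - mean(group_b)


def is_differential(group_a, group_b, min_delta=0.2):
    """True when the absolute delta beta reaches the cutoff (0.2 as suggested above)."""
    return abs(delta_beta(group_a, group_b)) >= min_delta


# Made-up beta values for two groups at two different sites.
site_1 = is_differential([0.8, 0.9], [0.2, 0.3])    # large difference
site_2 = is_differential([0.45, 0.5], [0.4, 0.5])   # small difference
print(site_1, site_2)
```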

Note, though, that your signal distributions should look different for individual sites versus CpG islands. The discussion above was primarily about CpG site characterization prior to CpG island analysis.
