Hi,
I am trying to reuse some publicly available results from a bisulfite methylation analysis. These contain mc tables which do not have strand information. The data were analysed against GENCODE v17 (hg19). In the past I generated all GC positions in the hg19 human genome using the Bismark bedGraph2cytosine module, which according to the author gives all CG positions in the genome.
I am trying to calculate regional methylation from the bisulfite results, which have values between 0 and 1 for each CpG position. My idea was to take the total exonic methylation (the sum of these values over the exon) and normalize it by the GC content of the exon. To do so, I have considered only the CpG positions on one strand (the + strand in this case).
When I divide the total exon methylation (the sum of the bisulfite ratios) by the total number of GCs in that exon, I get methylation values greater than 1 for some exons. What could be the possible reasons for this? As far as I understand, the number of GCs should remain the same even if the GENCODE version changes (correct me if I am wrong).
This is how I calculated the GC content per exon from the Bismark results, using only one strand:
coverageBed -a chr1_gcregions.txt -b chr1_exons.txt |cut -f1,2,3,4,5,6,7 > exon_gccount.txt
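For anyone who wants to sanity-check the per-exon counts independently of bedtools, here is a minimal Python sketch of the count the command above is meant to produce. It assumes 0-based, half-open BED coordinates and positions from a single chromosome and strand; the function name `count_sites_per_exon` and the toy coordinates are hypothetical, not taken from the real files.

```python
from bisect import bisect_left

def count_sites_per_exon(site_positions, exons):
    """Count how many site start positions fall inside each exon.

    site_positions: 0-based start coordinates of CpG sites (one chromosome, one strand)
    exons: list of (start, end) intervals, half-open as in BED
    """
    sites = sorted(site_positions)
    counts = []
    for start, end in exons:
        # Sites with start <= position < end lie inside the exon.
        counts.append(bisect_left(sites, end) - bisect_left(sites, start))
    return counts

# Made-up coordinates, for illustration only.
cpg_starts = [105, 130, 150, 250, 400]
exons = [(100, 200), (240, 260), (300, 350)]
print(count_sites_per_exon(cpg_starts, exons))  # [3, 1, 0]
```

Comparing these counts with the output of the bedtools command for a handful of exons is one way to confirm that the denominator is what I think it is.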
Let me know if I am doing something wrong.
Thanks
I think one possible reason is that you are using data that might have information for top and bottom strand CpG sites and then normalizing only to top strand CpGs. Are you getting some regional methylation values that are close to 2 after normalization? It would help if you could point to the data you are using.
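To illustrate the strand mismatch described above, here is a small Python sketch with made-up numbers. It assumes a toy exon with three CpG dinucleotides, each reported once per strand; the `calls` list and the `regional_methylation` helper are hypothetical and only show the arithmetic, not the layout of the actual files.

```python
# Each tuple is (strand, position, methylation ratio); values are invented.
calls = [
    ("+", 100, 0.9), ("-", 101, 0.8),
    ("+", 140, 1.0), ("-", 141, 0.9),
    ("+", 180, 0.7), ("-", 181, 0.8),
]

def regional_methylation(calls, denominator):
    """Sum of per-site ratios divided by the chosen number of sites."""
    return sum(ratio for _, _, ratio in calls) / denominator

top_only = sum(1 for strand, _, _ in calls if strand == "+")  # 3 sites
both_strands = len(calls)                                     # 6 sites

# Numerator covers both strands, denominator only the top strand: can exceed 1.
print(round(regional_methylation(calls, top_only), 2))      # 1.7
# Numerator and denominator cover the same sites: stays within [0, 1].
print(round(regional_methylation(calls, both_strands), 2))  # 0.85
```

If the summed values include both strands while the count includes only one, the result can reach up to 2, which is why values close to 2 would point to this explanation.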
If I divide them by the total of top and bottom strand CpGs, I get most values <= 0.5, with some being a little more than that.
It's actually Blueprint Epigenome data.
The files look somewhat like this:
There are values of average regional methylation (averaged over GC) that are higher than one. These are very few and look like this: