I am analysing the 450K DNA methylation data from TCGA (GDC). I am new to this kind of analysis and have a basic question. Looking at the rowData of the SummarizedExperiment obtained from TCGAbiolinks (basically the CpG probe data.frame), there are a few things that I find confusing. First, a single CpG probe is mapped to the same gene multiple times, as listed in the Gene_Symbol column. I interpret these as being due to different exons of the gene. But then what should I take as the position of the CpG site w.r.t. the TSS, given that the CpG maps to the same gene but each entry has a different position?
Second, there are many CpG probes that map to more than one gene or other element. Would it be preferable to remove such CpG sites? Counting the CpG sites that map to more than one gene or other mRNA yielded more than 100K such probes.
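For anyone wanting to reproduce this kind of count, here is a minimal sketch in Python. It assumes the Gene_Symbol field holds a semicolon-separated list with one entry per annotated transcript, so a repeated symbol means several transcripts of the same gene rather than several genes. The probe IDs, gene names, and the `.` placeholder for unannotated probes are made up for illustration, and `distinct_genes` is a hypothetical helper, not part of TCGAbiolinks.

```python
# Hypothetical annotation rows: probe ID -> Gene_Symbol field (values invented).
probes = {
    "cg_example_1": "GENE_A;GENE_A;GENE_A",  # same gene listed three times
    "cg_example_2": "GENE_A;GENE_B",         # two different genes
    "cg_example_3": ".",                     # assumed placeholder for "no gene"
}


def distinct_genes(gene_symbol_field, sep=";", missing=(".", "", "NA")):
    """Return the sorted unique gene symbols found in one annotation field."""
    return sorted({g for g in gene_symbol_field.split(sep) if g not in missing})


# Collapse repeated symbols first, then count probes that still hit more than one gene.
gene_sets = {probe: distinct_genes(field) for probe, field in probes.items()}
multi_gene = [probe for probe, genes in gene_sets.items() if len(genes) > 1]

print(len(multi_gene), multi_gene)
```

Collapsing the repeats before counting matters: a probe listed three times for one gene is not a multi-gene probe, so a count taken on the raw field may overstate how many probes are ambiguous.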
Thanks in advance for any help in this regard.
Hi Kevin, thanks for the reply. I understand both points. However, the papers I came across used a gene-level methylation value for integrating expression and methylation data. One approach listed was to compute the average of all CpGs within 1500 bp of the TSS. However, since a single CpG maps to the same gene multiple times, each with a different start site, this is not as straightforward as I thought it would be (the read is getting aligned to different exons of the same gene, implying that the sequence is perhaps preserved across the different exons). Anyway, I shall try to figure it out. Thanks for the help.
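The TSS-window averaging mentioned above could be sketched roughly as follows in Python. This is only an illustration under several assumptions: Gene_Symbol and Position_to_TSS are taken to be parallel semicolon-separated lists (one entry per transcript), "within 1500 bp" is read as absolute distance on either side of the TSS, and a probe counts for a gene if it lies within the window of any of that gene's listed transcripts. The records, probe IDs, and the `gene_level_means` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (probe_id, beta, Gene_Symbol, Position_to_TSS); values invented.
records = [
    ("cg_example_1", 0.25, "GENE_A;GENE_A", "-300;5200"),
    ("cg_example_2", 0.75, "GENE_A;GENE_B", "1000;-8000"),
    ("cg_example_3", 0.50, "GENE_B", "-1200"),
    ("cg_example_4", None, "GENE_B", "100"),  # probe with a missing beta value
]


def gene_level_means(records, window=1500):
    """Average beta per gene over probes within `window` bp of any listed TSS."""
    per_gene = defaultdict(dict)  # gene -> {probe_id: beta}
    for probe_id, beta, genes, positions in records:
        if beta is None:
            continue  # skip probes without a measurement
        for gene, pos in zip(genes.split(";"), positions.split(";")):
            if abs(int(pos)) <= window:
                # Keying by probe ID counts each probe once per gene,
                # even if several transcripts of that gene are in range.
                per_gene[gene][probe_id] = beta
    return {gene: mean(betas.values()) for gene, betas in per_gene.items()}


result = gene_level_means(records)
print(result)
```

Note that this assigns the same beta value to every gene a probe is close to, and that averaging still hides where in the gene each probe sits, so the probe-level table is worth keeping alongside any gene-level summary.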
Thank you for your answer, Kevin. I have two questions. First, should I take the mean value of multiple probes targeting the same gene? Second, should I use the same beta value for each of the genes when a single probe ID targets multiple genes?
I am not sure, because methylation at different parts of a gene can have different effects. It may have to be decided on a case-by-case basis. By averaging the values, you may lose some important signal. You could check the probes in the UCSC Genome Browser to see exactly where they are targeting.
Oh yes, the beta value should be the same. I would keep these as single entities, though. So, the record could be:
Generally, in methylation studies, in my opinion, the data should always be kept at the level of probes, not genes.