Hello everyone,
I've been trying to work with the TCGA DNA methylation data, but I'm having trouble understanding it.
The TCGA website no longer provides the biological data. The GDC website (https://gdc-portal.nci.nih.gov/) has taken its place. I was able to get clinical, mRNA and miRNA data from that site; however, I couldn't find the DNA methylation data. Is there no DNA methylation data in that portal?
Fortunately, I found another site, the Cancer Genomics Browser (https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/), where I was able to get the DNA methylation data for breast cancer (HumanMethylation450).
The DNA methylation download contains multiple files. The methylation levels are in the "genomicmatrix.txt" file, which gives one beta value per probe for each sample. The probe.txt file, on the other hand, maps each probe to its gene(s). Here is a small example of the genomicMatrix and probe files.
genomicMatrix:
sample TCGA-OL-A66H-01 TCGA-3C-AALK-01 TCGA-AC-A5EH-01
cg13332474 -0.4808 -0.2968 -0.1997
cg00651829 -0.4821 -0.2110 -0.4108
cg17027195 -0.4633 -0.4250 -0.4667
cg09868354 -0.4345 -0.3630 -0.4230
cg03050183 -0.4252 -0.3749 0.1269
cg01989731 NA NA NA
cg06819656 0.4028 0.3047 0.3755
cg04244851 0.4398 0.3894 0.2533
cg19669385 -0.1353 0.3650 0.0664
cg04244855 0.4292 0.4008 0.2468
cg17689707 -0.4842 0.0109 -0.2484
cg04244857 -0.0918 0.2731 -0.0084
cg02434381 -0.4443 -0.4273 -0.4175
cg05777492 -0.4595 -0.4780 -0.4786
cg23340034 0.0933 0.3611 0.4120
cg26361545 0.4339 0.4389 0.4348
cg10609310 0.2913 0.0337 -0.1307
When looking at the genomicmatrix.txt file, I see several negative and NA values. I thought of disregarding them. Among the positive values, I do not find any value above 0.8, i.e. no hypermethylation values. Why?
Probe:
id gene chrom chromStart chromEnd strand
cg00035864 TTTY18 chrY 8613009 8613010 .
cg13275322 WAS chrX 48426764 48426765 .
cg13798679 chr1 36390157 36390158 .
cg13799227 chr1 226719204 226719205 .
cg13799302 CYP2J2 chr1 60164980 60164981 .
cg13799671 CD58 chr1 116881090 116881091 .
cg13805052 MORN1,LOC100129534 chr1 2272923 2272924 .
Here I consider only genes that are not on chromosomes X and Y. I noticed that some probes are associated with more than one gene; in that case I thought of taking the median of the methylation values as the final gene methylation level.
I was thinking of converting these files into a single file with the following layout:
gene | beta value | sample
CD58 | 0.4 | TCGA-OL-A66H-01
Could someone please help me with these questions?
Thank you for your attention.
Assuming I am interpreting the data correctly (the values range from about -0.5 to 0.5, so they look like beta values shifted down by 0.5), I would just add 0.5 to each value to get the real beta values. To double check, any female normal sample should have a substantial number of beta values at around 0.5 after conversion.
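As a rough sketch of that conversion, the snippet below reads a few rows copied from the genomicMatrix excerpt in the question and adds 0.5 to every value. It assumes the values really are beta values minus 0.5, as guessed above, and it parses the excerpt as whitespace-separated text; the real file may be tab-delimited, and the `to_beta` helper name is made up for illustration.

```python
import io

import pandas as pd

# A few rows copied from the genomicMatrix excerpt in the question.
GENOMIC_MATRIX = """sample TCGA-OL-A66H-01 TCGA-3C-AALK-01 TCGA-AC-A5EH-01
cg13332474 -0.4808 -0.2968 -0.1997
cg01989731 NA NA NA
cg06819656 0.4028 0.3047 0.3755
"""


def to_beta(shifted, offset=0.5):
    """Undo the ASSUMED -0.5 shift; missing values (NA) stay missing."""
    return shifted + offset


# "NA" is parsed as a missing value by default; the first column is the probe ID.
shifted = pd.read_csv(io.StringIO(GENOMIC_MATRIX), sep=r"\s+", index_col=0)
beta = to_beta(shifted)
print(beta.round(4))
```

If the assumption holds, every converted value should fall between 0 and 1, which is a quick sanity check to run on the full matrix before doing anything else with it.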
Summarizing probe-level beta values to genes is a complicated scientific question, and I don't think there is a widely accepted "best" solution.
A simple and useful, though naive and certainly not the best, approach is to average the beta values of all probes annotated to a particular gene.
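Here is a minimal sketch of that naive averaging, producing the long "gene | beta value | sample" table asked about in the question. The probe IDs, gene names, sample names and beta values are all invented toy data, not TCGA values, and `gene_level_mean` is a hypothetical helper. It also makes two choices that are assumptions rather than established practice: a probe annotated to several comma-separated genes counts towards each of them, and probes with no gene are dropped.

```python
import pandas as pd

# Toy beta values, already on the 0-1 scale. All names and numbers are invented.
beta = pd.DataFrame(
    {"S1": [0.2, 0.4, 0.9, 0.5], "S2": [0.1, 0.3, None, 0.7]},
    index=["probe_1", "probe_2", "probe_3", "probe_4"],
)

# Toy annotation: probe_3 maps to two genes, probe_4 has no gene.
probes = pd.DataFrame(
    {
        "id": ["probe_1", "probe_2", "probe_3", "probe_4"],
        "gene": ["GENE_A", "GENE_A", "GENE_A,GENE_B", None],
    }
)


def gene_level_mean(beta, probes):
    """Average probe beta values per gene and return a long-format table."""
    # One row per (probe, gene) pair; probes without a gene are dropped.
    pairs = probes.dropna(subset=["gene"]).copy()
    pairs["gene"] = pairs["gene"].str.split(",")
    pairs = pairs.explode("gene")
    # Attach beta values, then average over the probes of each gene.
    # Missing values are skipped by mean().
    merged = pairs.merge(beta, left_on="id", right_index=True)
    per_gene = merged.drop(columns="id").groupby("gene").mean()
    # Reshape to long format: gene | beta | sample.
    long = per_gene.reset_index().melt(
        id_vars="gene", var_name="sample", value_name="beta"
    )
    return long.dropna(subset=["beta"])[["gene", "beta", "sample"]]


long_table = gene_level_mean(beta, probes)
print(long_table)
```

Swapping `.mean()` for `.median()` gives the median-based summary mentioned in the question; neither choice is "the" correct one.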