Hi,
I have a question about TCGA methylation 450K data.
Looking at the TCGA methylation data, the Level 2 files have values for every probe, but many probes in the Level 3 beta values are NA (e.g., cg00000108, cg00000109).
Level 2
Composite Element REF  Methylated_Intensity  Unmethylated_Intensity  Detection_P_value
cg00000029             2488.00579881129      2281.3142892634         0
cg00000108             8943.62421381116      336.745081332759        0
cg00000109             3827.0493383932       219.47270455192         0
cg00000165             263.820225926362      2355.4623873349         0
cg00000236             3733.92206994152      722.124674419151        0
Level 3
Composite Element REF  Beta_value         Gene_Symbol  Chromosome  Genomic_Coordinate
cg00000029             0.521668865344633  RBL2         16          53468112
cg00000108             NA                 C3orf35      3           37459206
cg00000109             NA                 FNDC3B       3           171916037
cg00000165             0.100722321673368               1           91194674
cg00000236             0.837944995677383  VDAC3        8           42263294
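For the probes that do have a Level 3 value, the beta value looks like the plain ratio M / (M + U) of the two Level 2 intensities, with no offset added. That is only an inference from the five rows above, not a documented TCGA formula. A minimal Python sketch of the check, using just the numbers pasted here (the function names are made up for illustration):

```python
# Level 2 intensities copied from the post: probe -> (methylated, unmethylated)
level2 = {
    "cg00000029": (2488.00579881129, 2281.3142892634),
    "cg00000108": (8943.62421381116, 336.745081332759),
    "cg00000109": (3827.0493383932, 219.47270455192),
    "cg00000165": (263.820225926362, 2355.4623873349),
    "cg00000236": (3733.92206994152, 722.124674419151),
}

# Level 3 beta values copied from the post; None stands for NA
level3 = {
    "cg00000029": 0.521668865344633,
    "cg00000108": None,
    "cg00000109": None,
    "cg00000165": 0.100722321673368,
    "cg00000236": 0.837944995677383,
}


def beta_value(meth, unmeth):
    """Assumed formula: beta = M / (M + U), without an offset term."""
    return meth / (meth + unmeth)


def compare(level2, level3, tol=1e-6):
    """Return (probe, computed beta, reported beta, status) for each probe."""
    rows = []
    for probe, (meth, unmeth) in level2.items():
        computed = beta_value(meth, unmeth)
        reported = level3[probe]
        if reported is None:
            status = "masked"  # Level 3 is NA although intensities exist
        elif abs(computed - reported) <= tol:
            status = "match"
        else:
            status = "differs"
        rows.append((probe, computed, reported, status))
    return rows


for probe, computed, reported, status in compare(level2, level3):
    print(probe, round(computed, 6), reported, status)
```

If the assumed formula holds, the two NA probes have perfectly ordinary intensities and a computable beta value, so the NA would have to be assigned in a later masking step rather than coming from a failed measurement.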
There are so many NAs and I wonder why.
I thought they were filtered out because of detection p-value, but when I downloaded the IDAT files and calculated the detection p-values myself, they were all below 0.01. So they were not filtered out because of detection p-value.
Additionally, they are not on chrX/Y, they are not SNP probes, and they are not cross-reactive probes.
There are ~90k NAs per sample. Almost 1/5 of 450k.
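For anyone who wants to reproduce the count, here is a rough Python sketch. It assumes a tab-delimited Level 3 file with a single header row and the column names shown above; the helper name `count_na` is made up, and the example text is just the five rows from this post re-typed with tabs.

```python
import csv
import io


def count_na(handle, value_column="Beta_value", na_token="NA"):
    """Return (number of NA rows, total rows) for one Level 3 table."""
    reader = csv.DictReader(handle, delimiter="\t")
    n_na = 0
    n_total = 0
    for row in reader:
        n_total += 1
        if row[value_column] == na_token:
            n_na += 1
    return n_na, n_total


# The five Level 3 rows from the post, re-typed as tab-delimited text.
example = (
    "Composite Element REF\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\n"
    "cg00000029\t0.521668865344633\tRBL2\t16\t53468112\n"
    "cg00000108\tNA\tC3orf35\t3\t37459206\n"
    "cg00000109\tNA\tFNDC3B\t3\t171916037\n"
    "cg00000165\t0.100722321673368\t\t1\t91194674\n"
    "cg00000236\t0.837944995677383\tVDAC3\t8\t42263294\n"
)

n_na, n_total = count_na(io.StringIO(example))
print(f"{n_na} of {n_total} probes are NA ({n_na / n_total:.1%})")
```

On a real download you would pass an open file handle instead of the `io.StringIO` object. If the file has an extra header line above the column names, that line needs to be skipped first.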
Why are there so many NAs in the 450k methylation beta data?
And does anyone know how they normalized the data from the raw IDAT files?
I searched hard but couldn't find anything.
I already checked the detection p-values (as mentioned in the question), and the NAs were not produced by that filter. It is also not a problem with just a few samples: all samples in the TCGA COAD methylation beta values have ~90k NAs, about 1/5 of the 450K probes.
I think I've only checked the missing values for the BRCA 450k data.
It's possible that some batches had bigger problems than others, but it is hard for me to say for certain.
Are the missing probes random, or do certain probes tend to be missing more often than others?
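A rough way to check that, assuming the per-sample beta values have already been loaded into a dictionary keyed by sample. The function names, the sample names, and the toy values below are all made up for illustration; only the probe IDs come from the question.

```python
from collections import Counter


def na_fraction_per_probe(samples):
    """samples: {sample_id: {probe_id: beta or None}}.

    Returns {probe_id: fraction of samples in which the probe is NA}.
    """
    n_missing = Counter()
    n_seen = Counter()
    for betas in samples.values():
        for probe, beta in betas.items():
            n_seen[probe] += 1
            if beta is None:
                n_missing[probe] += 1
    return {probe: n_missing[probe] / n_seen[probe] for probe in n_seen}


def summarize(fractions):
    """Group probes by whether they are NA in all, some, or no samples."""
    groups = {"always": [], "sometimes": [], "never": []}
    for probe, fraction in sorted(fractions.items()):
        if fraction == 1.0:
            groups["always"].append(probe)
        elif fraction == 0.0:
            groups["never"].append(probe)
        else:
            groups["sometimes"].append(probe)
    return groups


# Toy data: hypothetical samples with invented beta values (None = NA).
toy = {
    "sample_A": {"cg00000029": 0.52, "cg00000108": None,
                 "cg00000109": None, "cg00000165": 0.10},
    "sample_B": {"cg00000029": 0.48, "cg00000108": None,
                 "cg00000109": None, "cg00000165": None},
    "sample_C": {"cg00000029": 0.55, "cg00000108": None,
                 "cg00000109": None, "cg00000165": 0.12},
}

groups = summarize(na_fraction_per_probe(toy))
for name, probes in groups.items():
    print(name, probes)
```

If nearly all of the NAs land in the "always" group, that would point to a fixed list of masked probes; a large "sometimes" group would look more like per-sample quality filtering.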
If I look at APC in the Xena Browser for GDC TCGA Colon Cancer (COAD) (or TCGA Colon Cancer (COAD)), a very large portion of the 450k arrays are flagged as missing. I think that is actually closer to 1/3 than 1/5 missing/filtered samples, but I don't know whether a noticeable number of samples simply didn't have 450k arrays for that cancer type (although I think that part could be determined through the GDC). I don't think TCGA does quality filtering on what they make available (so that other people can make their own assessments).
So, with a quick assessment, I think that could match what you are saying about needing to remove a lot of COAD 450k arrays due to quality filters.
I also apologize that I am not answering your question about the alternative causes for the missing probes, but I think this is a good question.
I also asked GDC about this, but they didn't know the details either. I'm just guessing that there was some kind of filtering process, but I don't know what exactly it was. It remains a mystery.