Hello Biostars Community,
I am trying to see how similar or different my samples are, and whether they cluster into their sample groups with PCA, but I am unsure whether centering and scaling should both be done. I think centering should be done, but I am unsure about scaling.
I would think that maybe scaling shouldn't be done, since all the values are on the same beta value scale of 0 to 1, with most of the values being either 0 or 1 and some falling in between.
The distribution looks something like the black All line below:
The PCA clustering does "look better" with both centering and scaling, meaning the samples fall into their respective groups more cleanly than with centering alone (without scaling).
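A minimal sketch of the two variants being compared, in Python with NumPy. The data are simulated, and the helper name `pca_scores`, the group sizes, and the size of the group difference are all assumptions for illustration, not the actual analysis:

```python
import numpy as np

def pca_scores(beta, scale=False, n_components=2):
    """PCA of a samples x probes matrix via SVD (hypothetical helper)."""
    X = np.asarray(beta, dtype=float)
    X = X - X.mean(axis=0)                 # centering: subtract each probe's mean
    if scale:
        sd = X.std(axis=0, ddof=1)         # per-probe standard deviation
        keep = sd > 0                      # constant probes cannot be scaled
        X = X[:, keep] / sd[keep]
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)        # fraction of variance per component
    return scores, explained[:n_components]

# Simulated beta values: 20 samples x 500 probes, bimodal per-probe baselines.
rng = np.random.default_rng(0)
n_per_group, n_probes = 10, 500
base = rng.beta(0.5, 0.5, size=n_probes)
beta = base + rng.normal(0, 0.03, size=(2 * n_per_group, n_probes))
beta[n_per_group:, :50] += 0.2             # assumed group difference in 50 probes
beta = np.clip(beta, 0.0, 1.0)

scores_centered, var_centered = pca_scores(beta, scale=False)
scores_scaled, var_scaled = pca_scores(beta, scale=True)
print("centered only:      ", np.round(var_centered, 3))
print("centered and scaled:", np.round(var_scaled, 3))
```

Plotting the first two columns of each score matrix, colored by sample group, gives the two pictures to compare side by side.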
After watching some StatQuest videos on PCA, I am pretty sure centering should be done. I am just unsure whether scaling should be done, and whether the distribution of the data matters.
Image obtained from: https://zhou-lab.github.io/sesame/dev/supplemental.html#Quality_Control
Thank you in advance.
- Pratik
It's hard for me to imagine that scaling is important for methylation data. Scaling is simply there to control weights, and it's hard to believe that probes with less variance should get the same weight as those with larger variance.
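To make the weighting point concrete, here is a small sketch in Python with NumPy. The beta values are made up for illustration: probe A separates two groups of samples, while probe B is nearly constant.

```python
import numpy as np

# Made-up beta values (6 samples x 2 probes), for illustration only.
# Column 0 = probe A (differs between groups), column 1 = probe B (nearly constant).
beta = np.array([
    [0.10, 0.49],
    [0.15, 0.50],
    [0.12, 0.51],
    [0.90, 0.50],
    [0.85, 0.49],
    [0.88, 0.51],
])

centered = beta - beta.mean(axis=0)                  # centering only
scaled = centered / centered.std(axis=0, ddof=1)     # centering + unit variance

def variance_share(X):
    """Fraction of the total variance contributed by each probe."""
    v = X.var(axis=0, ddof=1)
    return v / v.sum()

share_centered = variance_share(centered)
share_scaled = variance_share(scaled)
print("share of variance, centered only:      ", share_centered)
print("share of variance, centered and scaled:", share_scaled)
```

Without scaling, probe A carries almost all of the variance that PCA sees. After scaling, the nearly constant probe B counts exactly as much as probe A.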
That being said, is it possible that this is an artifact of probe design? Say the chip has fewer probes in regions with less variation and more in regions with more variation, so those highly variable regions are already over-represented.
Thank you for taking the time to reply, Zhenyu Zhang.
The beta values for the DMRs look more or less uniform across sample groups. That is, where there is a DMR, all samples within the sample group show the same/very similar methylation beta value. There are just a few DMRs that I saw where a few samples show different/variable methylation from the rest of the sample group.
What would you do:
to scale (the two groups on the left [red and green] are matched normal and tumor samples; the two groups on the right [blue and purple] are also matched normal and tumor samples):
or not to scale: