I would like to seek your expert opinions regarding the use of principal component analysis (PCA) in The Cancer Genome Atlas (TCGA) dataset. Specifically, I would like to discuss the issue of sample separation in the PCA plot obtained after normalizing the COAD data using limma or DESeq2. As you may have observed, some samples are separable while others are not, possibly due to factors such as heterogeneity. In this regard, I would appreciate your insights on whether we should exclude these non-separable samples from further processing or include them in our analysis.
No, you should definitely not remove samples on the basis that they do not separate between conditions on a PCA. Doing so would artificially reduce the estimated variance between samples, artificially increase average log-fold changes, and make p-values artificially small. The heterogeneity here could reflect technical issues, but it could also reflect genuine biological heterogeneity, and you shouldn't just pretend that it doesn't exist.
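To make the point about artificial significance concrete, here is a toy sketch in plain Python. The numbers are made up, the two groups are identical by construction (so there is no real difference), and `welch_t` and `drop_overlapping` are hypothetical helper names used only for this illustration, not functions from limma or DESeq2.

```python
from math import sqrt
from statistics import mean, median, variance


def welch_t(a, b):
    """Welch's t statistic for two lists of values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def drop_overlapping(a, b):
    """Mimic 'removing non-separable samples': keep only the values in `a`
    above the pooled median and only the values in `b` below it."""
    cut = median(a + b)
    return [x for x in a if x > cut], [x for x in b if x < cut]


# Toy data (an assumption for illustration): both groups are identical.
tumor = list(range(1, 11))
normal = list(range(1, 11))

t_before = welch_t(tumor, normal)
t_after = welch_t(*drop_overlapping(tumor, normal))

print(t_before)  # 0.0, no difference between the groups
print(t_after)   # 5.0, an apparent difference created purely by the filtering
```

The filtering step uses the group labels to decide which samples to keep, and that is exactly why it manufactures a difference that was never there.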
Any outlier removal scheme should be independent of the difference between conditions. DESeq2 has built-in outlier detection and removal on a per-gene basis. On a per-sample basis, you might do something like removing samples whose PC1/PC2 values are more than 3 or 4 standard deviations away from the mean of their own condition.
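A minimal sketch of that per-sample rule, assuming you have already exported the PC1/PC2 scores from your PCA. The function name `flag_pc_outliers`, its arguments, and the dictionary layout are hypothetical choices for this example, not part of limma or DESeq2.

```python
from collections import defaultdict
from statistics import mean, stdev


def flag_pc_outliers(pc_scores, conditions, n_sd=3.0):
    """Flag samples lying more than `n_sd` standard deviations from the
    mean of their own condition on any of the supplied PCs.

    pc_scores:  dict of sample name -> tuple of PC scores, e.g. (PC1, PC2)
    conditions: dict of sample name -> condition label
    """
    groups = defaultdict(list)
    for sample, cond in conditions.items():
        groups[cond].append(sample)

    flagged = set()
    for samples in groups.values():
        if len(samples) < 3:
            continue  # too few samples to estimate a meaningful SD
        n_pcs = len(pc_scores[samples[0]])
        for pc in range(n_pcs):
            values = [pc_scores[s][pc] for s in samples]
            mu, sd = mean(values), stdev(values)
            if sd == 0:
                continue  # no spread on this PC within this condition
            for s, v in zip(samples, values):
                if abs(v - mu) > n_sd * sd:
                    flagged.add(s)
    return sorted(flagged)
```

Each sample is compared only with the other samples of its own condition, so the rule never looks at how well the conditions separate. Note that in small groups a single sample cannot reach 3 SD from its group mean (the maximum possible is (n-1)/sqrt(n)), so this only makes sense with a reasonable number of samples per condition.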
You might also try reducing heterogeneity by using something like ComBat or SVA.
Thanks for your attention. Yes, I know that we should not omit the samples; I just worry about noise in my differential expression results and things like that.
Dear i.sudbery
I have been working on a project related to the correlation between genetic mutations and drug resistance, particularly mutations in the p53 gene, for the past six months. Given your expertise in the field of bioinformatics, I would be delighted if I could benefit from your experience to improve and enhance the level of this project.
Can you upload a picture of the PCA biplot for the benefit of those who have not seen it?