Question

PCA in gene expression

0

20 months ago

ali • 0

Dear colleagues,

I would like to seek your expert opinions regarding the use of principal component analysis (PCA) in The Cancer Genome Atlas (TCGA) dataset. Specifically, I would like to discuss the issue of sample separation in the PCA plot obtained after normalizing the COAD data using limma or DESeq2. As you may have observed, some samples are separable while others are not, possibly due to factors such as heterogeneity. In this regard, I would appreciate your insights on whether we should exclude these non-separable samples from further processing or include them in our analysis.

Thank you for your valuable input.

PCA TCGA • 1.2k views

ADD COMMENT • link 20 months ago by ali • 0

0

Can you upload a picture of the PCA biplot for the benefit of those who have not seen it?

ADD REPLY • link 20 months ago by i.sudbery 20k

score 2 · Answer 1 · 2023-03-23

2

20 months ago

i.sudbery 20k

No, you should definitely not remove samples on the basis that they are not separable between conditions in a PCA. To do so would artificially reduce the estimated variance between samples, artificially increase average log-fold changes, and produce artificially small p-values. The heterogeneity here could reflect technical issues, but it could also reflect genuine biological heterogeneity, and you shouldn't just pretend that it doesn't exist.
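To make the selection effect concrete, here is a minimal simulation sketch. It assumes a single simulated score per sample (standing in for a PC score or one gene's expression) and two groups drawn from the same distribution, so there is no true difference at all. The function names `welch_t` and `simulate_selection` are made up for illustration and are not part of limma or DESeq2.

```python
import random
import statistics


def welch_t(a, b):
    """Welch t-statistic for mean(b) - mean(a)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / se


def simulate_selection(n_per_group=50, seed=0):
    """Compare the t-statistic before and after dropping 'non-separable' samples.

    Both groups are pure noise from the same normal distribution.
    """
    rng = random.Random(seed)
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]

    # "Separable" here means: on the expected side of the pooled median.
    cut = statistics.median(group_a + group_b)
    kept_a = [x for x in group_a if x < cut]
    kept_b = [x for x in group_b if x > cut]

    return welch_t(group_a, group_b), welch_t(kept_a, kept_b)


if __name__ == "__main__":
    t_all, t_kept = simulate_selection()
    print(f"t with all samples:        {t_all:.2f}")
    print(f"t after dropping overlaps: {t_kept:.2f}")
```

Because the samples are kept or dropped according to which side of the groups they fall on, the within-group variance shrinks and the apparent difference grows, even though the data contain no real effect.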

Any outlier removal scheme should be independent of the difference between conditions. DESeq2 has built-in outlier detection and removal on a per-gene basis. On a per-sample basis, you might do something like removing samples whose PC1/PC2 values are more than 3 or 4 standard deviations away from the mean of the same condition.
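A rough sketch of that per-sample rule, assuming a samples-by-genes matrix of already normalized, log-scale expression values and one condition label per sample. The function name `flag_outlier_samples` and its parameters are hypothetical, and PCA is done here with a plain NumPy SVD rather than any particular package's routine.

```python
import numpy as np


def flag_outlier_samples(expr, conditions, n_pcs=2, n_sd=3.0):
    """Flag samples far from their own condition's mean on the top PCs.

    expr       : (samples x genes) array of normalized, log-scale expression
    conditions : one condition label per sample
    Returns (scores, outlier), where outlier is a boolean array per sample.
    """
    expr = np.asarray(expr, dtype=float)
    conditions = np.asarray(conditions)

    # PCA via SVD of the gene-centered matrix; sample scores are U * S.
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]

    outlier = np.zeros(expr.shape[0], dtype=bool)
    for cond in np.unique(conditions):
        idx = conditions == cond
        group = scores[idx]
        # Distance from the mean of the SAME condition, in SD units,
        # so the rule does not depend on how well conditions separate.
        z = np.abs(group - group.mean(axis=0)) / group.std(axis=0, ddof=1)
        outlier[idx] = (z > n_sd).any(axis=1)
    return scores, outlier
```

One caveat with this kind of rule: because the outlier itself contributes to the standard deviation, a sample in a group of n can be at most (n - 1) / sqrt(n) sample standard deviations from its group mean, so a 3 SD threshold can never fire in groups of fewer than 11 samples.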

You might also try reducing heterogeneity by using something like ComBat (for known batches) or SVA (for unknown sources of variation).

ADD COMMENT • link 20 months ago by i.sudbery 20k

0

Thanks for your attention. Yes, I know that we should not omit the samples; I am just worried about noise in my differential expression results and things like that.

ADD REPLY • link 20 months ago by ali • 0

1

The noise in your differential expression is exactly the point. If samples are noisy, then genes should not be called differential.

ADD REPLY • link 20 months ago by i.sudbery 20k

0

Thanks for your help

Dear i.sudbery, for the past six months I have been working on a project on the correlation between genetic mutations and drug resistance, particularly mutations in the p53 gene. Given your expertise in bioinformatics, I would be delighted to benefit from your experience to improve this project.

ADD REPLY • link 20 months ago by ali • 0