Question

Forum: 100% variance explained on one PC in DESeq2 PCA?

0

3.3 years ago

melatoninixo ▴ 10

Hi all, I tried using the DESeq2 plotPCA function after vst normalization of my dds object. However, the PCA plot I obtained shows the samples separated along a single principal component, PC1, with 100% variance explained. How is this possible, and is it even appropriate to proceed with DEG analysis from here?

RNA-Seq DESeq2 • 4.4k views

ADD COMMENT • link 3.3 years ago by melatoninixo ▴ 10

0

This intrigues me. Could you share the vst-normalized data that you are using for the PCA?

ADD REPLY • link 3.3 years ago by Santosh Anand 5.8k

0

Sure, here is the Drive link to it.

ADD REPLY • link 3.3 years ago by melatoninixo ▴ 10

0

Thanks for the data. This doesn't look like realistic data to me, as all the genes seem (sorry, I'm on mobile, so I only had a quick look) to be expressed at almost the same level (9-11). Nevertheless, unless there is a distant 'outlier' in the data, I don't see how PCA can give 100% variance in the first component.

ADD REPLY • link 3.3 years ago by Santosh Anand 5.8k

0

There is quite a large number of significantly upregulated/downregulated genes after DESeq2, though. I have 700+ in each category with log2FC > 1.5 or < -1.5.

ADD REPLY • link 3.3 years ago by melatoninixo ▴ 10

0

If the mean of the controls differs from the mean of the treatments by >= 1.5, you can easily get significant genes (e.g. control is 9, treatment is 11). However, that doesn't explain why all genes are expressed at almost the same 'high' level.

Is this the full data? If yes, which genome is this? The human genome contains >50k genes (of all kinds).
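
To make the 9 vs 11 example concrete, here is a rough sketch of the arithmetic in plain Python. The values are hypothetical and assumed to be on a log2-like scale (as vst output roughly is); DESeq2 itself estimates log2FC from a model fitted to the counts, not from a difference of transformed means, so treat this only as an illustration of why such a gap clears a 1.5 cutoff.

```python
from statistics import mean

# Hypothetical log2-scale expression values for one gene (not from the thread's data).
control = [9.0, 9.1, 8.9]
treatment = [11.0, 10.9, 11.1]

# On a log2 scale, a difference of group means approximates the log2 fold change.
log2_fc = mean(treatment) - mean(control)
fold_change = 2 ** log2_fc

# The thread's cutoff: |log2FC| > 1.5.
passes_cutoff = abs(log2_fc) > 1.5

print(f"log2FC = {log2_fc:.2f}, fold change = {fold_change:.1f}x, passes cutoff: {passes_cutoff}")
```

A gap of about 2 on the log2 scale corresponds to roughly a 4-fold change, so many genes can pass the filter even when all of them sit in a narrow 9-11 band.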

ADD REPLY • link 3.3 years ago by Santosh Anand 5.8k

0

Yes, it's the full data. I'm working on yeast.

ADD REPLY • link 3.3 years ago by melatoninixo ▴ 10

score 4 · Answer 1 · 2022-01-18

4

3.3 years ago

Mensur Dlakic ★ 29k

It means one of two things: 1) One of your features (or some linear transformation of it) is sufficient to explain all your data. 2) You have a feature in your data with greater numeric spread than all others, so all other features are squeezed into a small range and their contributions end up being insignificant.

Option #1 is possible, but option #2 is more likely. If you haven't done so already, I suggest you try normalizing the data so all variables end up on a comparable scale.
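
A minimal sketch of option #2, using only the Python standard library. The data are made up (three features built from mutually orthogonal patterns so the result is easy to check by hand), and `pc1_fraction` and `standardize` are hypothetical helper names, not part of DESeq2 or any library. The share of variance on PC1 is computed as the largest eigenvalue of the covariance matrix divided by its trace, with the eigenvalue approximated by power iteration.

```python
from statistics import mean, stdev

def covariance_matrix(columns):
    """Sample covariance matrix for a list of equally long feature columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([x - m for x in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def pc1_fraction(columns, iterations=200):
    """Fraction of total variance on PC1 (largest eigenvalue / trace)."""
    cov = covariance_matrix(columns)
    p = len(cov)
    v = [1.0] * p
    for _ in range(iterations):  # power iteration toward the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    largest = sum(a * b for a, b in zip(v, cv))  # Rayleigh quotient, |v| = 1
    return largest / sum(cov[i][i] for i in range(p))

def standardize(col):
    """Z-score a column: subtract the mean, divide by the sample std dev."""
    m, s = mean(col), stdev(col)
    return [(x - m) / s for x in col]

# Hypothetical data: 8 samples, 3 features driven by orthogonal +/-1 patterns.
pattern_a = [1, -1, 1, -1, 1, -1, 1, -1]
pattern_b = [1, 1, -1, -1, 1, 1, -1, -1]
pattern_c = [1, 1, 1, 1, -1, -1, -1, -1]
gene_a = [10 + 0.5 * s for s in pattern_a]      # narrow range, 9.5-10.5
gene_b = [10 + 0.5 * s for s in pattern_b]      # narrow range, 9.5-10.5
wide = [5000 + 3000 * s for s in pattern_c]     # one feature with a huge spread

raw = pc1_fraction([gene_a, gene_b, wide])
scaled = pc1_fraction([standardize(c) for c in (gene_a, gene_b, wide)])
print(f"PC1 share, raw: {raw:.6f}; after standardizing: {scaled:.3f}")
```

With the raw values, the wide feature carries essentially all the variance, so PC1 rounds to 100%. After z-scoring, the three features contribute equally and PC1 drops to about one third.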

ADD COMMENT • link 3.3 years ago by Mensur Dlakic ★ 29k

0

Thank you. Does this mean that I can still proceed with the DEG analysis, but that further normalization may be required for PCA? I used the normalization that comes with DESeq2 via the vst function and only specified the treatment condition in my design formula. Could that be the issue?

ADD REPLY • link 3.3 years ago by melatoninixo ▴ 10

0

Those are very good observations. However, in biological terms, option 2 would mean that the expression of one gene varies over an extremely wide range compared to all the others, a somewhat unrealistic expectation, as genes are not expressed in isolation. It is also possible, though, that the data are not on a log scale, which would create unwanted variance.

PS: I guess you meant standardization of variables rather than normalization.
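
On the log-scale point, a small sketch with made-up counts (two hypothetical genes that both double between conditions, one lowly and one highly expressed). It only illustrates why untransformed counts let highly expressed genes dominate the variance; it is not a reproduction of what vst does.

```python
from math import log2
from statistics import variance

# Hypothetical raw counts: both genes double from the first three samples to the last three.
low_gene = [10, 10, 10, 20, 20, 20]
high_gene = [10000, 10000, 10000, 20000, 20000, 20000]

# On the raw scale, variance grows with the square of the expression level.
raw_ratio = variance(high_gene) / variance(low_gene)

# On the log2 scale, the same 2-fold change gives the same variance for both genes.
log_ratio = variance([log2(x) for x in high_gene]) / variance([log2(x) for x in low_gene])

print(f"variance ratio (high/low), raw: {raw_ratio:.0f}; log2: {log_ratio:.2f}")
```

On raw counts the highly expressed gene has about a million times the variance of the other, even though the biological change is identical, which is the kind of imbalance that can push nearly all the variance onto one component.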

ADD REPLY • link 3.3 years ago by Santosh Anand 5.8k