Question

PCA plot analysis

9 months ago

Sanjukta • 0

I am not sure whether this question fits here; maybe part of it does.

I generated these two PCA plots on Galaxy from two different datasets. In both, each group splits into several clusters. Also, in the Obese vs Control dataset, PC1 explains a very high share of the variance (81%). So, as I understand it, neither dataset is good to go.

Obese dataset: [PCA plot, image not included]

T2D dataset: [PCA plot, image not included]

My questions are:

How do we know if a PCA plot is good? From online tutorials, the PC1 should not be very low or very high, a PC1 of 40-60% is good. Also, the samples should be neatly clustered into a single cluster.
Can I still take forward these two datasets by picking the closely clustered samples? What is the minimum number of samples one should have to perform RNA-Seq in each group of a dataset?

Is there a statistical way to figure out which sub-clusters to choose for the next level of analysis?

rna-seq • 655 views

updated 9 months ago by jared.andrews07 • written 9 months ago by Sanjukta

score 2 · Answer 1 · 2024-02-26

How do we know if a PCA plot is good? From online tutorials, the PC1 should not be very low or very high, a PC1 of 40-60% is good.

Do not tie the percentage of variance explained by PC1 to dataset quality; the two do not necessarily correlate at all. PC1 may explain a relatively low or high share of the variability in the data without that saying anything about data quality.

Also, the samples should be neatly clustered into a single cluster.

In reality, it is not uncommon for there to be multiple groupings. The data is almost always more complicated than just x versus y, and there are usually more variables that differentiate the samples. This is especially true for human disease datasets. Sometimes PC1 may not separate samples by your condition of interest at all, but PC2, PC3, PC4, etc. might. Look at more than just PC1 and PC2.
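To make the "look beyond PC1 and PC2" point concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the data are simulated, and `batch`, `condition`, and `pca_scores` are hypothetical names, not anything from the Galaxy workflow. It simulates a strong nuisance effect plus a weaker condition effect, then reports how each of the first few PCs correlates with both.

```python
import numpy as np

def pca_scores(expr, n_components=4):
    """PCA of a samples x genes matrix via SVD.

    Returns the sample scores and the fraction of variance explained
    for the first n_components components.
    """
    centered = expr - expr.mean(axis=0)           # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    ratio = s ** 2 / (s ** 2).sum()               # explained-variance ratio
    scores = u * s                                # sample coordinates on each PC
    return scores[:, :n_components], ratio[:n_components]

# Simulated data (hypothetical): 12 samples x 200 genes.
rng = np.random.default_rng(0)
condition = np.array([0] * 6 + [1] * 6)           # e.g. control vs disease
batch = np.array([0, 1] * 6)                      # nuisance variable, balanced across condition
expr = rng.normal(size=(12, 200))
expr[:, :50] += 6.0 * batch[:, None]              # strong nuisance effect on 50 genes
expr[:, 50:80] += 3.0 * condition[:, None]        # weaker condition effect on 30 genes

scores, ratio = pca_scores(expr)
for i in range(scores.shape[1]):
    r_cond = np.corrcoef(scores[:, i], condition)[0, 1]
    r_batch = np.corrcoef(scores[:, i], batch)[0, 1]
    print(f"PC{i + 1}: {ratio[i]:.1%} of variance, "
          f"|r| with condition = {abs(r_cond):.2f}, |r| with batch = {abs(r_batch):.2f}")
```

In this toy setup PC1 carries most of the variance but tracks the nuisance variable, while the condition only shows up on a later component. A large PC1 percentage here says nothing about whether the data are usable.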

Can I still take forward these two datasets by picking the closely clustered samples?

You certainly can still use these datasets.

What is the minimum number of samples one should have to perform RNA-Seq in each group of a dataset?

This is a bit of a loaded question, and the answer depends on how much power you really want/need. There are a few papers that look at this with varying answers, like this one that recommends 6 for decent power and up to 12 if you really want to capture every difference. In practice, people typically shoot for at least 3.

Is there a statistical way to figure out which sub-clusters to choose for the next level of analysis?

There are all sorts of clustering algorithms to help break up samples, but first you should look at other variables in the data that might explain why some samples tend to group together in a given PC. Depending on the data type, things like sex, tissue origin, age, read depth (scATAC is a prime example here: PC1 is always dominated by read depth differences), culture conditions, etc. can all play into how similar any set of samples appears in PCA.
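One way to check whether a known variable explains a PC is to compute how much of that PC's variance the variable accounts for. The sketch below is only an illustration: the PC1 scores and the metadata (`sex`, `group`, `read_depth`) are made-up values, and `r_squared` is a hypothetical helper. It uses eta-squared from group means for categorical covariates and squared Pearson correlation for numeric ones.

```python
import numpy as np

def r_squared(pc, covariate):
    """Fraction of variance in one PC's scores explained by a covariate.

    String covariates are treated as categorical (one-way ANOVA eta-squared);
    numeric covariates use the squared Pearson correlation.
    """
    pc = np.asarray(pc, dtype=float)
    cov = np.asarray(covariate)
    total = ((pc - pc.mean()) ** 2).sum()
    if cov.dtype.kind in "OUS":                   # categorical covariate
        between = sum(
            (cov == level).sum() * (pc[cov == level].mean() - pc.mean()) ** 2
            for level in np.unique(cov)
        )
        return float(between / total)
    return float(np.corrcoef(pc, cov.astype(float))[0, 1] ** 2)

# Made-up PC1 scores and sample metadata for eight samples (illustration only).
pc1 = np.array([-4.1, -3.8, 4.2, 3.9, -4.0, 4.3, -3.7, 4.0])
metadata = {
    "sex": np.array(["M", "M", "F", "F", "M", "F", "M", "F"]),
    "group": np.array(["control"] * 4 + ["obese"] * 4),
    "read_depth": np.array([21.0, 25.5, 23.1, 19.8, 24.2, 22.7, 20.3, 26.0]),  # millions of reads
}

explained = {name: r_squared(pc1, values) for name, values in metadata.items()}
for name, value in sorted(explained.items(), key=lambda kv: -kv[1]):
    print(f"{name}: R^2 with PC1 = {value:.3f}")
```

In this invented example PC1 is almost entirely explained by sex rather than by the experimental group, which is the kind of result that would account for sub-clusters inside each group. With real data you would repeat this for several PCs and every recorded covariate.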

score 1 · Answer 2 · 2024-02-26

From online tutorials, the PC1 should not be very low or very high,

This is a bad way to think about it. Your data is what it is. That said, 81% is high to me. I usually don't see one set of genes driving so much variance unless I am comparing different tissues, and I would be very surprised if diet made such a drastic change in the brain.

I'd start by figuring out which genes are driving PC1 and seeing whether that makes sense given what you are trying to learn from this dataset. If you found that they were all, say, sex genes or circadian genes, you would conclude that those differences are going to make it harder to see the differences you care about. If these genes are involved in pathways that you expect to be impacted in the brain by diet, then great, your experiment is fine, because your treatment is a sledgehammer and your PCA reflects that.
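A rough sketch of how to find the genes driving PC1, again in Python with NumPy and on simulated data. The gene names, the `sex` vector, and the `top_loading_genes` helper are hypothetical. The idea is to rank genes by the absolute value of their PC1 loading, then look up what the top genes have in common.

```python
import numpy as np

def top_loading_genes(expr, gene_names, component=0, n_top=10):
    """Return the n_top genes with the largest absolute loading on one PC.

    expr is a samples x genes matrix; loadings are the rows of V^T
    from the SVD of the gene-centered matrix.
    """
    centered = expr - expr.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[component]
    order = np.argsort(np.abs(loadings))[::-1][:n_top]
    return [(gene_names[i], float(loadings[i])) for i in order]

# Simulated data (hypothetical): 10 samples x 100 genes, where five
# "sex-linked" genes dominate the variance.
rng = np.random.default_rng(1)
gene_names = [f"gene_{i}" for i in range(100)]
sex = np.array([0, 1] * 5)
expr = rng.normal(size=(10, 100))
expr[:, :5] += 8.0 * sex[:, None]

top = top_loading_genes(expr, gene_names, n_top=5)
for name, loading in top:
    print(f"{name}: loading = {loading:+.3f}")
```

The sign of a loading is arbitrary, so only the magnitude matters for the ranking. On real data you would take the top genes to an annotation or enrichment tool to see whether they point to sex, circadian rhythm, tissue composition, or the biology you actually care about.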

Can I still take forward these two datasets by picking the closely clustered samples?

I absolutely would not omit any samples from the analysis based on these PCA plots. Given how small your experiment is, I would at most omit one, if it was a clear outlier from all other samples, on the grounds that there might be something wrong with that one. But that is not your situation.

Keeping every sample might result in seeing virtually no gene signatures in the T2D samples. Even so, cherry-picking two clusters of samples is a horrible, horrible idea.

The goal is not to get a list of DE genes. The goal is to report honestly what DE genes you see using honest analysis techniques. You cannot massage the samples to get the outcome you want. Given that you are looking at brains and the treatment is metabolic conditions, and you only have a handful of samples, I'm not sure you should expect a strong effect.