Dear colleagues,
I have a dataset with 5 groups: 3 consist of patients with different cancer types, one consists of patients with benign tumours, and one is a healthy control group. Protein concentrations were measured using a method that either assigns an intensity of 0 if the protein was not detected or returns some value (although, since these are relative rather than absolute concentrations, the values on their own are mathematically meaningless). The sample size is rather small: fewer than 90 subjects overall.
When PCA is run on the whole set of predictors (200+ proteins), there are no distinctive patterns between the groups. However, when it is run on a subset of proteins (~50, chosen from a literature review because they showed some association with cancer in previous studies), there is some mild separation visible between the cancer groups and the other groups.
How can this situation be interpreted, from both a biological and a technical perspective? A linear combination of all features (including major proteins like albumin and haemoglobin AND the cancer-related ones) shows no separation of the data, while the subset of features does (very mildly).
I would say the exact interpretation depends on the nature of the data, the experimental setup, and the data processing steps. You provided some good information about the possible values, but not enough for a precise interpretation. For example: what is the distribution of values? Are 0s excluded? Are most proteins 0?
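To make those questions concrete, here is a minimal Python sketch for summarising the zeros. It assumes the intensities sit in a samples-by-proteins NumPy array; the function name `zero_summary`, the 50% threshold, and the toy matrix are made up for illustration and are not from your data.

```python
import numpy as np

def zero_summary(X, mostly_zero_threshold=0.5):
    """Summarise how many intensities are 0 (not detected).

    X: 2-D array, rows = samples, columns = proteins (assumed layout).
    """
    X = np.asarray(X, dtype=float)
    is_zero = (X == 0)
    per_protein = is_zero.mean(axis=0)   # fraction of samples with 0, per protein
    per_sample = is_zero.mean(axis=1)    # fraction of proteins with 0, per sample
    return {
        "overall_zero_fraction": float(is_zero.mean()),
        "zero_fraction_per_protein": per_protein,
        "zero_fraction_per_sample": per_sample,
        # indices of proteins that are 0 in more than the threshold share of samples
        "proteins_mostly_zero": np.flatnonzero(per_protein > mostly_zero_threshold),
    }

# Toy matrix: 3 samples x 4 proteins (illustrative values only)
X = np.array([
    [0.0, 2.1, 0.0, 5.0],
    [0.0, 1.9, 0.0, 4.2],
    [0.0, 2.4, 3.1, 0.0],
])
summary = zero_summary(X)
print(summary["overall_zero_fraction"])
print(summary["proteins_mostly_zero"])
```

If a large share of proteins turns out to be mostly 0, that alone could shape what PCA picks up, so it is worth checking before interpreting the plots.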
In general, to my understanding, PCA separates samples along the directions of greatest variance in the data. In RNA-seq, it is common to run PCA on only the most variable genes across samples/groups. It could make sense that the whole set shows less separation than 50 selected proteins, especially if the data are noisy or many proteins have similar values across groups, because the uninformative proteins can dominate the leading components.
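As a rough illustration of that effect, the simulation below builds purely synthetic data (not your data) in which 50 features carry a modest group shift and 150 unrelated features are noisier. It then compares how well the first principal component separates two groups when PCA uses all features versus only the 50. All sizes, variances, and helper names (`pc_scores`, `separation`) are assumptions chosen for the example, and the real explanation for your dataset may be different.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Project samples onto the leading principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def separation(scores, labels):
    """Standardised distance between the two group means on PC1."""
    a = scores[labels == 0, 0]
    b = scores[labels == 1, 0]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 45)                   # 90 samples, two groups

# 50 "informative" features: unit noise plus a shift of 1 in group 1
signal = rng.normal(0.0, 1.0, size=(90, 50))
signal[labels == 1] += 1.0

# 150 unrelated features with larger variance and no group effect
noise = rng.normal(0.0, 3.0, size=(90, 150))

X_all = np.hstack([signal, noise])
subset_idx = np.arange(50)                       # stands in for a curated list

sep_all = separation(pc_scores(X_all), labels)
sep_subset = separation(pc_scores(X_all[:, subset_idx]), labels)
print(f"PC1 separation, all features:   {sep_all:.2f}")
print(f"PC1 separation, curated subset: {sep_subset:.2f}")
```

In this toy setup the high-variance unrelated features take over the first components, so the group difference only shows up once they are left out. Whether something similar is happening in your data depends on scaling, normalisation, and how the zeros are handled.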
If you only have a few hundred proteins, I'd be more interested in hierarchical clustering visualized with a heatmap.
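A minimal sketch of that idea, assuming SciPy is available and the data are in a samples-by-proteins array: cluster rows and columns, then reorder the matrix so that similar samples and similar proteins sit next to each other. The helper `clustered_order`, the average-linkage/Euclidean choice, and the toy matrix are illustrative assumptions only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def clustered_order(X, method="average", metric="euclidean"):
    """Return row and column orderings from hierarchical clustering."""
    row_order = leaves_list(linkage(X, method=method, metric=metric))
    col_order = leaves_list(linkage(X.T, method=method, metric=metric))
    return row_order, col_order

# Toy matrix: 6 samples x 4 proteins, two interleaved sample groups
X = np.array([
    [10.0,  9.0,  0.5,  0.0],
    [ 0.0,  1.0,  9.5, 10.0],
    [ 9.5, 10.0,  0.0,  1.0],
    [ 1.0,  0.0, 10.0,  9.0],
    [10.5,  9.5,  1.0,  0.5],
    [ 0.5,  0.5,  9.0, 10.5],
])

row_order, col_order = clustered_order(X)
# Reordered matrix, ready to be drawn as a heatmap
heatmap_matrix = X[np.ix_(row_order, col_order)]
print(row_order, col_order)
```

The reordered matrix can be passed to any plotting routine, for example matplotlib's `imshow`. seaborn's `clustermap` does the clustering and the plotting in one call, if you would rather not handle the ordering yourself.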