Hello,
I am processing a data frame of about 55,000 genes (TPM values; I have no access to the raw counts) and 400 samples. After removing the zero-variance genes, I run a PCA on the samples to detect outliers. The samples consistently separate into 2 distinct populations. I have tried log2-transforming and centering/scaling the data, but the separation remains. I then subset the samples by race and by sex, but the two populations persist in each subset. Is this behaviour to be expected? Could it be a batch effect?
I have also uploaded the dendrogram of my data, derived with:
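For reference, here is a minimal sketch of the preprocessing described above, run on simulated data because I cannot share the real matrix. The object names (tpm, log_tpm, pca) and the pseudocount of 1 in the log2 step are placeholders chosen for illustration, not necessarily what my actual script uses.

```r
set.seed(1)

# Simulated stand-in for the real matrix: genes in rows, samples in columns
tpm <- matrix(rexp(200 * 20, rate = 0.1), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("sample", 1:20)))
tpm[1:5, ] <- 0  # a few genes with zero variance across samples

# Remove zero-variance genes
gene_var <- apply(tpm, 1, var)
tpm_filt <- tpm[gene_var > 0, , drop = FALSE]

# log2 transform (the pseudocount of 1 is an assumption)
log_tpm <- log2(tpm_filt + 1)

# PCA on the samples: prcomp expects observations (samples) in rows
pca <- prcomp(t(log_tpm), center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and the first two PC scores
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pc_scores <- pca$x[, 1:2]
```

Plotting pc_scores coloured by any available batch, site, or sequencing-run annotation would be one way to check whether the two populations line up with a technical variable.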
sampleTree <- hclust(dist(n13_pca_scz_min), method = "average")
I drew the red line, which gives a cluster of around 240 samples (everything below the line). Is it valid to proceed to the WGCNA analysis using only those samples?
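To make the cut concrete, this is roughly how the samples below the line could be selected with base R, again on simulated data. The matrix expr, the two simulated groups, and the way cut_height is chosen here (midway between the last two merges) are assumptions for illustration; in my real data the height is simply where I drew the red line by eye.

```r
set.seed(1)

# Simulated stand-in: 20 samples in rows, 50 features in columns, two groups
expr <- rbind(matrix(rnorm(12 * 50), nrow = 12),
              matrix(rnorm(8 * 50, mean = 3), nrow = 8))
rownames(expr) <- paste0("sample", 1:20)

sampleTree <- hclust(dist(expr), method = "average")

# Hypothetical cut height: midway between the two highest merges,
# which splits the tree into two clusters
cut_height <- mean(tail(sampleTree$height, 2))

# Assign each sample to a cluster at that height
clusters <- cutree(sampleTree, h = cut_height)

# Keep only the samples in the largest cluster
largest <- as.integer(names(which.max(table(clusters))))
keep <- clusters == largest
expr_kept <- expr[keep, , drop = FALSE]
```

The expr_kept matrix is what I would then pass on to the WGCNA steps.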
Thank you,
Costas