Question

PCA result and batch effect?

3.2 years ago

k0stasmp ▴ 10

Hello,

I am processing a data frame of about 55,000 genes (TPM values; I have no access to the raw data) and 400 samples. After removing the zero-variance genes, I run a PCA on the samples to detect outliers. The samples consistently split into 2 distinct populations. I have tried log2-transforming and centering/scaling the data, but the effect remains. I then filtered the samples by race and by sex, with no effect. Is this behaviour to be expected? Could it be a batch effect?
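For readers who want to reproduce the preprocessing described above, here is a minimal base-R sketch. The toy matrix `tpm`, its dimensions, and the `+ 1` pseudocount are assumptions for illustration and are not taken from the post; the real data would replace `tpm`.

```r
# Toy data standing in for the real TPM table (genes in rows, samples in columns).
set.seed(1)
n_genes <- 200
n_samples <- 40
tpm <- matrix(rexp(n_genes * n_samples, rate = 0.1), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)),
                              paste0("s", seq_len(n_samples))))
tpm[1:5, ] <- 0  # a few zero-variance genes, as in the post

# log2 transform; the +1 pseudocount is an assumption to avoid log2(0)
log_tpm <- log2(tpm + 1)

# Drop zero-variance genes, since scaling them would produce NaN
gene_var <- apply(log_tpm, 1, var)
log_tpm <- log_tpm[gene_var > 0, , drop = FALSE]

# prcomp expects samples in rows, so transpose before centering and scaling
pca <- prcomp(t(log_tpm), center = TRUE, scale. = TRUE)

# Fraction of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(head(var_explained, 5), 3)
```

The sample coordinates are in `pca$x`; plotting `pca$x[, 1]` against `pca$x[, 2]` gives the usual PC1 vs PC2 view.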

I have also uploaded the dendrogram of my data derived through:

sampleTree = hclust(dist(n13_pca_scz_min), method = "average");

I drew the red line, which gives me a cluster of around 240 samples (everything below the line). Is it correct to go on to WGCNA analysis using just those samples?
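As a sketch of how cluster membership could be extracted from such a tree instead of being read off the plot, base R's `cutree` can be applied to the `hclust` result. The toy matrix `expr` and the choice of `k = 2` are assumptions; the height of the red line in the post is not known, so it is not used here.

```r
set.seed(2)
# Toy matrix with samples in rows: two groups of samples with shifted means
expr <- rbind(matrix(rnorm(24 * 50, mean = 0), nrow = 24),
              matrix(rnorm(16 * 50, mean = 3), nrow = 16))
rownames(expr) <- paste0("s", seq_len(nrow(expr)))

# Same call shape as in the post
sampleTree <- hclust(dist(expr), method = "average")

# Cut the tree into two groups; cutree(sampleTree, h = <height>) would
# instead cut at a chosen height such as a line drawn on the dendrogram
cluster <- cutree(sampleTree, k = 2)
table(cluster)

# Keep each group as its own matrix so both can be analyzed side by side
expr_by_cluster <- lapply(split(rownames(expr), cluster),
                          function(ids) expr[ids, , drop = FALSE])
```

Keeping both subsets, rather than discarding one, makes it possible to run the downstream analysis on each and compare the results.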

Thank you,

Costas

[Image: PCA]

[Image: Dendrogram]

PCA • 1.2k views

updated 3.2 years ago by jared.andrews07 ★ 18k • written 3.2 years ago by k0stasmp ▴ 10

score 3 · Answer 1 · 2021-09-30

I'd highly recommend trying an eigencorplot from PCAtools to determine what variable is driving that difference. You can then account for it or subset samples as appropriate. I wouldn't toss one population or the other, but analyze them side by side to see how similar the results from each are. There could be interesting biology there (or not), but you won't know if you don't look.
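The idea behind that kind of plot, correlating principal components with sample metadata, can be sketched in base R. This is not the PCAtools implementation, only a rough illustration of the same idea; the metadata columns (`batch`, `sex`, `age`) and the toy expression matrix are hypothetical.

```r
set.seed(3)
n <- 40
# Hypothetical sample metadata, numerically coded; column names are placeholders
meta <- data.frame(
  batch = rep(c(0, 1), each = n / 2),   # hidden technical factor, coded 0/1
  sex   = rep(c(0, 1), times = n / 2),
  age   = round(runif(n, 20, 80))
)

# Toy expression matrix (samples x genes) in which batch shifts half the genes
expr <- matrix(rnorm(n * 100), nrow = n)
expr[, 1:50] <- expr[, 1:50] + 2 * meta$batch

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Pearson correlation of the first 5 PCs with each metadata variable
pc_cor <- cor(pca$x[, 1:5], meta)
round(pc_cor, 2)

# Variable most strongly associated with PC1
driver <- names(which.max(abs(pc_cor["PC1", ])))
driver
```

A variable that correlates strongly with the component separating the two populations is a candidate explanation for the split; categorical variables with more than two levels would need a different association measure than Pearson correlation.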

score 2 · Answer 2 · 2021-09-30

Batch effects are an important source of variation in all types of NGS studies.

Although race and sex do not appear to be a major source of variation in your data, there may be technical factors that are causing this variation (i.e., a batch effect).

If you have access to these factors, e.g.,

- The technician who processed each sample (if there were multiple technicians)
- The sequencing batch (if there were multiple sequencing runs)
- The date of sample collection (if samples were collected on different dates)
- Others

you can color your samples by those and see if they reveal any additional insights. If so, you can account for that variation in your statistical analysis.
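A rough sketch of that workflow in base R follows: color the PCA by a candidate factor, check whether the two clusters line up with it, and, as one simple option, regress it out. The factor `seq_batch` and the toy matrix are hypothetical, and regressing the factor out is only one of several ways to account for it (including it as a covariate in the downstream model is another).

```r
set.seed(4)
n <- 40
# Hypothetical technical annotation: two sequencing runs
seq_batch <- factor(rep(c("run1", "run2"), each = n / 2))

# Toy expression matrix (samples x genes) with a shift in the second run
expr <- matrix(rnorm(n * 100), nrow = n)
expr[seq_batch == "run2", 1:50] <- expr[seq_batch == "run2", 1:50] + 2

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Color samples by the candidate factor on PC1 vs PC2
plot(pca$x[, 1], pca$x[, 2], col = as.integer(seq_batch), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(seq_batch),
       col = seq_along(levels(seq_batch)), pch = 19)

# Cross-tabulate the two hierarchical clusters against the factor
cluster <- cutree(hclust(dist(expr), method = "average"), k = 2)
tab <- table(cluster, seq_batch)
p_value <- fisher.test(tab)$p.value

# One simple adjustment: per-gene residuals after regressing out the factor
adj <- residuals(lm(expr ~ seq_batch))
```

If the clusters track the factor closely (a small `p_value` here), the split is more likely technical than biological, and the adjusted matrix `adj` no longer carries a mean difference between the runs.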

Hope this helps!