Question

Need advice regarding k-means clustering and PCA

0

9.5 years ago

brs120c • 0

I'm doing PCA on single-cell RNA-seq data to determine cell types based on their transcriptomic profiles. Being a newbie in bioinformatics, I have a few questions that I'm hoping to find answers to here:

(1) Are centering and scaling necessary if you are working with log2-transformed expression values? I'm using prcomp in R.
(2) I'm seeing some interesting sub-clusters emerge when I start with k = 2 (k-means clustering), then take one of those 2 groups and cluster it again using k values between 2 and 4. When I start with a large k value, hoping to reveal all sub-clusters in one go, the clusters overlap a lot, so they don't look like convincing sub-clusters. Is there a drawback to my approach of taking the samples that fall in one cluster, clustering them again, and repeating this until I see no convincing separations?
(3) My PC1 and PC2 in general seem to explain roughly 6% and 4% of the total variance. This sounds really low, but given the noise level in single-cell RNA-seq data, is this to be expected? Btw, my dataset has ~10000 genes and 70 samples.

Thanks!

RNA-Seq PCA • 2.6k views

updated 2.8 years ago by Ram 45k • written 9.5 years ago by brs120c • 0

2

Yes, I would both center and scale log2-expression values. I've forgotten whether prcomp does that by default. Given (3) (only 4-6% of variance explained), I would be very cautious in interpreting any sub-clusters you are observing in (2). Are you using all genes in the PCA and k-means, or have you filtered out genes with low variance, or genes that show no significant variation across your experimental or technical groups? Noise and normalization are certainly concerns for single-cell data.
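
For reference, R's prcomp defaults to center = TRUE and scale. = FALSE, so it centers but does not scale unless asked. The points above (filter low-variance genes, center and scale, then judge how much variance the top PCs explain) can be sketched roughly as follows. This is only an illustration in Python/NumPy rather than R. The data are synthetic, and the names `pca_variance_explained` and `permuted_pc1`, the 70 x 2000 matrix, the top-1000 gene cutoff and the shuffling baseline are all assumptions made for the example, not anything from the original dataset. Shuffling each gene independently across samples destroys gene-gene correlation, so the shuffled PC1 fraction gives a crude idea of what pure noise would produce at this matrix size.

```python
import numpy as np

def pca_variance_explained(x, center=True, scale=True):
    """Fraction of total variance per PC for a samples x genes matrix."""
    x = np.asarray(x, dtype=float)
    if center:
        x = x - x.mean(axis=0)
    if scale:
        # Unit variance per gene; assumes zero-variance genes were removed.
        x = x / x.std(axis=0, ddof=1)
    singular_values = np.linalg.svd(x, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def permuted_pc1(x, n_perm=20, seed=0):
    """PC1 variance fraction after shuffling every gene independently."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_perm):
        # Independent random row order for each column (gene).
        order = rng.random(x.shape).argsort(axis=0)
        shuffled = np.take_along_axis(x, order, axis=0)
        fractions.append(pca_variance_explained(shuffled)[0])
    return np.array(fractions)

# Synthetic stand-in for a log2 expression matrix: 70 samples x 2000 genes,
# with two hypothetical cell groups differing in the first 100 genes.
rng = np.random.default_rng(1)
expr = rng.normal(loc=5.0, scale=1.0, size=(70, 2000))
expr[:35, :100] += 3.0

# Keep the 1000 most variable genes (the cutoff is an arbitrary choice here).
gene_var = expr.var(axis=0, ddof=1)
keep = np.argsort(gene_var)[-1000:]
filtered = expr[:, keep]

observed = pca_variance_explained(filtered)
null_pc1 = permuted_pc1(filtered)

print("PC1, PC2 variance explained:", observed[:2])
print("largest PC1 fraction under shuffling:", null_pc1.max())
```

If the observed PC1 fraction is no larger than the shuffled values, the apparent structure is hard to distinguish from noise. With only 70 samples there are at most 69 informative PCs, so even unstructured data will put a percent or two on PC1. This is a rough baseline, not a formal significance test.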

updated 5.5 years ago by Ram 45k • written 9.5 years ago by Ahill ★ 2.0k