9.1 years ago
brs120c
I'm running PCA on a single-cell RNA-seq dataset to determine cell types based on their transcriptomic profiles. Being a newbie in bioinformatics, I have a few questions that I'm hoping to find answers to here:
1. Are centering and scaling necessary if you are working with log2-transformed expression values? I'm using prcomp in R.
2. I'm seeing some interesting sub-clusters emerge when I start with k = 2 (k-means clustering), then take one of those 2 groups and cluster it again using k between 2 and 4. When I start with a large k, hoping to reveal all sub-clusters in one go, the clusters overlap a lot, so they don't look like convincing sub-clusters. Is there a drawback to my approach of taking the samples that fall in one cluster, clustering them again, and repeating this until I see no convincing separation?
3. My PC1 and PC2 generally explain roughly 6% and 4% of the total variance. This sounds really low, but given the noise level in single-cell RNA-seq data, is this to be expected? By the way, my dataset has ~10000 genes and 70 samples.
Thanks!
Yes, I would both scale and center log2-expression values. Note that prcomp centers by default (center = TRUE) but does not scale (scale. = FALSE), so you have to ask for scaling explicitly. Given (3) (only 4-6% of variance explained by PC1 and PC2), I would be very cautious about interpreting any of the sub-clusters you are observing in (2). Are you using all genes in the PCA and k-means, or have you filtered out genes with low variance, or genes that show no significant variation across your experimental or technical groups? Noise and normalization are certainly concerns for single-cell data.
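To make the suggestions above concrete, here is a minimal sketch in R. It is only an illustration under stated assumptions: the matrix `log_expr` is simulated stand-in data (samples in rows, genes in columns), the variable names are made up, and the cutoff of 500 most variable genes is arbitrary. It filters genes by variance, runs prcomp with centering and scaling, and computes the proportion of variance explained by each component.

```r
set.seed(1)

# Hypothetical stand-in data: 70 samples (rows) x 2000 genes (columns) of
# log2 expression values. Replace this with your own matrix.
n_samples <- 70
n_genes <- 2000
log_expr <- matrix(
  rnorm(n_samples * n_genes, mean = 5, sd = 1),
  nrow = n_samples, ncol = n_genes,
  dimnames = list(paste0("cell", seq_len(n_samples)),
                  paste0("gene", seq_len(n_genes)))
)

# Keep only the most variable genes. The cutoff (500) is an arbitrary choice
# for illustration. This also drops constant genes, which would otherwise
# make prcomp fail when scale. = TRUE.
gene_var <- apply(log_expr, 2, var)
top_genes <- order(gene_var, decreasing = TRUE)[1:500]
filtered <- log_expr[, top_genes]

# prcomp centers by default but does not scale, so request both explicitly.
pca <- prcomp(filtered, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Percent of variance explained by PC1 and PC2.
print(round(100 * var_explained[1:2], 1))
```

With purely random data like this, PC1 and PC2 explain only a few percent each, which gives a rough baseline to compare real data against. Whether variance filtering is the right choice for a given dataset, and what cutoff to use, is a judgment call.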