Hi, I am trying to perform PCA on my dataset, and from reading various blogs I understood that I should log-transform my data to reduce the impact of potential outliers. However, I am unsure whether I should also scale after the log transformation. PCA with and without standardizing gives different results, because centering and scaling sets each variable's mean to 0 and its SD to 1. In this blog https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/ they suggest log-transforming and then centering and scaling the dataset when computing the PCA, whereas for example here https://jackauty.com/pca-and-3d-pca/ when they set scale=TRUE they do not perform a log transformation prior to computing the PCA (and vice versa).
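For concreteness, the two workflows I am comparing look roughly like this (`mat` is just a hypothetical numeric matrix with samples in rows and genes in columns, not my actual object):

```r
## mat: hypothetical numeric matrix, samples in rows, genes in columns

## Approach 1 (r-bloggers post): log-transform first, then center and scale
pca_log_scaled <- prcomp(log(mat + 1), center = TRUE, scale. = TRUE)

## Approach 2 (jackauty post): center and scale, no prior log transformation
pca_scaled <- prcomp(mat, center = TRUE, scale. = TRUE)
```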
Thank you for your help!
Camilla
I personally perform PCA on log2-transformed normalized counts for standard RNA-seq. Scaling has the advantage that all your genes are on the same scale regardless of expression level, but the disadvantage that lowly-expressed genes have the same impact as highly-expressed genes. One might argue that, in general, genes with higher expression levels contribute more to shaping cellular identity than genes at the edge of the detection limit. There are certainly exceptions, but in general I would trust the more highly-expressed genes simply because you have more information (reads) coming from them. So no, I would not scale the data in a standard PCA for bulk RNA-seq. Single-cell applications are a different story.
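A minimal sketch of that workflow, assuming `norm_counts` is a hypothetical genes x samples matrix of normalized counts (the exact object will depend on your pipeline):

```r
## log2-transform with a pseudocount to compress the dynamic range
## and reduce the influence of extreme values
logcounts <- log2(norm_counts + 1)

## prcomp() expects observations (here: samples) in rows, so transpose;
## center but do not scale, so highly-expressed genes keep more weight
pca <- prcomp(t(logcounts), center = TRUE, scale. = FALSE)

## percent variance explained per PC, handy for axis labels
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
plot(pca$x[, 1], pca$x[, 2],
     xlab = sprintf("PC1 (%.1f%%)", pct_var[1]),
     ylab = sprintf("PC2 (%.1f%%)", pct_var[2]))
```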
So you would log-transform and then set scale=F, right? Or would you not even log-transform?
That depends on your input data. Could you describe your input dataset, please?
My dataset is a bulk RNA-seq dataset (samples in columns, genes in rows); depending on which package I am using, I transpose it or not. The values are RPKMs, i.e. already-normalised data. I know that in theory you should not perform PCA on normalised counts, but I cannot access the FASTQ files.
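In case it helps, this is roughly what I am running at the moment (`rpkm` stands for my genes x samples RPKM matrix; the names are just placeholders):

```r
## rpkm: hypothetical genes x samples matrix of RPKM values
log_rpkm <- log2(rpkm + 1)

## drop genes with zero variance across samples; they carry no
## information for the PCA
log_rpkm <- log_rpkm[apply(log_rpkm, 1, var) > 0, ]

## transpose so samples are the observations (rows);
## scale. = FALSE, following the advice above
pca <- prcomp(t(log_rpkm), center = TRUE, scale. = FALSE)
```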