Hi, I am trying to perform PCA on my dataset, and from reading various blogs I understood that I should log-transform my data to reduce the impact of potential outliers. However, I am unsure whether I should also scale after the log transformation. PCA with and without standardizing gives different results, because centering and scaling sets each variable's mean to 0 and its SD to 1. In this blog https://www.r-bloggers.com/computing-and-visualizing-pca-in-r/ they suggest log-transforming and then centering and scaling the dataset when computing the PCA, whereas for example here https://jackauty.com/pca-and-3d-pca/ when they set scale=TRUE they do not perform a log transformation prior to computing the PCA (and vice versa).
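For concreteness, the two workflows I am comparing look roughly like this (`mat` is just a hypothetical numeric matrix with samples in rows and genes in columns, not my actual object):

```r
## mat: hypothetical numeric matrix, samples in rows, genes in columns

## Approach 1 (r-bloggers post): log-transform first, then center and scale
pca_log_scaled <- prcomp(log(mat + 1), center = TRUE, scale. = TRUE)

## Approach 2 (jackauty post): center and scale, no prior log transformation
pca_scaled <- prcomp(mat, center = TRUE, scale. = TRUE)
```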
Thank you for your help!
Camilla
I personally perform PCA on log2-transformed normalized counts for standard RNA-seq. Scaling has the advantage that all your genes are on the same scale regardless of expression level, but the disadvantage that lowly-expressed genes have the same impact as highly-expressed genes. One might argue that, in general, genes with higher expression levels contribute more to shaping cellular identity than genes at the edge of the detection limit. There are certainly exceptions, but in general I would trust the more highly-expressed genes simply because you have more information (reads) coming from them. So no, I would not scale the data in a standard PCA for bulk RNA-seq. Single-cell applications are a different story.
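A minimal sketch of that workflow, assuming `norm_counts` is a hypothetical genes x samples matrix of normalized counts (the exact object will depend on your pipeline):

```r
## log2-transform with a pseudocount to compress the dynamic range
## and reduce the influence of extreme values
logcounts <- log2(norm_counts + 1)

## prcomp() expects observations (here: samples) in rows, so transpose;
## center but do not scale, so highly-expressed genes keep more weight
pca <- prcomp(t(logcounts), center = TRUE, scale. = FALSE)

## percent variance explained per PC, handy for axis labels
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
plot(pca$x[, 1], pca$x[, 2],
     xlab = sprintf("PC1 (%.1f%%)", pct_var[1]),
     ylab = sprintf("PC2 (%.1f%%)", pct_var[2]))
```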
So you would log-transform and then set scale=F, right? Or would you not even log-transform?
That depends on your input data. Could you describe your input dataset, please?
My dataset is a bulk RNA-seq dataset (samples in columns, genes in rows); depending on which package I am using, I transpose it or not. The values are RPKMs, i.e. already-normalised data. I know that in theory you should not perform PCA on normalised counts, but I cannot access the FASTQ files.
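In case it helps, this is roughly what I am running at the moment (`rpkm` stands for my genes x samples RPKM matrix; the names are just placeholders):

```r
## rpkm: hypothetical genes x samples matrix of RPKM values
log_rpkm <- log2(rpkm + 1)

## drop genes with zero variance across samples; they carry no
## information for the PCA
log_rpkm <- log_rpkm[apply(log_rpkm, 1, var) > 0, ]

## transpose so samples are the observations (rows);
## scale. = FALSE, following the advice above
pca <- prcomp(t(log_rpkm), center = TRUE, scale. = FALSE)
```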