Hi everybody ! I'm trying to use PCA over Rna-seq data just to understand PCA. I saw pcaExplorer and I did not quite understand what it actually does. I have a count matrix of 90 samples (healthy and cancer). I have then thousands of genes. I want to obtain the genes that differ my samples the most for further classification ( I know there are other techniques but I just wanted to exercise a bit, with PCA ). Thats what I have done :
Let z be the transpose of a count matrix, normalized with method TMM = samples as rows and genes as columns
z <- t(counts.tmm)
pca <- prcomp(z, center = TRUE,scale. = TRUE)
summary(pca)
pca.class <- sample.cond$class # here sample.cond$class will be HC or Pancreas (Tumor)
ggbiplot(pca,ellipse=TRUE,groups = pca.class,choices= c(1,2),var.axes=FALSE)
dev.off()
The result is the following :
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 22.6979 8.01951 6.89380 5.85445 5.39476 5.30267 5.2139 5.00639 4.99198 4.96105 4.85406 4.7934
Proportion of Variance 0.2937 0.03667 0.02709 0.01954 0.01659 0.01603 0.0155 0.01429 0.01421 0.01403 0.01343 0.0131
Cumulative Proportion 0.2937 0.33039 0.35749 0.37703 0.39362 0.40965 0.4251 0.43944 0.45365 0.46768 0.48111 0.4942
PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
Standard deviation 4.74638 4.69116 4.66121 4.62983 4.56994 4.55543 4.50351 4.46740 4.44247 4.36019 4.32935 4.3111
Proportion of Variance 0.01284 0.01255 0.01239 0.01222 0.01191 0.01183 0.01156 0.01138 0.01125 0.01084 0.01069 0.0106
Cumulative Proportion 0.50706 0.51960 0.53199 0.54421 0.55612 0.56795 0.57951 0.59089 0.60214 0.61298 0.62367 0.6343
PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36
Standard deviation 4.26250 4.21480 4.15619 4.14941 4.04881 4.01388 3.94854 3.93874 3.91634 3.86539 3.81318 3.79119
Proportion of Variance 0.01036 0.01013 0.00985 0.00982 0.00935 0.00919 0.00889 0.00884 0.00874 0.00852 0.00829 0.00819
Cumulative Proportion 0.64462 0.65475 0.66460 0.67441 0.68376 0.69295 0.70183 0.71068 0.71942 0.72794 0.73623 0.74443
PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48
Standard deviation 3.77294 3.73427 3.72646 3.62194 3.59297 3.56278 3.51904 3.42127 3.38418 3.37216 3.32199 3.30519
Proportion of Variance 0.00812 0.00795 0.00792 0.00748 0.00736 0.00724 0.00706 0.00667 0.00653 0.00648 0.00629 0.00623
Cumulative Proportion 0.75254 0.76049 0.76841 0.77589 0.78325 0.79049 0.79755 0.80422 0.81075 0.81723 0.82352 0.82975
PC49 PC50 PC51 PC52 PC53 PC54 PC55 PC56 PC57 PC58 PC59 PC60
Standard deviation 3.27842 3.2430 3.21268 3.18443 3.17152 3.09140 3.05199 3.04186 3.03361 2.99736 2.98039 2.94492
Proportion of Variance 0.00613 0.0060 0.00588 0.00578 0.00573 0.00545 0.00531 0.00528 0.00525 0.00512 0.00506 0.00494
Cumulative Proportion 0.83588 0.8419 0.84776 0.85354 0.85928 0.86472 0.87003 0.87531 0.88056 0.88568 0.89074 0.89569
PC61 PC62 PC63 PC64 PC65 PC66 PC67 PC68 PC69 PC70 PC71 PC72
Standard deviation 2.92100 2.90968 2.86032 2.8395 2.82278 2.79180 2.74172 2.73852 2.70586 2.68556 2.64009 2.60152
Proportion of Variance 0.00486 0.00483 0.00466 0.0046 0.00454 0.00444 0.00429 0.00428 0.00417 0.00411 0.00397 0.00386
Cumulative Proportion 0.90055 0.90538 0.91004 0.9146 0.91918 0.92363 0.92791 0.93219 0.93636 0.94047 0.94445 0.94831
PC73 PC74 PC75 PC76 PC77 PC78 PC79 PC80 PC81 PC82 PC83 PC84
Standard deviation 2.58597 2.57208 2.52419 2.49683 2.45577 2.43039 2.4041 2.36481 2.31208 2.28884 2.24954 2.20218
Proportion of Variance 0.00381 0.00377 0.00363 0.00355 0.00344 0.00337 0.0033 0.00319 0.00305 0.00299 0.00289 0.00276
Cumulative Proportion 0.95212 0.95589 0.95952 0.96308 0.96652 0.96988 0.9732 0.97637 0.97941 0.98240 0.98529 0.98805
PC85 PC86 PC87 PC88 PC89 PC90
Standard deviation 2.18536 2.1373 2.06398 1.94600 1.88876 7.458e-15
Proportion of Variance 0.00272 0.0026 0.00243 0.00216 0.00203 0.000e+00
Cumulative Proportion 0.99077 0.9934 0.99581 0.99797 1.00000 1.000e+00
And the image looks like this :
My Questions are :
1) Why the PC are the samples? Shouldn't they be the genes? If I do var.axes=TRUE
it actually shows me the arrows that are the genes so is correct but I did not get this point.
2) How can I get the genes that make my samples differ by condition the best from the PCA that i have computed?
If I do not transpose the matrix :
z <- counts.tmm # samples are in the columns and genes as rows
pca <- prcomp(z, center = TRUE,scale. = FALSE)
summary(pca)
ggbiplot(pca,ellipse=TRUE,choices= c(1,2),var.axes=FALSE)
dev.off()
thats what I get :
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 13.0772 4.07904 2.49366 2.04542 1.82674 1.50391 1.46282 1.41308 1.39098 1.33582 1.32702 1.30732
Proportion of Variance 0.5983 0.05821 0.02175 0.01464 0.01167 0.00791 0.00749 0.00699 0.00677 0.00624 0.00616 0.00598
Cumulative Proportion 0.5983 0.65649 0.67825 0.69288 0.70456 0.71247 0.71996 0.72694 0.73371 0.73996 0.74612 0.75210
PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
Standard deviation 1.29400 1.28199 1.26799 1.25235 1.24683 1.21349 1.19880 1.18708 1.18165 1.17765 1.16516 1.15687
Proportion of Variance 0.00586 0.00575 0.00562 0.00549 0.00544 0.00515 0.00503 0.00493 0.00488 0.00485 0.00475 0.00468
Cumulative Proportion 0.75795 0.76370 0.76933 0.77482 0.78025 0.78541 0.79043 0.79536 0.80025 0.80510 0.80985 0.81453
PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36
Standard deviation 1.15105 1.13993 1.12768 1.1218 1.11720 1.10388 1.09192 1.08421 1.07808 1.07271 1.06507 1.05828
Proportion of Variance 0.00464 0.00455 0.00445 0.0044 0.00437 0.00426 0.00417 0.00411 0.00407 0.00403 0.00397 0.00392
Cumulative Proportion 0.81917 0.82371 0.82816 0.8326 0.83693 0.84119 0.84537 0.84948 0.85354 0.85757 0.86154 0.86546
PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48 PC49
Standard deviation 1.05420 1.0415 1.03585 1.02542 1.0145 1.00792 0.99140 0.97666 0.97385 0.96267 0.9568 0.95232 0.94773
Proportion of Variance 0.00389 0.0038 0.00375 0.00368 0.0036 0.00355 0.00344 0.00334 0.00332 0.00324 0.0032 0.00317 0.00314
Cumulative Proportion 0.86935 0.8731 0.87689 0.88057 0.8842 0.88773 0.89117 0.89450 0.89782 0.90106 0.9043 0.90744 0.91058
PC50 PC51 PC52 PC53 PC54 PC55 PC56 PC57 PC58 PC59 PC60 PC61
Standard deviation 0.9416 0.93214 0.92504 0.91419 0.91151 0.89366 0.88537 0.88206 0.87722 0.86586 0.86438 0.85371
Proportion of Variance 0.0031 0.00304 0.00299 0.00292 0.00291 0.00279 0.00274 0.00272 0.00269 0.00262 0.00261 0.00255
Cumulative Proportion 0.9137 0.91672 0.91972 0.92264 0.92555 0.92834 0.93108 0.93381 0.93650 0.93912 0.94173 0.94428
PC62 PC63 PC64 PC65 PC66 PC67 PC68 PC69 PC70 PC71 PC72 PC73
Standard deviation 0.8449 0.84420 0.83856 0.83176 0.82325 0.80923 0.7924 0.78655 0.78025 0.77028 0.76573 0.75772
Proportion of Variance 0.0025 0.00249 0.00246 0.00242 0.00237 0.00229 0.0022 0.00216 0.00213 0.00208 0.00205 0.00201
Cumulative Proportion 0.9468 0.94928 0.95174 0.95416 0.95653 0.95882 0.9610 0.96318 0.96531 0.96738 0.96944 0.97144
PC74 PC75 PC76 PC77 PC78 PC79 PC80 PC81 PC82 PC83 PC84 PC85
Standard deviation 0.75455 0.74868 0.74169 0.73250 0.73053 0.72453 0.71426 0.70595 0.6973 0.68962 0.6761 0.66666
Proportion of Variance 0.00199 0.00196 0.00192 0.00188 0.00187 0.00184 0.00178 0.00174 0.0017 0.00166 0.0016 0.00155
Cumulative Proportion 0.97344 0.97540 0.97732 0.97920 0.98107 0.98290 0.98469 0.98643 0.9881 0.98980 0.9914 0.99295
PC86 PC87 PC88 PC89 PC90
Standard deviation 0.66129 0.6555 0.64137 0.61824 0.59554
Proportion of Variance 0.00153 0.0015 0.00144 0.00134 0.00124
Cumulative Proportion 0.99448 0.9960 0.99742 0.99876 1.00000
But the list of PC is always 90. Now here it makes sense that they are 90 (samples) but before I don't get why is 90 when the genes are thousands.
4) Last question , for my purpose (obtain genes that shows more difference between my samples), is it more suitable the first (PC as genes) or the second (PC as samples)?
From pcaExplorer I have seen that they relate this last PCA (samples as columns and genes as rows) with the study of the Heatmap so I assume that this last one is the one to use for gene view? But I don't get why.
I have actually seen a video where it describe really well PCA and it matches with what you actually say. So yes samples on rows and genes on columns will probably give me what I actually want.
Great!
And about those other techniques... you said you're looking for "the genes that make my samples differ by condition the best." PCA looks for a couple of dimensions out of many that best spread out the samples, while something like LDA (linear discriminant analysis) would look for dimensions that best spread out the supplied groups of samples, so maybe that'd be something to look at. There are nonlinear methods too, like t-SNE (t-distributed stochastic neighbor embedding). Here's a geometry-focused comparison between PCA/LDA/t-SNE that at a glance looks like an intuitive explanation.