Hi all,
I'm looking for covariates in my expression dataset (CAGE-seq count data). I have 120 samples and about 20,000 expression values. I found this tutorial, which is really easy to use:
https://tgmstat.wordpress.com/2013/11/28/computing-and-visualizing-pca-in-r/
So I adapted it to my data, which has 120 sample columns and 20,000 rows.
pca_data <- t(log(norm.data + 1))
dim(pca_data)
[1] 120 20000
cage.pca <- prcomp(pca_data,
                   center = TRUE,
                   scale. = TRUE)
# plot method
plot(cage.pca, type = "l")
# summary method
summary(cage.pca)
It looks to me as if the first 6 PCs account for most of the variation.
However, when I run the summary method it reports 119 principal components. I am a bit confused now and don't know which components I need to use as covariates. Also, the cumulative proportion at PC6 is 0.43128, whereas from the plot I would have expected a lot more. Could anyone help with this?
You don't need to use 6 principal components. From the scree plot you can see that the first 3 capture most of the variability; adding 3 more doesn't really add that much.
OK, thanks. To extract these components I can just do this, right?
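One way to do this, as a minimal sketch: the per-sample scores are stored in the `x` element of the `prcomp` result, one row per sample and one column per component. The simulated matrix below only stands in for `norm.data` (200 features instead of 20,000), and `n_pcs` and `covariates` are placeholder names, not anything from the original post.

```r
# Simulated stand-in for the real count matrix (assumption):
# 200 features in rows, 120 samples in columns.
set.seed(1)
norm.data <- matrix(rpois(200 * 120, lambda = 20), nrow = 200, ncol = 120)

# Same preprocessing as in the question: log-transform, samples in rows.
pca_data <- t(log(norm.data + 1))
cage.pca <- prcomp(pca_data, center = TRUE, scale. = TRUE)

# Scores of the first n_pcs components, one row per sample.
n_pcs <- 3  # placeholder; choose based on the variance explained
covariates <- cage.pca$x[, 1:n_pcs, drop = FALSE]

dim(covariates)       # samples x n_pcs
colnames(covariates)  # component names assigned by prcomp
```

The columns of `covariates` can then be added to a sample-level design table, provided the row order matches the sample order used elsewhere.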
Were you concerned because plot() only showed 10 components while summary() showed 119? The reason is that, by default, plot() shows at most 10 components. So although the plot suggests that the first 3-6 components explain a large amount of variance, it is a bit misleading, because a lot of the variance is also captured in the components not shown. summary() reports the cumulative variance explained and tells you that the first 6 components only explain ~43% of the variance.
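To check this directly, the proportion of variance per component can be computed from `sdev`, and the scree plot can be asked to show every component via the `npcs` argument of `screeplot()`. This is only a sketch on simulated data standing in for `norm.data`; the 80% threshold and the names `var_explained` and `cum_var` are illustrative assumptions.

```r
# Simulated stand-in for the real count matrix (assumption):
# 200 features in rows, 120 samples in columns.
set.seed(1)
norm.data <- matrix(rpois(200 * 120, lambda = 20), nrow = 200, ncol = 120)
cage.pca <- prcomp(t(log(norm.data + 1)), center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total;
# these are the same quantities that summary(cage.pca) reports.
var_explained <- cage.pca$sdev^2 / sum(cage.pca$sdev^2)
cum_var <- cumsum(var_explained)

cum_var[6]                # cumulative proportion after 6 components
which(cum_var >= 0.8)[1]  # components needed to reach 80% (arbitrary cutoff)

# Scree plot over all components instead of the default maximum of 10.
screeplot(cage.pca, npcs = length(cage.pca$sdev), type = "lines")
```

Looking at the full curve rather than the first 10 bars makes it easier to see how much variance sits in the long tail of later components.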