Hi.
First of all, I don't know if the title of the post is quite right, but I couldn't think of a better way to express it.
So let's assume that I have this table with expression results from two experiments (Control (10x) and Cancer (13x)):
| Gene/treatment | Control_1 | Control_2 | Cancer_1 | Cancer_2 | Cancer_3 |
|----------------|----------:|----------:|---------:|---------:|---------:|
| gene1 | 2.65 | 3.01 | 2.20 | 3.65 | 4.01 |
| gene2 | 1.54 | 1.27 | 2.01 | 2.65 | 3.11 |
| gene3 | 1.34 | 1.00 | 2.50 | 1.65 | 2.01 |
and I want to run a PCA on them. Until now, every tutorial I have read starts by transposing the array so that columns become rows and rows become columns. By following them, I get a plot where the dots are labelled with the column names (Control_1, Control_2, Cancer_1, etc.) while the eigenvectors are labelled with the gene names (gene1, gene2, gene3, etc.).
What I actually want is the opposite. I want Control_1, Control_2, Cancer_1, Cancer_2 and Cancer_3 to be my eigenvectors and the dots to be the expression values of the genes. That way I want to see whether, for example, the expressions of some genes in the Cancer samples are grouped together. After trying many different ideas, I still couldn't figure out how to achieve that.
Here is the code I used to produce the first plot that I described:
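library(ggfortify)  # provides autoplot() for prcomp objects
# transpose the data frame so that samples become rows and genes become columns
pcaData = as.data.frame(t(pcaData))
# add a new column with the type of experiment (Control, Cancer)
pcaData["type"] = c(rep("Control",10),rep("Cancer",13))
# run the PCA on the gene columns only, i.e. everything except "type"
# (after the transpose the 23 samples are rows, so 1:23 would pick the first 23 genes)
autoplot(prcomp(pcaData[, names(pcaData) != "type"]),
         data = pcaData,
         colour = 'type',
         label = TRUE,
         label.size = 3,
         loadings.label = TRUE,
         loadings.label.size = 3
)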
So how can I compute the opposite PCA? Is the transposition of the matrix needed or not? Any idea, hint or resource on how to approach this will be very helpful.
Thank you.
Why don't you use row-wise hierarchical clustering if you are only interested in seeing the clusters of genes that discriminate Control vs Cancer? Principal component analysis is usually done to reduce the dimensions of a multi-dimensional problem and then project it onto 2 principal axes to see the effect. For example, if you have 1000 genes and 10 samples, it is difficult to visualize how the samples differ according to all 1000 genes, but if you project along the 2 PC axes where the variation is maximal, you might clearly see how they vary. You will then have 10 points on the PC plane, corresponding to the 10 samples. Now if you do the "opposite" PCA, you will get 1000 points, one per gene; I don't know whether that solves your problem or complicates it even further!
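For reference, the row-wise hierarchical clustering suggested above could look roughly like this in base R. It is only a minimal sketch on the three toy genes from the table in the question; the Euclidean distance, the average linkage and the cut into k = 2 clusters are assumptions chosen for illustration, not choices made anywhere in this thread.

```r
# Toy expression matrix copied from the table in the question
# (rows = genes, columns = samples).
expr <- rbind(
  gene1 = c(2.65, 3.01, 2.20, 3.65, 4.01),
  gene2 = c(1.54, 1.27, 2.01, 2.65, 3.11),
  gene3 = c(1.34, 1.00, 2.50, 1.65, 2.01)
)
colnames(expr) <- c("Control_1", "Control_2", "Cancer_1", "Cancer_2", "Cancer_3")

# dist() works on rows, so this is the distance between genes.
# Euclidean distance is an assumption; correlation-based distances are also common.
gene_dist <- dist(expr, method = "euclidean")

# Average linkage is likewise an illustrative choice.
gene_tree <- hclust(gene_dist, method = "average")

# Cut the tree into two groups of genes (k = 2 is arbitrary here).
gene_clusters <- cutree(gene_tree, k = 2)
print(gene_clusters)
```

Calling `plot(gene_tree)` would draw the dendrogram. With a real list of 100 or 200 differentially expressed genes, the same three calls apply unchanged.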
Yes, you are right that with 18 thousand genes it would be a mess. But what if you have only 100 or 200 genes left after a differential expression analysis? I think that would give a much clearer plot. Anyway, the question is whether such a plot can be created at all.
I don't see any problem in creating such a plot. Conceptually, and following my earlier analogy, you are trying to plot 1000 points in a 10-dimensional space (instead of 10 points in a 1000-dimensional space). How well PCA can resolve the differences among these 1000 points (i.e. the variance explained by each principal component) has to be checked, though. My guess is that it will resolve very little. And I'd still suggest hierarchical clustering of the genes.
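The check mentioned above, how much of the variance each principal component explains, can be sketched as follows. This is a hedged example on the toy table from the question, not a result for any real dataset; the object names (`expr`, `pca_genes`, `var_explained`) are made up for illustration.

```r
# Toy expression matrix from the question (rows = genes, columns = samples).
expr <- rbind(
  gene1 = c(2.65, 3.01, 2.20, 3.65, 4.01),
  gene2 = c(1.54, 1.27, 2.01, 2.65, 3.11),
  gene3 = c(1.34, 1.00, 2.50, 1.65, 2.01)
)
colnames(expr) <- c("Control_1", "Control_2", "Cancer_1", "Cancer_2", "Cancer_3")

# PCA with genes as observations (no transposition).
pca_genes <- prcomp(expr, center = TRUE)

# Proportion of the total variance carried by each component.
var_explained <- pca_genes$sdev^2 / sum(pca_genes$sdev^2)
print(round(var_explained, 3))

# summary() reports the same proportions plus the cumulative proportion.
print(summary(pca_genes)$importance)
```

If the first two proportions together are small, a 2D plot of the genes will hide most of the structure, which is the concern raised above.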
Ok, now coming to your problem: You can follow exactly this https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html
You don't need to transpose the matrix. Try to see the similarity between the iris data plotted there and your own data.
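To make the "no transposition" point concrete, here is a minimal sketch using the toy table from the question. `prcomp()` treats rows as observations and columns as variables, so passing the genes-by-samples table as it is gives one point per gene and one loading vector per sample. The object names are hypothetical and the data are only the three example genes.

```r
# Toy expression table from the question: rows = genes, columns = samples.
expr <- data.frame(
  Control_1 = c(2.65, 1.54, 1.34),
  Control_2 = c(3.01, 1.27, 1.00),
  Cancer_1  = c(2.20, 2.01, 2.50),
  Cancer_2  = c(3.65, 2.65, 1.65),
  Cancer_3  = c(4.01, 3.11, 2.01),
  row.names = c("gene1", "gene2", "gene3")
)

# No t() here: genes stay in rows, so they become the observations.
pca_genes <- prcomp(expr, center = TRUE)

# Scores: one row per gene (the dots in the plot).
print(pca_genes$x)

# Loadings: one row per sample (the arrows in the plot).
print(pca_genes$rotation)
```

From here, `biplot(pca_genes)` in base R, or `autoplot(pca_genes, label = TRUE, loadings = TRUE, loadings.label = TRUE)` with ggfortify loaded, should draw the genes as labelled points and the samples as loading arrows.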
I promise to try the hierarchical clustering :-). As for this specific link, I have already seen it. The difference between that dataset and mine is that the iris dataset doesn't seem to have any replicates for its samples, and it also has a last column called 'Species' that is later used to distinguish the groups by colour. I don't have such a column. So the problem is how to create such a data frame (like iris) from my dataset.
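One possible way to build an iris-like data frame from a table like the one in the question is to derive the group column from the sample names. This is a sketch under the assumption that every sample name has the form `<group>_<replicate number>`; the names `samples` and `gene_cols` are made up for illustration.

```r
# Toy expression matrix from the question (rows = genes, columns = samples).
expr <- rbind(
  gene1 = c(2.65, 3.01, 2.20, 3.65, 4.01),
  gene2 = c(1.54, 1.27, 2.01, 2.65, 3.11),
  gene3 = c(1.34, 1.00, 2.50, 1.65, 2.01)
)
colnames(expr) <- c("Control_1", "Control_2", "Cancer_1", "Cancer_2", "Cancer_3")

# iris has one row per observation, so put the samples in rows.
samples <- as.data.frame(t(expr))
gene_cols <- colnames(samples)

# Strip the trailing "_<number>" to get the group, like iris$Species.
# Assumes names look like "Control_1", "Cancer_13", and so on.
samples$type <- sub("_[0-9]+$", "", rownames(samples))
print(samples)

# PCA on the numeric gene columns only; "type" is kept for colouring.
pca_samples <- prcomp(samples[, gene_cols])
```

With ggfortify loaded, `autoplot(pca_samples, data = samples, colour = "type")` would then colour the points by group in the same way the vignette colours iris by species.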
Hmm, I see your problem now. So the names of your Cancer / Control samples should be the row names.
Obviously, the genes have to be in columns and the Cancer / Control samples in rows for the above to work.
Doesn't saying that mean that a transposition of the initial data frame is needed? Anyway, to be clear, I'm a little more confused now. Either you did what I had already done and posted in the initial post, or you did something that I couldn't understand. So if you have time and are willing, please post more complete code. And thanks a lot for this conversation :-)
Yes, you are right, I'm messing up everything :) I'll post complete code once my head is a bit clearer, as I think I am getting the idea of what you want.