Hi all,
I have gene expression for say 9 different samples, all the same age, for say 3 different tissues (see table below, I have three samples from hippocampus, three from cerebellum and three from blood). I want to put these into a PCA plot to see if the gene expression for similar tissues clumps together in the plot.
The problem is that I have overlapping, but not completely overlapping, gene sets between the samples.
So essentially I have a table like this (Hipp= hippocampus, Cer = Cerebellum, Bloo = Blood);
Hipp1 Hipp2 Hipp3 Cer1 Cer2 Cer3 Bloo1 Bloo2 Bloo3
Gene1 0.01 0.01 0.02 0.01 0.03 0.05 1.0 1.2 1.3
Gene2 - - - 0.1 0.2 0.3 1.2 1.4 1.5
Gene3 0.54 0.4 0.4 0.4 0.4 0.5 0.5 1.2 3.4
Gene4 1.2 1.3 4.4 1.3 3.4 3.1 - - -
Gene5 - - - - - - 0.3 0.3 0.2
So for example here, since hippocampus and cerebellum are both parts of the brain, and the blood is different, and all the samples are the same age, I want to know if the brain samples cluster together on a PCA plot more than say hippocampus and blood. But here, I don't have expression data for gene 2 for the hippocampus, and I don't have data for gene 4 for blood, or gene 5 I have no data for the brain.
So I'm wondering if there is a package where I can read in the above table, and it will give me back a PCA plot of the genes, considering that there is some missing data. I do not have the option to remove genes with some missing data, because in reality, I have >10,000 samples and 20,000 genes, so there is at least one missing value for all the genes (obviously, I will consider filtering the genes with almost exclusively missing data).
I'm sorry that I have no code to show, I don't know where to start/what package to use so this is just a thought experiment at the minut, will post the code if I can get something working.
Since numbers are read counts, why not go with missing == 0 counts?
The numbers are log2 transformed microarray data (actually in this example I just want to show people what I mean so the numbers are made up), but I can't set them to 0, since maybe there is a high expression for that gene, I just don't have the info for it?
So I suppose I'm wondering if there's a PCA package in either python or R, where I can read in the above table, and then there's a parameter that says something like "set missing data to NA"", and then before inputting my table into the package, I would just change all the "-" to NA.
I appreciate the help, but maybe the answer is not as trivial as I thought, i've just found this, so maybe using PCA with explicitly defined missing data is not possible, and I have to somehow impute the data.
Missing data is not an issue specific to PCA. However, the Bioconductor package pcaMethods implements various algorithms to deal with missing data.
Wait, how did you get the data then?
because it's a combination of a couple of three different data sets (i.e. two brain parts and blood) that are comparable. Thank you for the help, I appreciate it.
But if the RNA-seq (I suppose) was performed in all three tissues then a highly expressed gene would be picked up, no?