Question

Combining gene expression data into a principal components analysis - how to deal with missing data?

0

Entering edit mode

7.7 years ago

Tom ▴ 50

Hi all,

I have gene expression for say 9 different samples, all the same age, for say 3 different tissues (see table below, I have three samples from hippocampus, three from cerebellum and three from blood). I want to put these into a PCA plot to see if the gene expression for similar tissues clumps together in the plot.

The problem is that I have overlapping, but not completely overlapping, gene sets between the samples.

So essentially I have a table like this (Hipp= hippocampus, Cer = Cerebellum, Bloo = Blood);

          Hipp1     Hipp2    Hipp3   Cer1   Cer2   Cer3   Bloo1   Bloo2   Bloo3
Gene1      0.01     0.01     0.02    0.01   0.03   0.05   1.0     1.2     1.3
Gene2      -        -        -       0.1    0.2    0.3    1.2     1.4     1.5 
Gene3      0.54     0.4      0.4     0.4    0.4    0.5    0.5     1.2     3.4
Gene4      1.2      1.3      4.4     1.3    3.4    3.1    -       -       -        
Gene5      -        -        -       -      -      -      0.3    0.3      0.2

So for example here, since hippocampus and cerebellum are both parts of the brain, and the blood is different, and all the samples are the same age, I want to know if the brain samples cluster together on a PCA plot more than say hippocampus and blood. But here, I don't have expression data for gene 2 for the hippocampus, and I don't have data for gene 4 for blood, or gene 5 I have no data for the brain.

So I'm wondering if there is a package where I can read in the above table, and it will give me back a PCA plot of the genes, considering that there is some missing data. I do not have the option to remove genes with some missing data, because in reality, I have >10,000 samples and 20,000 genes, so there is at least one missing value for all the genes (obviously, I will consider filtering the genes with almost exclusively missing data).

I'm sorry that I have no code to show, I don't know where to start/what package to use so this is just a thought experiment at the minut, will post the code if I can get something working.

pca missing data gene expression r • 2.3k views

ADD COMMENT • link 7.7 years ago by Tom ▴ 50

0

Entering edit mode

Since numbers are read counts, why not go with missing == 0 counts?

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

The numbers are log2 transformed microarray data (actually in this example I just want to show people what I mean so the numbers are made up), but I can't set them to 0, since maybe there is a high expression for that gene, I just don't have the info for it?

ADD REPLY • link 7.7 years ago by Tom ▴ 50

0

Entering edit mode

So I suppose I'm wondering if there's a PCA package in either python or R, where I can read in the above table, and then there's a parameter that says something like "set missing data to NA"", and then before inputting my table into the package, I would just change all the "-" to NA.

ADD REPLY • link 7.7 years ago by Tom ▴ 50

0

Entering edit mode

I appreciate the help, but maybe the answer is not as trivial as I thought, i've just found this, so maybe using PCA with explicitly defined missing data is not possible, and I have to somehow impute the data.

ADD REPLY • link 7.7 years ago by Tom ▴ 50

1

Entering edit mode

Missing data is not an issue specific to PCA. However, the Bioconductor package pcaMethods implements various algorithms to deal with missing data.

ADD REPLY • link 7.7 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

since maybe there is a high expression for that gene, I just don't have the info for it?

Wait, how did you get the data then?

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

because it's a combination of a couple of three different data sets (i.e. two brain parts and blood) that are comparable. Thank you for the help, I appreciate it.

ADD REPLY • link 7.7 years ago by Tom ▴ 50

0

Entering edit mode

But if the RNA-seq (I suppose) was performed in all three tissues then a highly expressed gene would be picked up, no?

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k