I am looking for input on ways to quantify how much different factors/covariates contribute to gene expression patterns. For example, let's say we have RNA-seq (or microarray) data from male and female mice that have received one of two treatments (or no treatment), and the samples have been prepped in two different batches. We would like a bar plot (or similar) showing how much each covariate, (1) sex, (2) treatment and (3) batch, contributes to the variation in gene expression.
Some ideas:
- Use PCA on the expression matrix and calculate the correlation between PC1 scores and the covariate vectors.
- Use some machine learning algorithm to try to discriminate between male/female (or treatments, or batches), and evaluate by cross-validation how well this works. Use the prediction performance as a measure of how strongly the covariate is reflected in the expression profile.
- Is there a way to find out this sort of summarized importance from the linear models fitted as part of limma/edgeR/DESeq workflows?
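To make the first idea concrete, here is a rough sketch in Python/NumPy. It is only an illustration under assumptions: the data are simulated (12 samples, 60 genes, a strong sex effect, a weaker batch effect, no treatment effect), the function names `one_hot`, `r_squared` and `pc_covariate_association` are made up for this example, and because the covariates are categorical it uses the one-way ANOVA R^2 of each PC's scores on each covariate rather than a plain correlation. It also looks at the first few PCs instead of PC1 only, since a covariate may load on a later component.

```python
import numpy as np

def one_hot(labels):
    """Indicator matrix (n_samples x n_levels) for a categorical covariate."""
    levels = sorted(set(labels))
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels])

def r_squared(y, labels):
    """Fraction of the variance of y explained by group membership (one-way ANOVA R^2)."""
    X = one_hot(labels)                       # the indicator columns span the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)

def pc_covariate_association(expr, covariates, n_pcs=3):
    """expr: samples x genes matrix (assumed log-scale). Returns the variance share
    of the first n_pcs components and, per covariate, the R^2 against each PC."""
    centered = expr - expr.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U * S                            # PC scores, one column per component
    var_share = S**2 / np.sum(S**2)
    table = {name: [r_squared(scores[:, k], labels) for k in range(n_pcs)]
             for name, labels in covariates.items()}
    return var_share[:n_pcs], table

# Simulated, balanced 2 x 3 x 2 design (hypothetical data, not from a real experiment)
rng = np.random.default_rng(0)
sex = ["F", "M"] * 6
treatment = ["ctrl", "ctrl", "A", "A", "B", "B"] * 2
batch = ["b1"] * 6 + ["b2"] * 6
expr = rng.normal(0, 0.2, size=(12, 60))
expr[:, :20] += 3.0 * (np.array(sex) == "M")[:, None]       # strong sex effect
expr[:, 20:40] += 1.0 * (np.array(batch) == "b2")[:, None]  # weaker batch effect

var_share, assoc = pc_covariate_association(
    expr, {"sex": sex, "treatment": treatment, "batch": batch})
for name, r2 in assoc.items():
    print(name, [round(v, 3) for v in r2])
```

Multiplying each R^2 by the corresponding PC's variance share and summing over PCs would give one number per covariate that could go into a bar plot, though that is just one possible way to summarize it.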
I'd be grateful for any thoughts.
Do you want this for each gene or some sort of summarized metric for the whole experiment (I'm guessing the latter from your mention of just using PCA)?
Yes, a summarized metric for the whole experiment.
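One way such an experiment-wide summary could be computed from per-gene linear models is sketched below. This is not what limma/edgeR/DESeq report themselves; it is a plain least-squares illustration under assumptions: simulated log-scale data, main effects only, and made-up helper names (`design`, `rss`, `variance_fractions`). For each covariate it takes the extra sum of squares explained when that covariate is added to a model containing the others, sums this over all genes, and divides by the total sum of squares.

```python
import numpy as np

def design(columns, n):
    """Intercept plus dummy columns (first level as reference) for each covariate."""
    blocks = [np.ones((n, 1))]
    for labels in columns.values():
        levels = sorted(set(labels))[1:]
        blocks.append(np.array([[float(lab == lev) for lev in levels] for lab in labels]))
    return np.hstack(blocks)

def rss(X, Y):
    """Per-gene residual sum of squares after regressing Y (samples x genes) on X."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    return np.sum(resid**2, axis=0)

def variance_fractions(expr, covariates):
    """Share of the total sum of squares (over all genes) uniquely attributable
    to each covariate, plus the residual share."""
    n = expr.shape[0]
    full = rss(design(covariates, n), expr)
    total = np.sum((expr - expr.mean(axis=0))**2, axis=0)
    out = {}
    for name in covariates:
        reduced = {k: v for k, v in covariates.items() if k != name}
        extra = rss(design(reduced, n), expr) - full   # SS gained by adding `name`
        out[name] = extra.sum() / total.sum()
    out["residual"] = full.sum() / total.sum()
    return out

# Simulated, balanced 2 x 3 x 2 design (hypothetical data, not from a real experiment)
rng = np.random.default_rng(0)
sex = ["F", "M"] * 6
treatment = ["ctrl", "ctrl", "A", "A", "B", "B"] * 2
batch = ["b1"] * 6 + ["b2"] * 6
expr = rng.normal(0, 0.2, size=(12, 60))
expr[:, :20] += 3.0 * (np.array(sex) == "M")[:, None]       # strong sex effect
expr[:, 20:40] += 1.0 * (np.array(batch) == "b2")[:, None]  # weaker batch effect

fractions = variance_fractions(
    expr, {"sex": sex, "treatment": treatment, "batch": batch})
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```

The values in `fractions` can be fed straight into a bar plot. In a balanced design like this one the shares add up to one; with an unbalanced design the covariates overlap, the shares no longer sum to one, and the attribution depends on how the shared part is assigned. Summing over genes also means highly variable genes dominate the summary; averaging per-gene fractions instead would weight all genes equally.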