Question

Best Way To Quantify Influence Of Different Covariates On Gene Expression?

7

Entering edit mode

10.7 years ago

Mikael Huss 4.8k

I am looking for input on ways to put numbers on how much different factors/covariates contribute to gene expression patterns. For example, let's say we have RNA-seq (or microarray) data from male and female mice that have received two different treatments (or no treatment) and the samples have been prepped in two different batches. Now we would like a bar plot (or similar) of how much the different covariates (1) sex, (2) treatment and (3) batch effects contribute, respectively, to the gene expression.

Some ideas:

Use PCA on the expression matrix and calculate the correlation between PC1 scores and the covariate vectors.
Use some machine learning algorithm to try to discriminate between male/female (or treatments, or batches), and evaluate by cross-validation how well this is working. Use the prediction performance as a measure of how strongly the covariate is reflected in the expression profile.
Is there a way to find out this sort of summarized importance from the linear models fitted as part of limma/edgeR/DESeq workflows?

I'd be grateful for any thoughts.

rna-seq microarray • 13k views

ADD COMMENT • link updated 8.6 years ago by chris86 ▴ 400 • written 10.7 years ago by Mikael Huss 4.8k

0

Entering edit mode

Do you want this for each gene or some sort of summarized metric for the whole experiment (I'm guessing the latter from your mention of just using PCA)?

ADD REPLY • link 10.7 years ago by Devon Ryan 105k

0

Entering edit mode

Yes, a summarized metric for the whole experiment.

ADD REPLY • link 10.7 years ago by Mikael Huss 4.8k

2

Entering edit mode

10.7 years ago

brentp 24k

This is a good question, in ecology, for example, one might have a single set of data and they will scrutinize a model for years and find the "best" one. In genomics, we choose one model and apply it to N (million) probes / measurements / genes / CpG's / SNPs etc.

You could look at the distribution of p-values, either of your covariate of interest or of the potential confounder with and without the confounder. You can also look at the distribution of covariate estimates. If you see genomic inflation without correction for some important covariate, then that might be a good thing to check.

You could also do something like SVA or PEER on a simple model and see how the inferred surrogate variables correlate with the covariates that you did not include in the model (this is similar to your PCA idea).

Of course, it could be that different covariates only have an effect at certain sites in the genome--we know this will be true in the case of, gender and the sex chromosomes. For other effects, it will be more subtle. In the end, you will have to rely on biological insight to a large extent.

ADD COMMENT • link 10.7 years ago by brentp 24k

0

Entering edit mode

The SVA idea is interesting - I have been playing around with this package recently but hadn't thought to apply it to this question.

ADD REPLY • link 10.7 years ago by Mikael Huss 4.8k

1

Entering edit mode

10.7 years ago

Charles Warden 8.3k

I would check the signal distributions: if the male and female samples look very different (for example, I have seen this be the case for germ-line samples), I would just analyze them separately.

The most direct answer to your question is that you can use a 2-way or 3-way ANOVA to factor out the influence of covariates. It's not super rigorous, but I think you can use something like the average sum of squares to give an idea about the relative contribution. For example, I know Partek makes a sources of variation box-plot like you described, and I think that is how they do it.

DESeq, edgeR, etc. also have workflows for multivariate analysis (I believe using a generalized linear model in both of those cases).

ADD COMMENT • link 10.7 years ago by Charles Warden 8.3k

1

Entering edit mode

Thanks for your comments. I have used DESeq, edgeR and limma to look at these data, but I am not sure what is the best way to use information about the fitted coefficients to something in general about the covariates. I should say that in my case the covariates are not in fact sex, treatment and batch but three other things; those were just an example.

ADD REPLY • link 10.7 years ago by Mikael Huss 4.8k

1

Entering edit mode

8.6 years ago

chris86 ▴ 400

Edit: This is much better. https://github.com/dswatson/bioplotr/blob/master/R/plot_drivers.R

ADD COMMENT • link 5.3 years ago by chris86 ▴ 400

1

Entering edit mode

Hi @chris86, I had a look at this package, and I believe it allows to assess the effect of factors / categorical variables (e.g. sex) on RNA-seq data, but it does not allow to assess the effect of numerical covariates (e.g. weight) on the data.

ADD REPLY • link 5.3 years ago by rodd ▴ 250

0

Entering edit mode

This is what you want: https://github.com/dswatson/bioplotr/blob/master/R/plot_drivers.R.

I suggest everyone check out this function for assessing the drivers of variation, it requires installation of David Watsons bioplotr package.

ADD REPLY • link 5.3 years ago by chris86 ▴ 400

score 11 · Accepted Answer · 2014-04-11

11

Entering edit mode

10.7 years ago

Irsan ★ 7.8k

As a first step, it might be useful to remove genes with count 0 in more than 90% of your samples. Then obtain normalized and log-transformed expression estimates (logFPKM; maybe edgeR and voom() can be of use?) for all your samples and genes. I assume you already have obtained them somehow. Then prepare your data in such a way that you have a data frame where a row represents 1 measurement and the columns geneId, sex, treatment, batch, logFpkm (maybe aggregate() or melt() functions can be useful to reshape your data). Then perform anova and look at the sums of squares and the p-values of the F-test. Most likely, geneId has the biggest influence on your logFpkms and hopefully treatment is a good second.

fit <- lm(logFpkm ~  geneId + sex + treatment + batch,data=yourData)
anova(fit)

When you have many samples and many genes you can also take a random sample of your genes (say 1000) to get an idea and then scale up to bigger sample sizes.

Edit: Brentp, you are right. Its maybe better to exclude the geneId from the linear model

ADD COMMENT • link 8.6 years ago by Irsan ★ 7.8k

0

Entering edit mode

you mean to have a single table with n_samples * n_genes and fit one model to that?

ADD REPLY • link 10.7 years ago by brentp 24k

1

Entering edit mode

so take an expression matrix, melt it, add covariate info and fit model / do anova. Maybe there are better alternatives but it comes into my mind. I just tried it on one of my datasets and it said that geneId is by far the biggest source of variance, then tissue, then patientGroup, then batch which makes sense for my dataset.

ADD REPLY • link 10.7 years ago by Irsan ★ 7.8k

0

Entering edit mode

Thanks for the tip. I had been thinking along these lines but hadn't hatched the idea about using geneId as a feature. I'll definitely try it.

ADD REPLY • link 10.7 years ago by Mikael Huss 4.8k

0

Entering edit mode

you can also leave it out of the model of course. Its not surprising that it is the biggest source of variation.