Question

How to construct an anova to test diferences bewteen conditions taking into account gene expression variability?

0

Entering edit mode

4.4 years ago

A. Domingues ★ 2.7k

I need help to figure out what is an adequate statistical approach to tackle the question if two conditions are different mean values taking into account variability of multiple genes.

Two conditions, control and treatment, with 4 and 2 biological replicates respectively (don't say anything, I know), and for each a value of expression for a few thousand genes, the value in this case a ratio:

set.seed(100)

dat <- data.frame(
    condition = c(rep("control", 40), rep("treatment", 20)),
    gene = paste("gene_", rep(letters[1:10], 6), sep = ""),
    ratio = sample(x = 0:100, size = 60,replace = TRUE) / 100
)

dat <- dat[order(dat$gene), ]
dat$replicate <- rep(c(1:4, 1:2), times = 10)

dat <- within(dat, {
  gene   <- factor(gene)
  replicate   <- factor(replicate)
  condition <- factor(condition)
})
head(dat)

The question is: is there a difference in the mean ratio between treatment vs control, but taking into account the variability of each gene.

My first though is a repeated measures one-way anova and the term Error to account of variance in the replicates:

dat_aov <- with(myData.mean,
                   aov(stress ~ music * image +
                       Error(PID / (music * image)))
)

dat_aov <- aov(ratio ~ condition + Error(gene/ratio), data=dat)
summary(dat_aov)

But I am not sure if I am constructing this correctly or if this is the right approach.

Edit

This is RNA-seq data but:

the values are strand ratio, % of reads mapped in the + strand, and
I am no so much interested in which genes change this % but if there is a global effect.

I did consider limma or DESeq but because I don't think they will give me answer for 2 nor work with data which is not count, I discarded them. Happy to be shown that I am wrong on this.

Anova statistics • 2.3k views

ADD COMMENT • link 4.4 years ago by A. Domingues ★ 2.7k

1

Entering edit mode

Is this RNA-seq, if so why not using DESeq2/edgeR/limma-voom? Edit: Ah rpolicastro had the same thought :)

ADD REPLY • link 4.4 years ago by ATpoint 87k

0

Entering edit mode

It is RNA-seq but (I) I am looking strand ratio (say % of reads mapped in the + strand) and (ii) I am no so much interested in which genes change this % but if there is a global effect. I will edit my post.

ADD REPLY • link 4.4 years ago by A. Domingues ★ 2.7k

1

Entering edit mode

Is this RNA-seq? If so, use a program like edgeR or DESeq2 for expression analysis.

ADD REPLY • link 4.4 years ago by rpolicastro 13k

0

Entering edit mode

See edit to my post.

ADD REPLY • link 4.4 years ago by A. Domingues ★ 2.7k

0

Entering edit mode

If the data is continuous, I think "standard" limma could handle it (not voom). You should check you meet linear assumptions and maybe have to transform the data. For example your ratio being bound 0 to 1 can be turned to a nice gaussian by a logit transformation (this is often done when using limma to analyze (0,1) methylation values, but to gain homoscedasticity). (by logit I mean log2(ratio/(1 - ratio))).

What do you mean by being interested in "global effect"? If you mean that you are interested in how many genes display changes maybe it would be better to plot distributions of statistics across all the genes?

ADD REPLY • link 4.4 years ago by Papyrus ★ 3.0k

0

Entering edit mode

The buzzword to google for A. Domingues would be "limma-trend" pipeline I guess.

ADD REPLY • link 4.4 years ago by ATpoint 87k

1

Entering edit mode

I think simply use "limma", because I seem to recall the "trend" option was also developed to start from RNA-seq data and model the typical mean-variance heteroskedasticity (even if the count data were transformed "on the way")? (I have not used it, but see this). It seems to me that the data to be analyzed here is much different (even if it at some point it was created out of RNA-seq counts)

After all, limma is a linear model framework to analyze continuous data, I have seen it used for other kinds of data such as log-transformed metabolomics, etc. I would say the way to go is to: carefully inspect the data distribution at hand, and if it seems to fit standard limma, do standard limma.

Edit:

Also, for modelling variance at different levels within limma, in general, I can recall these different approaches which may be worth reading into:

arrayWeights(): for accounting for variance of SAMPLES for example outliers without filtering them out

eBayes(robust = T): for accounting for variance of GENES (hypervariable/hypovariable) also used in RNA-seq analysis

lmFit(method = “robust”): for accounting for outlier OBSERVATIONS within genes (performs robust linear regression, but is in disuse)

eBayes(trend = T): for accounting for cases when the variance of the observations depends on the mean also used in RNA-seq analysis

ADD REPLY • link 4.4 years ago by Papyrus ★ 3.0k

0

Entering edit mode

I will look it up. Cheers.

ADD REPLY • link 4.4 years ago by A. Domingues ★ 2.7k

0

Entering edit mode

Thanks of the reply. It is indeed a ratio (0 to 1). I am simply interested in knowing if there is a change in the mean ratio between conditions - for instance the mean ratio 0.5 in control vs 0.6 in condition.

My strategy was first to average the ratios for all genes per sample, and then t.test condition vs control. Then I thought of averaging the replicates for each gene and plotting the distribution, testing differences with an non-parametric test (I think this goes in the direction you suggested).

But in these approaches the information of the gene expression variability of each gene across replicates is lost, so maybe there is a better approach for this.

ADD REPLY • link 4.4 years ago by A. Domingues ★ 2.7k

0

Entering edit mode

Well if you pool your measurements of all the genes across the replicates (I mean, concatenate them) instead of averaging the replicates, and compare those two distributions of ratios (control vs treatment), the variability of the replicates for each gene will somewhat be there... but I admit it may not be the best method (and I don't know if I fully understood the goal: what I describe would examine the question of: "are the ratios of genes for one strand usually higher/lower in the control samples vs the treatment samples")

(Nonetheless the large number of points in both distributions will make any comparison to always be statistically significant, so it would be better to just use a descriptive measure of effect size)

ADD REPLY • link 4.4 years ago by Papyrus ★ 3.0k

0

Entering edit mode

"are the ratios of genes for one strand usually higher/lower in the control samples vs the treatment samples"

That is pretty much want I want to know :) And still don't think limma is the solution here because I will get a result per gene, not to the question above (unless I missing something).

ADD REPLY • link 4.4 years ago by A. Domingues ★ 2.7k