I need help to figure out what is an adequate statistical approach to tackle the question if two conditions are different mean values taking into account variability of multiple genes.
Two conditions, control and treatment, with 4 and 2 biological replicates respectively (don't say anything, I know), and for each a value of expression for a few thousand genes, the value in this case a ratio:
set.seed(100)
dat <- data.frame(
condition = c(rep("control", 40), rep("treatment", 20)),
gene = paste("gene_", rep(letters[1:10], 6), sep = ""),
ratio = sample(x = 0:100, size = 60,replace = TRUE) / 100
)
dat <- dat[order(dat$gene), ]
dat$replicate <- rep(c(1:4, 1:2), times = 10)
dat <- within(dat, {
gene <- factor(gene)
replicate <- factor(replicate)
condition <- factor(condition)
})
head(dat)
The question is: is there a difference in the mean ratio between treatment vs control, but taking into account the variability of each gene.
My first though is a repeated measures one-way anova and the term Error
to account of variance in the replicates:
dat_aov <- with(myData.mean,
aov(stress ~ music * image +
Error(PID / (music * image)))
)
dat_aov <- aov(ratio ~ condition + Error(gene/ratio), data=dat)
summary(dat_aov)
But I am not sure if I am constructing this correctly or if this is the right approach.
Edit
This is RNA-seq data but:
- the values are strand ratio, % of reads mapped in the + strand, and
- I am no so much interested in which genes change this % but if there is a global effect.
I did consider limma
or DESeq
but because I don't think they will give me answer for 2 nor work with data which is not count, I discarded them. Happy to be shown that I am wrong on this.
Is this RNA-seq, if so why not using DESeq2/edgeR/limma-voom? Edit: Ah rpolicastro had the same thought :)
It is RNA-seq but (I) I am looking strand ratio (say % of reads mapped in the + strand) and (ii) I am no so much interested in which genes change this % but if there is a global effect. I will edit my post.
Is this RNA-seq? If so, use a program like edgeR or DESeq2 for expression analysis.
See edit to my post.
If the data is continuous, I think "standard" limma could handle it (not voom). You should check you meet linear assumptions and maybe have to transform the data. For example your ratio being bound 0 to 1 can be turned to a nice gaussian by a logit transformation (this is often done when using limma to analyze (0,1) methylation values, but to gain homoscedasticity). (by logit I mean
log2(ratio/(1 - ratio))
).What do you mean by being interested in "global effect"? If you mean that you are interested in how many genes display changes maybe it would be better to plot distributions of statistics across all the genes?
The buzzword to google for A. Domingues would be "limma-trend" pipeline I guess.
I think simply use "limma", because I seem to recall the "trend" option was also developed to start from RNA-seq data and model the typical mean-variance heteroskedasticity (even if the count data were transformed "on the way")? (I have not used it, but see this). It seems to me that the data to be analyzed here is much different (even if it at some point it was created out of RNA-seq counts)
After all, limma is a linear model framework to analyze continuous data, I have seen it used for other kinds of data such as log-transformed metabolomics, etc. I would say the way to go is to: carefully inspect the data distribution at hand, and if it seems to fit standard limma, do standard limma.
Edit:
Also, for modelling variance at different levels within limma, in general, I can recall these different approaches which may be worth reading into:
arrayWeights()
: for accounting for variance of SAMPLES for example outliers without filtering them outeBayes(robust = T)
: for accounting for variance of GENES (hypervariable/hypovariable) also used in RNA-seq analysislmFit(method = “robust”)
: for accounting for outlier OBSERVATIONS within genes (performs robust linear regression, but is in disuse)eBayes(trend = T)
: for accounting for cases when the variance of the observations depends on the mean also used in RNA-seq analysisI will look it up. Cheers.
Thanks of the reply. It is indeed a ratio (0 to 1). I am simply interested in knowing if there is a change in the mean ratio between conditions - for instance the mean ratio 0.5 in control vs 0.6 in condition.
My strategy was first to average the ratios for all genes per sample, and then t.test condition vs control. Then I thought of averaging the replicates for each gene and plotting the distribution, testing differences with an non-parametric test (I think this goes in the direction you suggested).
But in these approaches the information of the gene expression variability of each gene across replicates is lost, so maybe there is a better approach for this.
Well if you pool your measurements of all the genes across the replicates (I mean, concatenate them) instead of averaging the replicates, and compare those two distributions of ratios (control vs treatment), the variability of the replicates for each gene will somewhat be there... but I admit it may not be the best method (and I don't know if I fully understood the goal: what I describe would examine the question of: "are the ratios of genes for one strand usually higher/lower in the control samples vs the treatment samples")
(Nonetheless the large number of points in both distributions will make any comparison to always be statistically significant, so it would be better to just use a descriptive measure of effect size)
That is pretty much want I want to know :) And still don't think
limma
is the solution here because I will get a result per gene, not to the question above (unless I missing something).