Hello everybody.
I want to check whether I have correctly understood the general statistical procedure that edgeR uses to find genes with significant differential expression between several conditions. I have read both the manual and some of the group's publications such as link. However, I am not sure that I have understood the whole process. I apologise in advance if some of the questions or assumptions are too obvious or incorrect, as I am trying to understand generalized linear models too.
Leaving out certain parts such as removing genes that do not have a minimum expression level and normalizing library sizes, the first thing is to estimate the dispersion of the genes, which is the sum of technical and biological variation. We can assume that all genes have a common dispersion (the average dispersion of all genes), that the dispersion of each gene is different (tagwise), or calculate a dispersion based on the average of the dispersions of genes with a similar level of counts (trended).
Once the dispersion has been estimated, the next step is to fit a GLM with a log link (a log-linear model), log µ_gi = x_i^T β_g + log N_i, for each gene (sorry for the formula, I don't know how to format it properly here). The aim is to find the parameters of the negative binomial distribution from which the observed counts are most likely to come (a maximum likelihood method is used for this). When the model has been fitted for a gene, we have an estimate of the mean number of reads that should map onto it for each of the conditions considered which, together with the dispersion parameter calculated above, allows us to calculate the variance of the gene counts for each experimental group. Finally, with an F-test, we can check whether there are significant differences between the variances/levels of gene expression.
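To make the log-linear model concrete: with a log link and a library-size offset, the fitted mean for gene g in sample i is N_i * exp(x_i^T β_g). The following is only a minimal Python sketch of that relationship, not edgeR code. The function name `fitted_mean`, the two-group design and all numbers are hypothetical and chosen purely for illustration.

```python
import math

def fitted_mean(design_row, beta, library_size):
    """Expected count under log(mu) = x^T beta + log(N),
    i.e. mu = N * exp(x^T beta)."""
    linear_predictor = sum(x * b for x, b in zip(design_row, beta))
    return library_size * math.exp(linear_predictor)

# Hypothetical coefficients for one gene:
#   beta[0] = log relative abundance in the control group
#   beta[1] = log fold change (treated vs control), on the natural-log scale
beta = [math.log(1e-5), math.log(2.0)]

# Hypothetical design rows: intercept plus a treatment indicator
mu_control = fitted_mean([1, 0], beta, library_size=2_000_000)
mu_treated = fitted_mean([1, 1], beta, library_size=2_000_000)

print(mu_control, mu_treated)  # roughly 20 and 40 expected reads
```

In this sketch the offset log N_i only rescales the expected count by the library size, so the coefficients β_g describe relative abundance rather than raw counts.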
Is this correct? Thanks in advance.
Hi Gordon, thanks for your response. It seems that I ended up mixing up the concepts of dispersion and CV.
If I understand correctly, the dispersion estimated in the first part of the analysis reflects biological variation only. We can then use it, together with the estimated number of reads that should map onto the gene under the specified conditions, to calculate the variances for the F-test.
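The relationship between dispersion, CV and variance can be sketched in a few lines of Python. This assumes the usual negative binomial parameterisation, in which var = µ + φµ², so that the squared total CV is 1/µ + φ and the biological coefficient of variation (BCV) is sqrt(φ). The function names and the example values are hypothetical, and this is an illustration rather than anything edgeR itself runs.

```python
import math

def nb_variance(mu, dispersion):
    """Negative binomial variance: Poisson part (mu) plus
    the extra-Poisson part (dispersion * mu^2)."""
    return mu + dispersion * mu ** 2

def total_cv_squared(mu, dispersion):
    """Squared total CV: technical part (1/mu) plus the dispersion."""
    return 1.0 / mu + dispersion

def bcv(dispersion):
    """Biological coefficient of variation: square root of the dispersion."""
    return math.sqrt(dispersion)

# Hypothetical gene: fitted mean of 100 reads, dispersion of 0.04
mu, phi = 100.0, 0.04
variance = nb_variance(mu, phi)

print(variance)                    # about 500
print(bcv(phi))                    # about 0.2, i.e. 20% biological CV
print(total_cv_squared(mu, phi))   # about 0.05, equal to variance / mu^2
```

The 1/µ term shrinks as the mean count grows, which is why the dispersion term dominates the total CV for highly expressed genes.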
I am unsure about the context of your question because the paper that you link to doesn't use F-tests. Nevertheless, I have expanded my answer above and the additions will hopefully address your question.
Thanks Gordon