This is a good question and you were wise to ask it because I don't think the obvious approach is the correct one.
For RNA Seq, the observed variance that you get from calculating the variance in the replicates will have a mixture of biological + technical variance.
If the libraries are good, the total variance you observe in biological replicates will be as follows:
Total variance=Biological variance + library prep and sequencing variance + Poisson variance (shot noise, counting noise) from sampling the molecules
To isolate the biological variance I would:
1) Normalize the data
2) Calculate the total variance as the variance of the normalized counts across replicates
3) ignore the library prep and sequencing variance by assuming it is zero (which is usually OK, you can also estimate it from technical replicates if you have them)
4) Estimate the poisson variance as the mean of the expression. This is because the variance of a Poisson distribution is the mean of that distribution.
5)Therefore, the biological variance of the gene can be calculated as:
Total variance=Biological variance + [zero for technical]+the observed mean expression [for Poisson]
Biological variance = Total variance - the observed mean expression
6) Then you can sort by the isolated biological variance. If you don't do this you will see a lot more variance in low expression genes, and low expression genes will appear to be more variable* just because the noise is higher (i.e. 1 vs 2 is more likely to vary as an artifact of sampling noise than 100 vs 200, which is certainly not counting noise).
This blog explains the ideas better:
http://michelebusby.tumblr.com/post/26913184737/thinking-about-designing-rna-seq-experiments-to
The supplement here goes into depth on all the formulas:
https://academic.oup.com/bioinformatics/article/29/5/656/252753
You don't need to read the paper (trust me, I wrote it), just the first section of the supplement.
Supplement 1. Sources of variance in expression measurements
Here is a toy example in python. You can also do it in Excel (don't at me):
import numpy as np
measurements=[5, 10, 18]
v=np.var(measurements)
print("Observed variance:", v)
m=np.mean(measurements)
print("Mean expression:", m)
poissonVariance=m
biologicalVariance=v-poissonVariance
print("Isolated Biological Variance:", biologicalVariance)
Observed variance: 28.666666666666668
Mean expression: 11.0
Isolated Biological Variance: 17.666666666666668
*Itay Tirosh has some old papers in yeast looking at variability in expression and what it is correlated with but I can't remember all the details.
Hello Michele, If I understand your approach correctly, you are calculating the total variance on assumption that the data points( normalized read counts for a certain gene across replicates) follow a normal distribution and then calculating the shot noise on assumption that the same data points follow a Poisson distribution? On another note, are three replicates enough to make assumptions about the total variance for a given gene read counts ? Thank you
I don't think you need to assume a normal distribution but I haven't looked at the theory in a while. The main thing is that you can add uncorrelated variances to get the total variance. They are probably something like a lognormal distribution but the tail is constrained by biology so you can usually assume a normal, in my opinion, and that's what I did for the application we built. Three replicates isn't great but it's what you have.