In the voom article it is stated that log-cpm of RNA-Seq data can be treated as analogous to values from a microarray experiment, with the difference that values from RNA-Seq data do not have constant variances. I understand constant variances here to mean that as the mean changes the variance does not.
1) Why is it required that the variance is constant across the mean?
2) Why does assuming constant variance work for microarrays?
I'm interested in a clustering analysis, not in differential expression. Since constant variance is required (only) for the linear modeling part, is voom needed at all for clustering?
Yes, because clustering is driven by high-variance features, so you need to stabilize the variance. In DE the worry is that different subjects are in different variance regimes; in clustering the worry is that different genes are in different variance regimes. We tend to use rlog from the DESeq2 package before doing things like clustering. I don't know if voom would be equally effective, but my guess is yes.
This sounds true in theory, but Gordon Smyth says here that the precision weights generated by voom cannot be easily combined with the gene counts.
In which case I recommend using rlog or vst then.
rlog is a bit like cpm(counts, log=TRUE, prior.count=3), except the prior count is calculated in a principled way separately for each gene.
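To make the prior-count point concrete, here is a rough base-R sketch on simulated Poisson counts. Everything in it is an assumption for illustration: the data are made up, and log_cpm() is a hypothetical helper, not a package function. It scales the prior count by relative library size, which is my understanding of what edgeR's cpm() does with log=TRUE, so treat it as an approximation rather than a drop-in replacement.

```r
# Simulated counts: 200 genes x 6 samples with no true differences between
# samples, so any variance across samples is pure counting noise.
set.seed(1)
n_genes   <- 200
n_samples <- 6
mu     <- rep(c(2, 2000), each = n_genes / 2)  # low- and high-abundance genes
counts <- matrix(rpois(n_genes * n_samples, lambda = rep(mu, n_samples)),
                 nrow = n_genes, ncol = n_samples)

# Hypothetical helper (assumption): log2 counts-per-million with a prior
# count scaled by relative library size.
log_cpm <- function(counts, prior.count) {
  lib.size <- colSums(counts)
  prior    <- prior.count * lib.size / mean(lib.size)
  adj.lib  <- lib.size + 2 * prior
  log2(t((t(counts) + prior) / adj.lib) * 1e6)
}

# Per-gene variance across samples on the log-cpm scale.
row_var <- function(m) apply(m, 1, var)
low     <- mu == 2

v_pc05 <- row_var(log_cpm(counts, prior.count = 0.5))
v_pc3  <- row_var(log_cpm(counts, prior.count = 3))

cat("prior.count = 0.5: median variance, low vs high abundance:",
    median(v_pc05[low]), median(v_pc05[!low]), "\n")
cat("prior.count = 3:   median variance, low vs high abundance:",
    median(v_pc3[low]), median(v_pc3[!low]), "\n")
```

With a small prior count, the low-abundance genes should show much larger variance on the log scale than the high-abundance genes, even though nothing biological differs between samples. Those are the genes that would dominate a distance-based clustering. A larger prior count shrinks that noise but does not remove the trend, which is the gap that rlog and vst are meant to close.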