I have an RNA-seq dataset that I want to filter down to the 5000 most variable genes. What I want to do is:
perform size factor estimation with DESeq2::estimateSizeFactors
transform to an approximately Gaussian distribution with DESeq2::rlog
filter the most variable genes with rowVars
Do I perform the filtering step before or after the transformation step? I tried both and it gave me varying results.
@ATpoint: somehow your answer is not displayed under this thread, but only in my private notifications:
"You would filter for these genes after the transformation because the whole point of the transformation is to remove the dependency of the variance on the mean (so on the expression level), as you want to filter for "biologically variable" genes that are different between samples and not for high variance due to expression level (which is technical)."
Didn't I already account for technical variation with the size factors? I thought the transformation is used to meet the Gaussian-distribution requirement of most statistical tests, not to normalize for technical biases. As such, I would expect strong agreement between the most variable genes whether they are computed from size-factor-normalized transformed or untransformed counts.
The normalization via size factors accounts for differences in sequencing depth and library composition. A log2 transformation (or vst/rlog) is necessary to remove the dependency of the variance on the mean; see the answer from @yoogstrate and my comment.
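To make the order of operations concrete, here is a minimal sketch of the workflow discussed above: estimate size factors, transform all genes, and only then rank by variance. It is an illustration under assumptions, not a definitive pipeline: the data are simulated with `makeExampleDESeqDataSet`, the dataset size and the cutoff `n_top` are hypothetical stand-ins for a real dataset and the 5000 in the question, and `blind = TRUE` is just one reasonable choice. The last lines also rank on normalized but untransformed counts so the two gene lists can be compared.

```r
library(DESeq2)

set.seed(1)
# Simulated counts stand in for a real dataset (hypothetical sizes).
dds <- makeExampleDESeqDataSet(n = 2000, m = 8)
n_top <- 500  # stand-in for the 5000 used in the question

# 1. Size factors: correct for sequencing depth and library composition.
dds <- estimateSizeFactors(dds)

# 2. Transform ALL genes first, so that variance no longer tracks the mean.
rld <- rlog(dds, blind = TRUE)
mat <- assay(rld)

# 3. Rank genes by variance on the transformed scale and keep the top n_top.
rv <- matrixStats::rowVars(mat)
top_transformed <- rownames(mat)[order(rv, decreasing = TRUE)[seq_len(n_top)]]
mat_top <- mat[top_transformed, ]

# For comparison only: the same ranking on normalized, untransformed counts,
# where highly expressed genes tend to dominate the variance.
norm_counts <- counts(dds, normalized = TRUE)
rv_raw <- matrixStats::rowVars(norm_counts)
top_untransformed <- rownames(norm_counts)[order(rv_raw, decreasing = TRUE)[seq_len(n_top)]]

# Number of genes shared by the two selections.
overlap <- length(intersect(top_transformed, top_untransformed))
```

On real data, `overlap` gives a quick feel for how much the selection depends on the transformation. For larger datasets, `vst(dds, blind = TRUE)` can be swapped in for `rlog` as a faster alternative.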
See also for the normalization itself: