RNAseq: Filtering before or after transformation?
2.5 years ago

I have an RNAseq dataset that I want to filter strongly on the 5000 most variable genes. What I want to do is:

perform size factor estimation with DESeq2::estimateSizeFactors

transform to a Gaussian distribution with DESeq2::rlog

filter the most variable genes with rowVars

Do I perform the filtering step before or after the transformation step? I tried both and it gave me varying results.

DESeq2 transformation RNAseq filtering • 1.7k views

@ATpoint: somehow your answer is not displayed under this thread, but only in my private notifications:

"You would filter for these genes after the transformation, because the whole point of the transformation is to remove the dependence of the variance on the mean (that is, on the expression level). You want to filter for "biologically variable" genes that differ between samples, not for genes with high variance due to expression level (which is technical)."

Didn't I already account for technical variation with the size factors? I thought the transformation is used to meet the requirement of a Gaussian distribution made by most statistical tests, not to normalize for technical biases. As such, I would expect strong agreement between the most variable genes whether they are computed from size-factor-normalized transformed or untransformed counts.


The normalization via size factors accounts for differences in sequencing depth and library composition. The log2 (or vst/rlog) is necessary to remove the dependence of the variance on the mean; see the answer from @yhoogstrate and my comment.

See also for the normalization itself:

2.5 years ago
yhoogstrate ▴ 150

After. One of the reasons you transform is to stabilize the variance, so estimating the variance after transformation is more reliable. There is some theory about this in the DESeq manual and in one of Simon Anders' presentations:

https://bioconductor.org/help/course-materials/2014/CSAMA2014/2_Tuesday/lectures/DESeq2-Anders.pdf
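For the workflow in the question, a minimal sketch of "transform first, then pick the most variable genes" could look like the following. It runs on simulated data from makeExampleDESeqDataSet rather than a real dataset, and the object names (rld, mat, gene_var, top_idx, mat_top) as well as n_top = 500 are only illustrative; with a real dataset you would set n_top to 5000 as in the question.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for a real dataset (assumption): 5000 genes, 12 samples
dds <- makeExampleDESeqDataSet(n = 5000, m = 12)

# Step 1: size factors (sequencing depth / library composition)
dds <- estimateSizeFactors(dds)

# Step 2: variance-stabilizing transformation; rlog as in the question,
# blind = TRUE so the sample groups are not used for the transformation
rld <- rlog(dds, blind = TRUE)
mat <- assay(rld)

# Step 3: rank genes by their variance on the transformed scale
n_top    <- 500  # illustrative; the question uses 5000 on a larger dataset
gene_var <- matrixStats::rowVars(mat)
top_idx  <- order(gene_var, decreasing = TRUE)[seq_len(min(n_top, nrow(mat)))]

# Keep only the most variable genes for downstream use (PCA, clustering, ...)
mat_top <- mat[top_idx, , drop = FALSE]
dim(mat_top)
```

For larger sample numbers, vst() should be usable in place of rlog() in the same position and is considerably faster.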


You can see this nicely with the simple code below. Without a transformation, the variance increases with the mean expression (roughly linearly on a log-log plot); the transformation (here I just use log2) removes most of that dependence:

library(DESeq2)

# Simulated example data with 5000 genes
dds <- makeExampleDESeqDataSet(n=5000)
dds <- estimateSizeFactors(dds)

norm <- counts(dds, normalized=TRUE) # size-factor-normalized counts
ntd  <- log2(norm+1)                 # normalized counts on the log2 scale

# Without transformation the variance increases with the mean expression
# (both axes are log10-scaled only to make the plot readable)
plot(x=log10(rowMeans(norm)+1), y=log10(rowVars(norm)+1), pch=20)

# With the transformation most of that dependence is removed
plot(x=rowMeans(ntd), y=rowVars(ntd), pch=20)
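To check the expectation from the question that both orderings agree, one could compare the top-variance gene sets directly. This is only a sketch on the same kind of simulated data; top_genes() is a hypothetical helper written for this example, and n_top = 500 is an arbitrary choice.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 12)
dds <- estimateSizeFactors(dds)

norm <- counts(dds, normalized = TRUE)  # normalized, untransformed
ntd  <- log2(norm + 1)                  # normalized, log2-transformed

# Hypothetical helper: names of the n_top rows with the highest variance
top_genes <- function(mat, n_top) {
  v <- matrixStats::rowVars(mat)
  rownames(mat)[order(v, decreasing = TRUE)[seq_len(n_top)]]
}

n_top   <- 500
top_raw <- top_genes(norm, n_top)  # selected before transformation
top_log <- top_genes(ntd, n_top)   # selected after transformation

# Fraction of genes shared between the two selections
overlap <- length(intersect(top_raw, top_log)) / n_top
overlap

# Typical expression level of each selection, on the normalized count scale
median(rowMeans(norm[top_raw, ]))
median(rowMeans(norm[top_log, ]))
```

If the selection made on untransformed counts has a much higher typical expression level, it is mostly picking highly expressed genes rather than genes that vary between samples, which is the mean-variance dependence described above.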


Thanks for the clarification! The PDF helped a lot.
