Hello everyone,
I have a question about filtering out low-variance genes prior to WGCNA. I have RNA-seq data, pre-filtered for low counts and transformed with DESeq2 vst. Could you help me decide which of the two methods below is more appropriate for my data?
library(genefilter)
# Keep genes whose interquartile range across samples exceeds 0.25
filter <- function(x) IQR(x, na.rm = TRUE) > 0.25
filtered_genes <- genefilter(df, filter)
df_filtered <- df[filtered_genes, ]
or
# Keep genes at or above the 25th percentile of row-wise variance
gene_var <- apply(data, 1, var)
data <- data[gene_var >= quantile(gene_var, 0.25), ]
Thank you very much for your help!
Penny
I would add that some filtering is probably still worthwhile. You want to exclude noise, that is, genes that are either not expressed at all (=0) or show very little variation between samples. No variation means no correlated changes in expression, which is the very thing WGCNA measures. You could simply plot the row-wise variance and choose a cutoff visually to exclude genes with low information content. Example:
Based on this you could keep the top 10,000 genes, as this is roughly the inflection point of the curve. Note that the y-axis is already log10-scaled; on a linear scale the drop is even sharper. Keeping the top 10k would exclude the majority of genes, which most likely do not contribute anything to the analysis as they probably do not carry any meaningful information.
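
For anyone who wants to reproduce this kind of plot, here is a minimal sketch in base R. It runs on simulated data, so the matrix vst_mat, the per-gene spread gene_sd and the cutoff n_keep are all placeholders made up for illustration, not values from the post above. With real data you would replace vst_mat with the VST-transformed expression matrix (genes in rows, samples in columns) and pick n_keep from where your own curve bends.

```r
# Simulated stand-in for a VST-transformed matrix (genes in rows, samples in columns).
# All values here are hypothetical; substitute your own matrix for vst_mat.
set.seed(1)
n_genes   <- 2000
n_samples <- 12
gene_sd   <- rexp(n_genes, rate = 5)   # made-up per-gene spread
vst_mat <- matrix(
  rnorm(n_genes * n_samples, mean = 8, sd = gene_sd),  # sd recycles down the rows
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Row-wise variance, sorted from most to least variable gene
gene_var   <- apply(vst_mat, 1, var)
ranked_var <- sort(gene_var, decreasing = TRUE)

# Ranked variance on a log10 y-axis, to look for the bend in the curve
plot(seq_along(ranked_var), ranked_var, log = "y", type = "l",
     xlab = "Genes ranked by variance",
     ylab = "Row-wise variance (log10 scale)")

# Cutoff chosen by eye from the plot; 500 is an arbitrary value for this toy data
n_keep <- 500
abline(v = n_keep, lty = 2)

# Keep only the n_keep most variable genes
keep    <- names(ranked_var)[seq_len(min(n_keep, length(ranked_var)))]
vst_top <- vst_mat[keep, , drop = FALSE]
```

The resulting vst_top would then be transposed (samples in rows) before being passed to WGCNA. For large matrices, a dedicated row-variance function may be faster than apply, but apply keeps the sketch dependency-free.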