I would like to bring this topic up again. From what I have read so far in papers, online, and in Bioconductor workflows, it seems that gene expression data sets are first preprocessed (filtering, normalization, log-transformation, ...), then a differential expression analysis is done (DESeq2, edgeR, ...), and afterwards a pattern-mining approach (e.g. clustering) is applied. For the latter, a feature selection method is used. A common example seems to be the rowVars function from the genefilter R package:
topVarGenes <- head(order(rowVars(dataset), decreasing = TRUE), 50)
I have also seen other approaches, e.g. applying InformationGain, ReliefF, etc., which are well-established methods. I was wondering, however, why the results from the differential expression analysis are not used for feature selection, as originally suggested here? Or are they used, but just poorly documented? What is the state of the art here?
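To make the two selection strategies concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the matrix, the group labels, the spiked-in effect, and the cutoff of 20 genes are made up, and limma's moderated t-test stands in for whichever differential expression method (DESeq2, edgeR, ...) is actually used.

```r
library(limma)

set.seed(1)

# Hypothetical log-expression matrix: 200 genes x 12 samples
n_genes <- 200
group <- factor(rep(c("control", "treated"), each = 6))
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("sample", 1:12)))

# Spike a group effect into the first 10 genes (illustrative only)
expr[1:10, group == "treated"] <- expr[1:10, group == "treated"] + 4

# (a) Unsupervised selection: top 20 genes by variance across all samples
gene_var <- apply(expr, 1, var)
top_var <- rownames(expr)[head(order(gene_var, decreasing = TRUE), 20)]

# (b) Supervised selection: top 20 genes by differential expression p-value
design <- model.matrix(~ group)
fit <- eBayes(lmFit(expr, design))
top_de <- rownames(topTable(fit, coef = 2, number = 20, sort.by = "P"))

# Genes picked by both strategies
overlap <- intersect(top_var, top_de)
```

The variance ranking never looks at the group labels, while the differential expression ranking depends on them entirely, so on real data the two lists can differ substantially.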
I was planning on using voom transform and edgeR. I haven't used DESeq2 nor the vst transform - I'll look into those. Thanks Sean!
Probably obvious, but just for posterity sake, one would not want to use voom in concert with edgeR for analysis since edgeR needs raw counts. You could process with voom and then use limma, though; alternatively, you could use the raw counts from HTSeq as direct input to edgeR.
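A rough sketch of the two routes described above, using simulated counts in place of real HTSeq output. The count matrix, sample names, and two-group design are hypothetical, and the quasi-likelihood test is just one of the testing options edgeR offers.

```r
library(limma)
library(edgeR)

set.seed(1)

# Hypothetical raw count matrix: 500 genes x 6 samples
counts <- matrix(rnbinom(500 * 6, mu = 100, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

# Shared starting point: raw counts plus normalization factors
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)

# Route 1: voom transform, then limma
v <- voom(dge, design)
fit_limma <- eBayes(lmFit(v, design))
res_limma <- topTable(fit_limma, coef = 2, number = Inf)

# Route 2: raw counts straight into edgeR (no voom involved)
dge <- estimateDisp(dge, design)
fit_edger <- glmQLFit(dge, design)
res_edger <- topTags(glmQLFTest(fit_edger, coef = 2), n = Inf)$table
```

Both routes start from the same raw counts; the voom output only ever goes into limma, never into edgeR.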
Ah, yes - I wasn't very specific. I am not using these together. I am actually developing a biomarker, so I will try and test multiple combinations of parameters and methods (that hopefully will make sense as a combo). Thanks for the reminder.