Question

Feature Selection Methods For RNA-seq Data?

11.1 years ago

antass ▴ 30

I am working with RNA-seq data (raw counts from HTSeq as well as RPKM values from Cufflinks) and want to apply feature selection. For microarray data, I would usually consider linear modeling, random forests, or R packages such as glmnet.

Are there any feature selection experts out there who could recommend RNA-seq specific FS methods, preferably implemented in R?

rna-seq • 5.7k views

ADD COMMENT • link updated 7.2 years ago by cindy.perscheid ▴ 100 • written 11.1 years ago by antass ▴ 30

score 5 · Answer 1 · 2013-10-21

11.1 years ago

Sean Davis 27k

Linear models are possible using edgeR and DESeq2, among others. Random forests should still be applicable. If you use something like voom (limma) or vst (DESeq) to transform to more bell-shaped data, many other approaches are probably applicable, as well.

ADD COMMENT • link 11.1 years ago by Sean Davis 27k

I was planning on using the voom transform and edgeR. I haven't used DESeq2 or the vst transform yet, so I'll look into those. Thanks Sean!

ADD REPLY • link 11.1 years ago by antass ▴ 30

Probably obvious, but just for posterity's sake: one would not want to use voom in concert with edgeR, since edgeR needs raw counts. You could process with voom and then use limma, though; alternatively, you could use the raw counts from HTSeq as direct input to edgeR.
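
To make the two routes concrete, here is a minimal sketch, assuming the Bioconductor packages edgeR and limma are installed. The count matrix is simulated and merely stands in for an HTSeq count table; the two-group design, the object names, and the cutoff of 20 genes are illustrative assumptions, not anything prescribed in this thread.

```r
# Assumes Bioconductor packages edgeR and limma are installed.
library(edgeR)
library(limma)

set.seed(1)
# Simulated raw counts: 200 genes x 6 samples (placeholder for HTSeq output)
counts <- matrix(rnbinom(200 * 6, mu = 100, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
group  <- factor(rep(c("A", "B"), each = 3))   # hypothetical two-group design
design <- model.matrix(~ group)

# Both routes start from raw counts plus normalization factors
dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)

# Route 1: voom transform, then limma
v        <- voom(dge, design)
fit_voom <- eBayes(lmFit(v, design))
top_voom <- topTable(fit_voom, coef = 2, number = 20)

# Route 2: raw counts go straight into edgeR (no voom involved)
dge_disp  <- estimateDisp(dge, design)
fit_edger <- glmQLFit(dge_disp, design)
top_edger <- topTags(glmQLFTest(fit_edger, coef = 2), n = 20)$table
```

Either ranked table could then serve as a candidate feature list. Note that the voom output `v` is only ever passed to limma functions, never to edgeR.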

ADD REPLY • link 11.1 years ago by Sean Davis 27k

Ah, yes - I wasn't very specific. I am not using these together. I am actually developing a biomarker, so I will try and test multiple combinations of parameters and methods (that hopefully will make sense as a combo). Thanks for the reminder.

ADD REPLY • link 11.1 years ago by antass ▴ 30

score 0 · Answer 2 · 2017-09-26

I would like to bring this topic up again. From what I have read so far in papers, online, and in Bioconductor workflows, it seems that gene expression data sets are first preprocessed (filtering, normalization, log-transformation, ...), then a differential expression analysis is done (DESeq2, edgeR, ...), and afterwards a pattern mining approach (e.g. clustering) is applied. For that last step, a feature selection method is used. A common example seems to be the rowVars function from the genefilter R package:

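library(genefilter)  # provides rowVars(); dataset is assumed to be a numeric genes x samples matrix
topVarGenes <- head(order(rowVars(dataset), decreasing = TRUE), 50)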

I have also seen other approaches, e.g. applying InformationGain, ReliefF, etc., which are well-established methods. I was wondering, however, why the results from the differential expression analysis are not used for feature selection, as originally suggested here. Or are they used, but just poorly documented? What is the state of the art here?
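
To illustrate the contrast being asked about, here is a rough base-R sketch that selects the top 50 genes in two ways: unsupervised (highest variance, as with rowVars) and supervised (smallest p-value from a per-gene test). Everything in it is an assumption made for illustration: the data are simulated log-scale values standing in for voom or vst output, and a plain Welch t-test stands in for a real differential expression method such as limma, edgeR, or DESeq2.

```r
set.seed(42)
n_genes   <- 500
n_samples <- 20
group <- factor(rep(c("ctrl", "case"), each = n_samples / 2))

# Simulated log-scale expression (stand-in for voom/vst-transformed data)
expr <- matrix(rnorm(n_genes * n_samples, mean = 8, sd = 1), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
# Make the first 10 genes truly differential in the "case" group
expr[1:10, group == "case"] <- expr[1:10, group == "case"] + 3

k <- 50

# Unsupervised selection: top-k genes by variance across all samples
gene_var <- apply(expr, 1, var)
top_var  <- names(sort(gene_var, decreasing = TRUE))[seq_len(k)]

# Supervised selection: top-k genes by Welch t-test p-value between groups
pvals <- apply(expr, 1, function(x) {
  t.test(x[group == "case"], x[group == "ctrl"])$p.value
})
top_de <- names(sort(pvals))[seq_len(k)]

# Number of genes picked by both strategies
length(intersect(top_var, top_de))
```

One design point worth keeping in mind: the supervised ranking uses the group labels, so if the selected genes feed a classifier, the selection step should be repeated inside each cross-validation fold to avoid optimistic bias. The variance filter ignores the labels and does not have that problem.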