Feature Selection Methods for RNA-seq Data?
11.2 years ago
antass ▴ 30

I am working with RNAseq data - raw counts from HTSeq as well as RPKM from Cufflinks - and want to apply feature selection. For microarray data, I would usually look into using linear modeling, random forest, or R packages like glmnet.

Are there any feature selection experts out there who could recommend RNA-seq specific FS methods, preferably implemented in R?

rna-seq • 5.8k views
11.2 years ago

Linear models are possible using edgeR and DESeq2, among others. Random forests should still be applicable. If you use something like voom (limma) or vst (DESeq) to transform to more bell-shaped data, many other approaches are probably applicable, as well.
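As a rough illustration of the transformation step mentioned above, here is a minimal sketch using voom on simulated counts. The count matrix, the sample names, and the two-group design are made up for the example; real HTSeq counts would take the place of `counts`.

```r
library(limma)
library(edgeR)

set.seed(1)
n_genes   <- 500
n_samples <- 12
group <- factor(rep(c("ctrl", "case"), each = 6), levels = c("ctrl", "case"))

# Simulated raw counts (genes x samples), standing in for an HTSeq count matrix
counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))

dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)        # TMM normalization factors

design <- model.matrix(~ group)
v <- voom(dge, design)             # log2-CPM values plus precision weights

# Samples-by-genes matrix of log2-CPM values
x <- t(v$E)
dim(x)
```

A matrix like `x` (samples in rows, genes in columns) is the shape that packages such as glmnet or randomForest generally expect as predictors. Note that those methods would not use the voom precision weights unless they are passed explicitly.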


I was planning on using the voom transform and edgeR. I haven't used DESeq2 or the vst transform - I'll look into those. Thanks Sean!


Probably obvious, but just for posterity's sake: one would not want to use voom in concert with edgeR, since edgeR needs raw counts. You could process with voom and then use limma, though; alternatively, you could use the raw counts from HTSeq as direct input to edgeR.
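To make the two routes concrete, here is a small sketch on simulated data. The counts and the two-group design are invented for illustration, and the quasi-likelihood test is just one of the edgeR options, picked as an assumption.

```r
library(limma)
library(edgeR)

set.seed(2)
group <- factor(rep(c("ctrl", "case"), each = 5), levels = c("ctrl", "case"))

# Simulated raw counts (genes x samples), standing in for HTSeq output
counts <- matrix(rnbinom(300 * 10, mu = 80, size = 4),
                 nrow = 300,
                 dimnames = list(paste0("gene", 1:300), paste0("s", 1:10)))
design <- model.matrix(~ group)

dge <- DGEList(counts = counts, group = group)
dge <- calcNormFactors(dge)

# Route 1: voom transform, then limma
v         <- voom(dge, design)
fit       <- eBayes(lmFit(v, design))
limma_tab <- topTable(fit, coef = 2, number = Inf, sort.by = "P")

# Route 2: raw counts go straight into edgeR (no voom involved)
dge       <- estimateDisp(dge, design)
qlfit     <- glmQLFit(dge, design)
qlf       <- glmQLFTest(qlfit, coef = 2)
edger_tab <- topTags(qlf, n = Inf)$table

head(limma_tab[, c("logFC", "adj.P.Val")])
head(edger_tab[, c("logFC", "FDR")])
```

Either table can then be ranked or thresholded to pick candidate genes.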


Ah, yes - I wasn't very specific. I am not using these together. I am actually developing a biomarker, so I will try and test multiple combinations of parameters and methods (that hopefully will make sense as a combo). Thanks for the reminder.

7.3 years ago

I would like to bring this topic up again. From what I have read so far in papers, on the Internet, and in Bioconductor workflows, gene expression data sets are typically preprocessed (filtering, normalization, log transformation, ...), then a differential expression analysis is run (DESeq2, edgeR, ...), and afterwards a pattern-mining approach (e.g. clustering) is applied. For that last step, a feature selection method is used. A common example seems to be the rowVars function from the genefilter R package:

library(genefilter)
# dataset: numeric matrix of transformed expression values, genes in rows
topVarGenes <- head(order(rowVars(dataset), decreasing = TRUE), 50)

I have also seen other well-established approaches, e.g. InformationGain, ReliefF, etc. I was wondering, however, why the results from the differential expression analysis are not used for feature selection, as originally suggested here. Or are they used, but just poorly documented? What is the state of the art here?
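For comparison, here is a sketch that puts the two selection styles side by side on simulated log-scale data. Everything in it is an assumption made for illustration: the matrix `expr`, the group labels, and the 20 genes that are shifted in cases. The variance filter uses base R's `apply(..., var)` in place of rowVars so that the example only needs limma.

```r
library(limma)

set.seed(3)
n_genes <- 400
group <- factor(rep(c("ctrl", "case"), each = 8), levels = c("ctrl", "case"))

# Simulated log-scale expression (genes x samples), standing in for voom/vst output
expr <- matrix(rnorm(n_genes * 16), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", 1:16)))
# Shift the first 20 genes upward in the case samples
expr[1:20, group == "case"] <- expr[1:20, group == "case"] + 3

# Unsupervised: top 50 genes by variance (ignores the group labels)
gene_var <- apply(expr, 1, var)
top_var  <- rownames(expr)[head(order(gene_var, decreasing = TRUE), 50)]

# Supervised: top 50 genes by moderated t-test p-value (uses the group labels)
design <- model.matrix(~ group)
fit    <- eBayes(lmFit(expr, design))
top_de <- rownames(topTable(fit, coef = 2, number = 50, sort.by = "P"))

# How many genes the two selections share
length(intersect(top_var, top_de))
```

One likely reason the variance filter is preferred before clustering is that it never sees the labels, whereas genes picked by a differential expression test will tend to separate the very groups they were tested on. If differential expression results are used to select features for a classifier, the selection would normally be repeated inside each cross-validation training fold to avoid an optimistic bias.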
