Question

correlation based supervised feature selection of gene expression data

7.1 years ago

izsyed16 ▴ 20

I want to perform filter-based feature selection on gene expression data (genes, i.e. features, along rows and samples along columns). How can I compute the feature-class correlation when the samples fall into two classes, healthy and tumor, that is, when the class is categorical and the features are numeric? My data looks like this:

             sample1  sample2   sample3   sample4
gene 1       0.986    1.233     3.4556    1.6778
gene 2       2.3345   9.7865    8.1234    0.5656
gene 3       3.5677   9.563     2.344     1.987
...          ...      ...       ...       ...
gene 1000    1.455    8.765     6.7788    9.877
class        healthy  healthy   tumor     tumor

For correlation-based feature selection, should I consider only the tumor class for the heuristic evaluation, or both classes? I am trying to apply the technique presented in the following paper: Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning

gene expression data feature selection correlation • 1.8k views

ADD COMMENT • link updated 7.1 years ago by Kevin Blighe 88k • written 7.1 years ago by izsyed16 ▴ 20

Is this RNA-seq or microarray data?

This is very easy to do for RNA-seq with packages like edgeR or DESeq2 (or limma for microarray data). I'd really recommend them over trying to hack something together yourself.

ADD REPLY • link 7.1 years ago by jared.andrews07 ★ 18k

It's microarray data. I am trying to understand the logic behind this kind of feature selection.

ADD REPLY • link 7.1 years ago by izsyed16 ▴ 20

Hi izsyed16, I edited your question (above) in order to make the appearance better. If you paste code or output from a command, you should highlight it and click the '101 010' button.

As Jared implies, there are easier ways of working with this data if you're just looking for differences between tumour and normal. As this is microarray data (normalised, I assume?), then opt for limma.

The CFS algorithm to which you link was essentially the PhD thesis of Mark Hall in New Zealand (careful, large PDF). There are implementations of it in R and Java, but I have not tried them:

CfsSubsetEval - Java
cfs - R
FSelector - R

Why not perform it separately on tumour and normal, compare the results, and then take it from there?

Kevin

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

Thank you so much, Kevin Blighe, for making my question's appearance better. I am using the 'array mining tool' for CFS, and its output is the top selected genes, but I want to understand how this tool computes the feature-class correlation. Ideally I would like an example in which feature selection has been performed manually for 5 to 6 genes, so that I can understand it well.

ADD REPLY • link 7.1 years ago by izsyed16 ▴ 20

Hi friend, I honestly think that the best thing to do is look at the thesis of Mark Hall and then follow the formulae one by one. That is the best way to learn.
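For orientation, the central formula in Hall's work is the merit heuristic for a feature subset S containing k features. The symbol names below are a notational choice made here, not necessarily the exact notation used in the thesis:

```latex
\mathrm{Merit}_S \;=\; \frac{k\,\overline{r_{cf}}}{\sqrt{\,k + k(k-1)\,\overline{r_{ff}}\,}}
```

Here \overline{r_{cf}} is the mean feature-class correlation over the features in S, and \overline{r_{ff}} is the mean feature-feature correlation over all pairs of features in S. The numerator rewards subsets that predict the class, and the denominator penalises redundancy among the selected features.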

From what I can see, CFS (in a gene expression scenario) would aim to find a set of genes that are highly correlated with your endpoint of interest but not correlated with each other. Principal components analysis aims to do something similar but produces new features (principal components, defined by the eigenvectors of the covariance matrix) based on variance and covariance; these new features are likewise uncorrelated with each other. Both are attempting to best define a dataset but are using different metrics.
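To make the CFS idea concrete, here is a minimal Python sketch on a handful of genes. It rests on several assumptions: the class is coded as 0 (healthy) and 1 (tumor), correlations are absolute Pearson values (for a 0/1 class this is the point-biserial correlation), and the search is a plain greedy forward selection. Hall's work, as far as I can tell, uses symmetrical uncertainty on discretised features when the class is categorical, so treat Pearson here as a simplification. The function names `cfs_merit` and `cfs_forward` and all expression values are hypothetical. Note that the feature-class correlation is computed across samples from both classes; with tumor samples alone the class label would be constant and the correlation undefined.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Merit of a gene subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    X: samples x genes array, y: 0/1 class labels, subset: list of column indices.
    """
    k = len(subset)
    # Mean absolute gene-class correlation (Pearson against the 0/1 labels).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute gene-gene correlation over all pairs in the subset.
    corr = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    r_ff = corr[np.triu_indices(k, 1)].mean()
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(X, y, tol=1e-9):
    """Greedy forward search: add the gene that raises the merit most, stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cfs_merit(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best + tol:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Hypothetical toy values: 5 genes (rows) x 6 samples (columns), laid out as in the question.
expr = np.array([
    [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],  # gene 1: higher in tumor
    [1.1, 1.3, 1.0, 3.0, 3.0, 3.2],  # gene 2: nearly redundant with gene 1
    [2.0, 2.6, 1.7, 2.1, 2.5, 1.8],  # gene 3: similar in both classes
    [5.0, 4.1, 4.6, 2.2, 3.0, 2.4],  # gene 4: lower in tumor
    [0.5, 0.9, 0.7, 0.8, 0.6, 1.0],  # gene 5: small difference between classes
])
labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = healthy, 1 = tumor

X = expr.T  # transpose so that samples are rows and genes are columns
selected, merit = cfs_forward(X, labels)
print("selected genes (1-based):", [j + 1 for j in selected])
print("subset merit:", round(float(merit), 4))
```

Printing `cfs_merit(X, labels, [0])` and `cfs_merit(X, labels, [0, 1])` side by side is a quick way to see how adding a gene that is highly correlated with one already selected changes the merit compared with the single gene alone.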

There are countless methods that aim to derive a bunch of variables that can best define a dataset. This enters into the realm of factor analysis and clustering.

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k