I want to perform filter-based feature selection on gene expression data, with genes (features) along rows and samples along columns. How can I compute the feature-class correlation when the class is categorical (two classes, healthy and tumor) and the features are numerical? My data looks like this:
            sample1   sample2   sample3   sample4
gene 1      0.986     1.233     3.4556    1.6778
gene 2      2.3345    9.7865    8.1234    0.5656
gene 3      3.5677    9.563     2.344     1.987
...
gene1000    1.455     8.765     6.7788    9.877
class       healthy   healthy   tumor     tumor
For correlation-based feature selection, should I consider only the tumor class for the heuristic evaluation, or both classes? I am trying to apply the technique presented in the following paper: Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning.
Is this RNA-seq or microarray data?
This is very easy to do for RNA-seq with packages like edgeR or DESeq2 (or limma for microarray data). I'd really recommend them over trying to hack something together yourself.
It's microarray data. I am trying to understand the logic behind this kind of feature selection.
Hi izsyed16, I edited your question (above) in order to make the appearance better. If you paste code or output from a command, you should highlight it and click the '101 010' button.
As Jared implies, there are easier ways of working with this data if you're just looking for differences between tumour and normal. As this is microarray data (normalised, I assume?), then opt for limma.
The CFS algorithm to which you link was essentially the PhD thesis of Mark Hall in New Zealand (careful, large PDF). There are implementations of it in R and Java, but I have not tried them.
Why not perform it separately on tumour and normal, compare the results, and then take it from there?
Kevin
Thank you so much, Kevin Blighe, for improving the appearance of my question. I am using the 'array mining tool' for CFS, and its output is a list of the top selected genes, but I want to understand how the tool computes the feature-class correlation. Ideally I would like an example in which feature selection is worked through manually for 5 or 6 genes, so that I can follow each step.
Hi friend, I honestly think that the best thing to do is look at the thesis of Mark Hall and then follow the formulae one by one. That is the best way to learn.
From what I can see, CFS (in a gene expression scenario) would aim to find a set of genes that are highly correlated with your endpoint of interest but not correlated with each other. Principal components analysis aims to do something similar but produces new features (principal components, defined by the eigenvectors of the covariance matrix) based on variance and covariance, and these new features are likewise uncorrelated with each other. Both are attempting to best define a dataset but are using different metrics.
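To make that idea concrete, here is a minimal Python sketch using the four genes shown in the question. It is a simplification under stated assumptions, not the thesis implementation or the method of any particular tool: the class is encoded as healthy = 0 and tumor = 1, and the absolute Pearson correlation is used for both feature-class and feature-feature correlation (with a 0/1 class this equals the point-biserial correlation, so both classes are needed to compute it). As far as I understand, Hall's own version can discretize features and use symmetrical uncertainty, and searches with best-first rather than the plain greedy forward search used here. The subset score follows the CFS merit, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The function names are my own.

```python
import numpy as np

# Expression values copied from the example table in the question
# (rows = genes, columns = sample1..sample4). Four samples is far too
# few for real conclusions; this only illustrates the arithmetic.
genes = ["gene 1", "gene 2", "gene 3", "gene1000"]
X = np.array([
    [0.986, 1.233, 3.4556, 1.6778],
    [2.3345, 9.7865, 8.1234, 0.5656],
    [3.5677, 9.563, 2.344, 1.987],
    [1.455, 8.765, 6.7788, 9.877],
])
# Assumption: class labels encoded numerically, healthy = 0, tumor = 1.
y = np.array([0, 0, 1, 1])


def abs_pearson(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


def cfs_merit(X, y, subset):
    """CFS merit of a gene subset (list of row indices into X)."""
    k = len(subset)
    # Mean feature-class correlation over the subset.
    r_cf = np.mean([abs_pearson(X[i], y) for i in subset])
    if k == 1:
        return r_cf
    # Mean feature-feature correlation over all distinct pairs.
    r_ff = np.mean([
        abs_pearson(X[i], X[j])
        for pos, i in enumerate(subset)
        for j in subset[pos + 1:]
    ])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def forward_select(X, y):
    """Greedy forward search: add the gene that most improves the merit."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[0]))
    while remaining:
        score, idx = max((cfs_merit(X, y, selected + [i]), i) for i in remaining)
        if score <= best:
            break  # no remaining gene improves the merit
        selected.append(idx)
        remaining.remove(idx)
        best = score
    return selected, best


if __name__ == "__main__":
    for i, name in enumerate(genes):
        print(name, "feature-class |r| =", round(abs_pearson(X[i], y), 3))
    chosen, merit = forward_select(X, y)
    print("selected:", [genes[i] for i in chosen], "merit =", round(merit, 3))
```

Running it prints the feature-class correlation for each gene and the subset picked by the greedy search, so you can check each number by hand against the formula. A gene is only added when it raises the merit, which is how redundancy between correlated genes gets penalised.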
There are countless methods that aim to derive a bunch of variables that can best define a dataset. This enters into the realm of factor analysis and clustering.