Could you suggest a suitable feature selection method for mixed-type variable data?
7.6 years ago
morovatunc ▴ 560

Hi,

We are working on cancer mutation data, and we found that mutations are enriched in the regions where our TF binds. Since we couldn't find a strong reason for this enrichment, we want to use feature selection methods.

We thought of a model such as:

Y (number of mutations in these regions) ~ DHS sites + histone marks + other TF binding events (such as CTCF, EP300, etc.) + RNA-seq reads

So in this matrix, we will have a single row for each binding event of our TF, and each variable will be either categorical (such as DHS site) or numerical (such as RNA-seq reads).

I have seen that people have applied random forest algorithms to predict mutation occurrence in specific regions. But our aim is not to predict anything, but simply to ask: "What is the cause of the mutation occurrence?" Therefore, I want to separate my data into two subsets (train vs. test).

Please forgive my ignorance of the terminology, and consider me a frustrated grad student.

Best regards,

Tunc.

machine learning • 1.7k views

There are many ways to do this. Random forests are a good choice: after training, you can look at the "variable importance", which ranks the variables of your model according to their contribution to the prediction. You can check the variable importance section of the Caret package documentation.
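
As a rough illustration of the variable-importance idea, here is a minimal Python sketch that uses scikit-learn in place of Caret. The data are simulated and the column names (`dhs`, `ctcf`, `h3k27ac`, `rnaseq`) are hypothetical stand-ins for the real annotations, so treat it as a sketch under those assumptions rather than a recipe for the actual dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500  # one row per TF binding event (simulated)

# Hypothetical mixed-type predictors.
dhs = rng.integers(0, 2, size=n)                 # categorical: overlaps a DHS site (0/1)
ctcf = rng.integers(0, 2, size=n)                # categorical: CTCF co-binding (0/1)
h3k27ac = rng.normal(size=n)                     # numeric: histone mark signal
rnaseq = rng.poisson(20, size=n).astype(float)   # numeric: RNA-seq read count

# Simulated mutation counts that depend only on dhs and h3k27ac.
rate = np.exp(0.8 * dhs + 0.5 * h3k27ac)
y = rng.poisson(rate)

feature_names = ["dhs", "ctcf", "h3k27ac", "rnaseq"]
X = np.column_stack([dhs, ctcf, h3k27ac, rnaseq])

# min_samples_leaf > 1 keeps the trees from chasing pure noise.
forest = RandomForestRegressor(n_estimators=300, min_samples_leaf=10, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per column, summing to 1.
importance = dict(zip(feature_names, forest.feature_importances_))
ranking = sorted(importance, key=importance.get, reverse=True)
for name in ranking:
    print(f"{name}: {importance[name]:.3f}")
```

One caveat for mixed-type data: impurity-based importances tend to favor numeric variables with many distinct values over binary ones, so it may be worth comparing against a permutation-based importance before drawing conclusions.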

Another choice is lasso regression, which tries to set the coefficients of non-important variables to zero. One thing to consider is normalizing your variables if they are on different scales, so that the resulting coefficients are comparable. There are some good tutorials on lasso, for example here and here.
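
To make the scaling point concrete, here is a small Python sketch with scikit-learn, again on simulated data with hypothetical column names. The response `y` stands in for something like a log-transformed mutation count, and choosing the penalty by cross-validation with `LassoCV` is an assumption of this example, not something prescribed above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Hypothetical predictors on very different scales.
dhs = rng.integers(0, 2, size=n).astype(float)        # binary 0/1
h3k27ac = rng.normal(size=n)                          # roughly unit scale
rnaseq = rng.normal(loc=1000.0, scale=300.0, size=n)  # hundreds to thousands
noise_feature = rng.normal(size=n)                    # unrelated to y

# Simulated response that depends only on dhs and h3k27ac.
y = 2.0 * dhs + 1.5 * h3k27ac + rng.normal(scale=0.5, size=n)

names = ["dhs", "h3k27ac", "rnaseq", "noise_feature"]
X = np.column_stack([dhs, h3k27ac, rnaseq, noise_feature])

# Standardize first so the L1 penalty treats every column equally,
# then let cross-validation pick the penalty strength.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

# Coefficients are on the standardized scale, so they are comparable.
coefs = dict(zip(names, model.named_steps["lassocv"].coef_))
selected = [name for name, c in coefs.items() if abs(c) > 1e-8]
for name, c in coefs.items():
    print(f"{name}: {c:+.3f}")
print("non-zero:", selected)
```

Categorical variables with more than two levels would need to be one-hot encoded before this step. The variables that matter should keep clearly non-zero coefficients, while unrelated ones are shrunk to zero or close to it.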

Hope it helps.


@Sirus thank you very much for your comment. The part about dividing the data into train and test sets confuses me a lot. Can I only train on my data and not do any prediction? Like I said in the question, I don't want to predict anything. Is this possible with random forest?

7.6 years ago
Sirus ▴ 820

@morovatunc, to avoid over-fitting, you can use all your data with, for example, 10-fold cross-validation (the Caret package can do that for you). Then you'll get your variable importance. Theoretically, a signal that is truly important should be important in any subset of the genome, so a 10-fold CV will help eliminate some of the noise.
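
One way to read this suggestion, sketched in Python with scikit-learn rather than Caret: fit a forest on each of the 10 training folds, record the variable importances every time, and check that the ranking is stable across folds. The data and column names below are simulated and hypothetical, and averaging importances over folds is one interpretation of the advice, not the only one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 600

# Hypothetical mixed-type predictors (simulated).
dhs = rng.integers(0, 2, size=n).astype(float)
ctcf = rng.integers(0, 2, size=n).astype(float)
h3k27ac = rng.normal(size=n)
rnaseq = rng.normal(size=n)

# Simulated response that depends only on dhs and h3k27ac.
y = 2.0 * dhs + 1.5 * h3k27ac + rng.normal(scale=0.5, size=n)

feature_names = ["dhs", "ctcf", "h3k27ac", "rnaseq"]
X = np.column_stack([dhs, ctcf, h3k27ac, rnaseq])

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_importances = []
fold_scores = []
for train_idx, test_idx in kfold.split(X):
    forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
    forest.fit(X[train_idx], y[train_idx])
    fold_importances.append(forest.feature_importances_)
    # R^2 on the held-out fold: a check that the model found real signal.
    fold_scores.append(forest.score(X[test_idx], y[test_idx]))

fold_importances = np.array(fold_importances)   # shape: (10 folds, 4 features)
mean_importance = fold_importances.mean(axis=0)
sd_importance = fold_importances.std(axis=0)

for name, m, s in zip(feature_names, mean_importance, sd_importance):
    print(f"{name}: {m:.3f} +/- {s:.3f}")
print(f"mean held-out R^2: {np.mean(fold_scores):.3f}")
```

The held-out score is only there as a sanity check: if the forest cannot predict unseen binding events at all, its importance ranking is unlikely to mean much, even when prediction is not the goal.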
