Problem classifying genes using Random Forest Classifier
3.0 years ago
sadaf ▴ 20

I have 7000 gene entries with their numeric attributes (such as time, ....) in an Excel file, and I want to select a small set of genes (20-30) that are the most important based on those attributes.

gene  time  p-value
x     8     0.05
z     4     0.048
g     24    0.06
y     48    0.07

My code:

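import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the gene table
myData = pd.read_csv('D:/python/gene.csv', index_col=False)

# Features: every column except the gene name
X = myData.drop(['gene'], axis=1)
# Target: the gene name itself (one class per gene)
Y = myData['gene']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=2000, random_state=42)
model.fit(X_train, Y_train)

prediction_test = model.predict(X_test)
print(prediction_test)
print("Accuracy = ", metrics.accuracy_score(Y_test, prediction_test))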

I tried to get the result using the scikit-learn library in Python, but there are 2 problems:

  • The number of classes is too high (the genes are treated as classes).
  • I could not get the small set of genes at the end of the process.

Could you guide me on how to get a small set of important genes using a random forest classifier?

R python randomforest genes classification • 1.3k views
ADD COMMENT

You should include your code.

ADD REPLY

I have included it.

ADD REPLY
3.0 years ago
Mensur Dlakic ★ 28k

I don't think you are going about it the right way. First, your time variable should probably be encoded as categorical rather than numerical. It makes sense to treat p-values as numerical, because there is a clear relationship where smaller p-values are better than larger. The same is likely not true for the time variable.
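As a rough illustration of the encoding point above, here is a minimal sketch using the four rows from the question's table. It assumes one-hot encoding via pandas is an acceptable way to make time categorical; other encodings would also work.

```python
import pandas as pd

# Toy rows copied from the question's table; the real file has about 7000 genes.
df = pd.DataFrame({
    "gene": ["x", "z", "g", "y"],
    "time": [8, 4, 24, 48],
    "p-value": [0.05, 0.048, 0.06, 0.07],
})

# Treat the time points as unordered labels rather than magnitudes.
df["time"] = df["time"].astype("category")

# One-hot encode time; p-value stays numeric.
X = pd.get_dummies(df[["time", "p-value"]], columns=["time"], dtype=float)
print(X.columns.tolist())
```

Each time point becomes its own 0/1 column, so the classifier no longer assumes that 48 is "more" than 4.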

Second, classifiers work best when classifying a relatively small number of outcomes, such as [upregulated, downregulated, unchanged]. Something similar to those is presumably the outcome in your experiment as well. Instead of having genes as features, for which the importance can be determined at the end of RF classification, you have them as the Y variable. That's why your number of classes is too high.

A couple of technical suggestions. I think you should make an effort to format the code better. Look at it yourself and consider whether you could figure out, without significant effort, where each line ends and the next begins. You are asking us to help you, but you need to do your part first. It is not that difficult: select the part of your code that you want formatted, and hit [CTRL+K]. That should take care of it, and there is a preview beneath the typing box where you can verify how it will look in the end.

It is overkill to use 2000 estimators; in most cases 100-200 is plenty. When you are done, and assuming you change the number of classes to something more tractable, you will get feature importances at the end of classification. In your case they would be in model.feature_importances_.
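A minimal sketch of what that could look like, assuming a three-class target. The data here is synthetic and the `regulation` column is a hypothetical label that does not appear in the question, so the printed importances mean nothing beyond showing the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in data; 'regulation' is a hypothetical label column.
data = pd.DataFrame({
    "time": rng.choice([4, 8, 24, 48], size=n),
    "p-value": rng.uniform(0.0, 0.1, size=n),
})
data["regulation"] = rng.choice(
    ["upregulated", "downregulated", "unchanged"], size=n
)

# Time is one-hot encoded (categorical); p-value stays numeric.
X = pd.get_dummies(data[["time", "p-value"]], columns=["time"], dtype=float)
y = data["regulation"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 200 trees instead of 2000.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

With the real data, the only change would be loading the file and using the actual outcome column as `y`.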

ADD COMMENT

I formatted the code, thanks. Yes, this is exactly the problem with my data (too many genes). Do you mean that my target should be up/down regulation and that I should treat the genes as features?

ADD REPLY

You may want to drop the genes completely, and let them be classified based on time and p-values as features (and maybe other features if you have them). From feature importance and from tree splits on those features, you should be able to get back to genes that are important. If you want to keep the genes as features, they need to be treated as categorical variables because for classification purposes there is no ordered relationship of g vs. z as opposed to g vs. y.
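One possible way to keep the genes out of the features while still being able to trace rows back to them is to move the gene names into the DataFrame index. This is only a sketch: the `regulation` labels below are made up, and ranking genes by predicted class probability is an illustrative choice, not something prescribed in this thread.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy rows; 'regulation' is a hypothetical label column not shown in the question.
df = pd.DataFrame({
    "gene": ["x", "z", "g", "y", "a", "b"],
    "time": [8, 4, 24, 48, 8, 24],
    "p-value": [0.05, 0.048, 0.06, 0.07, 0.01, 0.09],
    "regulation": ["up", "up", "down", "down", "up", "down"],
})

# Gene names go into the index: not a feature, but still attached to each row.
df = df.set_index("gene")
X = pd.get_dummies(df[["time", "p-value"]], columns=["time"], dtype=float)
y = df["regulation"]

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Class probabilities per row, labelled by gene through the preserved index.
proba = pd.DataFrame(
    model.predict_proba(X), index=X.index, columns=model.classes_
)
top_up = proba["up"].sort_values(ascending=False).head(3)
print(top_up)
```

Probabilities computed on the training rows are optimistic, so on real data a held-out split or out-of-bag estimates would be a safer basis for any ranking.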

ADD REPLY

I dropped my genes column and used up/down regulation as the target, and I got the whole list, but how can I get back to the genes that are important?

ADD REPLY

It won't be a great classifier if you have only two features - you should keep that in mind.

Relative feature importances are in your fitted classifier. Assuming you used the same code as above, they would be in model.feature_importances_.

For the rest, it should help to visualize the leaves in your decision trees and how they make decisions based on feature splits. In the example below this is illustrated on decision trees, but it should work the same for your fitted RF classifier. Once you find which time and p-value ranges contribute to the most splits, that should help you identify the most important genes that contribute to up/downregulation.
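A minimal sketch of inspecting splits in a fitted forest, using scikit-learn's `export_text` on one tree and a simple count of how often each feature is split on. The data and the "up"/"down" labels are synthetic, and time is left numeric here only to keep the printed rules short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

rng = np.random.default_rng(0)

# Synthetic stand-in data: two numeric features and a made-up binary label.
feature_names = ["time", "p-value"]
X = np.column_stack([
    rng.choice([4, 8, 24, 48], size=200),
    rng.uniform(0.0, 0.1, size=200),
])
y = np.where(X[:, 1] < 0.05, "up", "down")

model = RandomForestClassifier(
    n_estimators=100, max_depth=3, random_state=0
).fit(X, y)

# Print the split rules of the first tree in the forest.
print(export_text(model.estimators_[0], feature_names=feature_names))

# Count how often each feature is used for a split across all trees.
# Leaf nodes carry a negative feature index, so they are filtered out.
split_counts = np.zeros(len(feature_names), dtype=int)
for tree in model.estimators_:
    used = tree.tree_.feature
    used = used[used >= 0]
    split_counts += np.bincount(used, minlength=len(feature_names))
print(dict(zip(feature_names, split_counts)))
```

The thresholds printed for each split are what can then be used to filter the original table and see which genes fall on each side.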

ADD REPLY

Thanks for your effort. Are there any other methods that can be used to find the most important genes based on some features (other than SVM, RRA, RF, ...)?

ADD REPLY
