I have 7000 gene entries with numeric attributes (time, p-value, ...) in an Excel file, and I want to select a small set of genes (20-30) that are the most important based on those attributes.
gene  time  p-value
x     8     0.05
z     4     0.048
g     24    0.06
y     48    0.07
My code:
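import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the gene table; every column except 'gene' is used as a feature
myData = pd.read_csv('D:/python/gene.csv', index_col=False)
X = myData.drop(['gene'], axis=1)
Y = myData['gene']  # each gene name is treated as its own class
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=2000, random_state=42)
model.fit(X_train, Y_train)
prediction_test = model.predict(X_test)
print(prediction_test)
print("Accuracy = ", metrics.accuracy_score(Y_test, prediction_test))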
I tried to get the result using the scikit-learn library in Python, but there are two problems:
- The number of classes is too high (each gene is treated as its own class).
- I could not get the small set of genes at the end of the process.
Could you guide me on how to get a small set of important genes using a random forest classifier?
You should include your code.
I have included it.
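A note on the setup above, offered as a sketch and not a definitive answer: a fitted RandomForestClassifier exposes feature_importances_ with one value per feature column (here time, p-value, ...), so it ranks attributes, not rows. With every gene as its own class, the model therefore has no direct way to return a shortlist of genes. If the goal is simply the 20-30 genes that score best on an attribute, one option is to rank the rows directly. The example below assumes that a smaller p-value means a more important gene; that criterion, the helper name top_genes, and the small in-memory table (built from the sample rows in the question) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for gene.csv, built from the sample rows in the question.
genes = pd.DataFrame({
    "gene": ["x", "z", "g", "y"],
    "time": [8, 4, 24, 48],
    "p-value": [0.05, 0.048, 0.06, 0.07],
})


def top_genes(df, n=30, by="p-value"):
    """Return the n rows with the smallest value in column `by`.

    Assumption: smaller values of `by` mean a more important gene.
    """
    return df.nsmallest(n, by).reset_index(drop=True)


# Keep the 2 best genes from the 4-row sample; use n=20 to n=30 on the full table.
top = top_genes(genes, n=2)
print(top)
```

On the real data, `genes` would come from `pd.read_csv(...)` as in the question, and `by` can be any numeric column that reflects what "important" means for the analysis.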