Hi,
I have 930 RNA-seq samples covering 4 conditions. The expression values were produced with RSEM. I am doing the following steps:
1- Picking 75 genes of interest for the 930 samples.
2- Feeding that table into an SVM to classify the 4 conditions based on those genes. (NOTE: 75% training set, 25% test set)
3- Result: 100% true positives. NOTE: even if I decrease the number of features (genes) from 75 to 25, it gives the same result.
Does anyone recognize this problem? Can an SVM be used for multi-class classification on such data?
The gene expression values range from 0 to 12000 or even more.
If code is required, let me know.
UPDATE
#NOTE: Here is my dataset structure. Please also note the SC row, which holds a number assigned to each group of samples.
       Sample1  Sample2  Sample3  ....
GeneA  234      2324     811      4     23    0
GeneB  .        .        .        .     .     .
.
.
.
SC     1        1        1        2     2     2
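library(caret)  # createDataPartition, trainControl, train, confusionMatrix
set.seed(1)     # arbitrary seed so the split and the CV folds are reproducible
# Transpose so that samples are rows and genes (plus SC) are columns
x <- data.frame(t(data_set))
# Make the class label a factor once, so training and testing share the same levels
x$SC <- factor(x$SC)
intrain <- createDataPartition(y = x$SC, p = 0.7, list = FALSE)
training <- x[intrain, ]
testing <- x[-intrain, ]
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_Linear <- train(SC ~ ., data = training, method = "svmLinear", trControl = trctrl,
                    preProcess = c("center", "scale"), tuneLength = 10)
svm_Linear
test_prediction <- predict(svm_Linear, newdata = testing)
# confusionMatrix() needs both arguments to be factors with the same levels
confusionMatrix(test_prediction, testing$SC)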
UPDATE
Edit: "RMA" -> "RSEM"
Thanks for any help
Yes, code and a minimal dataset are required to help you ;)
Thanks. Please see the Update :)
If I interpret correctly, you are implying that you get 100% accuracy on the test data. There is no problem as such in running an SVM on RNA-seq read counts, but the results you seem to get are not believable. It is hard to comment without looking at your code snippet.
It seems that some overfitting is happening and that results on the training data itself are being reported. Further, it is advisable to normalise the raw counts using VST or some other log-based transformation.
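To make the log-based transformation concrete, here is a minimal sketch in base R. It swaps in a plain log2(x + 1) rather than a true VST (which packages such as DESeq2 provide). The toy matrix is an assumption for illustration only: the GeneA row reuses the numbers from the question, while the GeneB values and the sample names are made up.

```r
# Toy count matrix: genes in rows, samples in columns.
# GeneA is taken from the question; GeneB is invented for illustration.
counts <- matrix(c(234, 2324, 811, 4, 23, 0,
                   0, 15, 12000, 7, 90, 310),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("GeneA", "GeneB"), paste0("Sample", 1:6)))

# log2(x + 1): the pseudocount of 1 keeps zero counts defined (log2(1) = 0)
# and compresses the 0..12000 range to roughly 0..14.
log_counts <- log2(counts + 1)

# Samples in rows, as caret expects. The class label is attached after the
# transformation so that it is never log-transformed itself.
x <- data.frame(t(log_counts))
x$SC <- factor(c(1, 1, 1, 2, 2, 2))
```

Transforming only the gene rows and adding SC afterwards avoids accidentally treating the class label as an expression value.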
How did you normalize the RNA-seq data? What are your 5x cross-validation results?
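In case it helps, this is roughly how the per-fold numbers could be pulled out of caret, reading "5x" as 5-fold cross-validation. It is only a sketch: the built-in iris data stands in for the expression table (Species plays the role of SC), the seed is arbitrary, and it assumes the caret and kernlab packages are installed.

```r
library(caret)

set.seed(1)  # arbitrary seed so the folds are reproducible

# 5-fold cross-validation: each sample is held out exactly once.
ctrl <- trainControl(method = "cv", number = 5)

# Linear SVM, centred and scaled as in the question's code.
cv_fit <- train(Species ~ ., data = iris, method = "svmLinear",
                trControl = ctrl, preProcess = c("center", "scale"))

# One row per held-out fold, with Accuracy and Kappa for that fold.
print(cv_fit$resample)

# Accuracy averaged over the five folds.
print(mean(cv_fit$resample$Accuracy))
```

If every fold reports an accuracy of exactly 1 on the real data as well, that would be worth reporting alongside the test-set confusion matrix.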
@kristoffer, the data were normalized using the RMA algorithm.
Please see the update and let me know what you think about the code :)