I'm trying to classify short reads into a number of bins (usually no more than 5). After looking for a while in the LIBSVM FAQ as well as in relevant papers, I think that one-class SVM classifiers may be what I'm looking for; I just need to know whether a read belongs to a bin or not. This is what a one-class classifier will tell me, right?
The problem is that after preparing the training set (I have tried 500 and 1000 vectors) and testing, classification accuracy can't get above 32% (lots of false negatives).
I noticed that there's a one-class-specific parameter named "nu" (-n switch in svm-train). I wrote a Perl script and tried different values for it (from 0.001 to 1 in 0.001 steps), but I still can't get decent accuracy...
Does anyone have more experience with such classifiers and could give me some hints, please?
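For anyone repeating the nu sweep described above, here is a minimal sketch of how the svm-train command lines could be generated. It relies on two things from the LIBSVM documentation: `-s 2` selects the one-class SVM, and nu must lie in (0, 1], where it acts as an upper bound on the fraction of training vectors treated as outliers, so large values reject many genuine reads. Everything else is an assumption for illustration: the file name `bin1.train`, the model naming scheme, the coarse nu grid, and the idea of varying the RBF kernel width (`-g`) alongside nu instead of stepping nu alone.

```python
import shlex

def one_class_commands(train_file, nus, gammas, model_prefix="bin1"):
    """Build one svm-train command line per (nu, gamma) pair.

    -s 2 : one-class SVM, -t 2 : RBF kernel, -n : nu, -g : gamma.
    File names and the model naming scheme are hypothetical.
    """
    commands = []
    for nu in nus:
        # LIBSVM rejects nu outside (0, 1] for one-class SVM.
        if not 0.0 < nu <= 1.0:
            raise ValueError(f"nu must be in (0, 1], got {nu}")
        for gamma in gammas:
            model_file = f"{model_prefix}_nu{nu}_g{gamma}.model"
            args = ["svm-train", "-s", "2", "-t", "2",
                    "-n", str(nu), "-g", str(gamma),
                    train_file, model_file]
            commands.append(" ".join(shlex.quote(a) for a in args))
    return commands

# Assumed grids: a handful of nu values and powers of two for gamma.
nus = [0.01, 0.05, 0.1, 0.2, 0.5]
gammas = [2.0 ** k for k in range(-9, 2, 2)]
commands = one_class_commands("bin1.train", nus, gammas)

for command in commands[:3]:
    print(command)
```

Each generated model would then be scored with svm-predict on held-out reads, both from the bin and from outside it, so that false negatives and false positives are counted separately.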
Are you trying to predict whether a given short read is part of any of the 5 bins, or do you already have different models for the different bins? If the former, I think this is a multi-class SVM problem rather than a one-class one. A single-class SVM only gives a two-way answer (a read is in the class or it is not).
Panos, what prevents you from creating a 6th bin of "unclassified"? Also, is it possible for a read to belong to two bins? That can happen if you go with 5 single-class SVMs, because each model accepts or rejects a read independently of the others.
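To make the overlap issue concrete, here is a small sketch of how the outputs of five independent one-class models might be combined. It assumes each model reports +1 (accepted) or -1 (rejected) for a read; the bin names and the `assign_bin` helper are made up for illustration and are not part of LIBSVM.

```python
def assign_bin(memberships):
    """Combine per-bin one-class predictions for a single read.

    memberships maps a (hypothetical) bin name to +1 if that bin's
    one-class model accepted the read, or -1 if it rejected it.
    """
    accepted = sorted(name for name, label in memberships.items() if label == 1)
    if not accepted:
        # No model claimed the read.
        return "unclassified"
    if len(accepted) == 1:
        return accepted[0]
    # Several models claimed the read; flag it instead of guessing.
    return "ambiguous:" + ",".join(accepted)

print(assign_bin({"binA": -1, "binB": 1, "binC": -1}))
print(assign_bin({"binA": -1, "binB": -1, "binC": -1}))
print(assign_bin({"binA": 1, "binB": 1, "binC": -1}))
```

Ambiguous reads are flagged rather than resolved here, since decision values from separately trained one-class models are not necessarily on a comparable scale.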
Can you clarify what the "bins" are, and maybe show a preview of what the data looks like (features, classes...)? In some problem instances, the data are not easily separable by hyperplanes.
I'm trying to predict whether a read belongs to any of 5 predefined bins. I think that it would be better to go with 5 single-class SVMs rather than one 5-class SVM, because I have the impression that a multi-class SVM would always assign a read to one of the 5 bins; it wouldn't "consider" the possibility that a read might not be a member of any of the 5 bins (i.e. leave it unclassified). Am I right?
I think, though, that in the case of multi-class SVMs, I can calculate probabilities for predictions. Is this a way of telling whether a given read cannot be classified into any of the specified bins?
I can't create such a 6th bin because my other 5 bins represent the dominant bacteria in my sample. This would mean that the 6th class would have to represent EVERY other bacterium... No, no read can belong to two (or more) bins. I hadn't thought about that! Good point! Do you think that taking the probability of the prediction (-b switch in svm-predict) into account could help me decide whether assigning a given read to some bin is significant?
Often with multi-class SVM classifiers the class with the highest score is picked as the output class. It sounds like what you want is to only allow classifications where the highest score is also greater than some threshold. All other data would be classified as "unclassifiable". That may not be available in existing packages, but it would alleviate some of the difficulties of using 5 single-class SVMs.
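The thresholding idea above can be sketched in a few lines. This assumes per-bin probability estimates are already available for each read (for example from a model trained and queried with probability estimates enabled); the function name `classify_with_reject`, the bin names, and the 0.9 cutoff are all illustrative assumptions, not values from any package.

```python
def classify_with_reject(probabilities, threshold=0.9):
    """Return the most probable bin, or "unclassified" below the cutoff.

    probabilities maps a (hypothetical) bin name to its estimated
    probability for one read; threshold is an assumed cutoff that
    would need tuning on held-out data.
    """
    best_bin = max(probabilities, key=probabilities.get)
    if probabilities[best_bin] >= threshold:
        return best_bin
    return "unclassified"

confident = {"binA": 0.95, "binB": 0.02, "binC": 0.03}
uncertain = {"binA": 0.40, "binB": 0.35, "binC": 0.25}
print(classify_with_reject(confident))
print(classify_with_reject(uncertain))
```

One caveat: multi-class probability estimates are normalized over the known bins only, so a read from an organism outside all bins can still receive a high probability for one of them. The cutoff is best chosen by testing on reads that are known not to belong to any bin.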