I want to run a binary SVM classifier on a dataset of about 10,000 small molecule drugs. The classifier should determine which of these drugs are structural analogues of a specific drug. Each instance (drug) will have five attributes, each containing some type of molecular descriptor.
For the training set, should I train the classifier on a dataset of drugs that are already known to be structural analogues of the drug? More specifically, a dataset with equal numbers of known structural analogues and non-analogues? And do I then test my classifier on the 10,000 small molecule drugs which have not yet been determined to be analogues?
You had better start by defining the question and then see which tool (or tools) is the most appropriate. You might find that unsupervised learning (clustering) works better for you.
That said, the input of an SVM (the training set) is a table of features, with one row per observation, and a binary vector specifying the group of each observation. The number of rows in the table must equal the length of the vector. You can train your classifier with this data. People usually run a "leave-X-out" cross-validation to estimate the performance of the classifier: the procedure uses part of the training set to train the classifier and tests it on the part left aside.
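To make the feature table, the label vector and the cross-validation step concrete, here is a minimal sketch in Python. It assumes scikit-learn, which the question does not mention, and it uses randomly generated numbers in place of real molecular descriptors, so the scores it prints say nothing about your data. The names `X_train` and `y_train`, the RBF kernel, the scaling step and the choice of 5 folds are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic placeholder data (NOT real descriptors): 100 "analogues" and
# 100 "non-analogues", each described by 5 numeric features.
n_per_class = 100
X_analogues = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 5))
X_others = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 5))

# Feature table: one row per molecule. Label vector: 1 = analogue, 0 = not.
X_train = np.vstack([X_analogues, X_others])
y_train = np.array([1] * n_per_class + [0] * n_per_class)

# The table and the vector must describe the same observations.
assert X_train.shape[0] == y_train.shape[0]

# Scale the features, then fit an SVM (kernel and C are arbitrary choices here).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation: train on 4/5 of the labelled data, test on the
# remaining 1/5, and repeat so that every fold is held out once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Putting the scaler inside the pipeline means it is refitted on the training part of each fold, so the held-out part does not leak into the preprocessing.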
If you are convinced that your classifier is good, you can then predict the results for the 10,000 small molecules you are testing.
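The prediction step could look roughly like the sketch below, again assuming scikit-learn and again using made-up random numbers instead of real descriptors. `X_unlabelled` is a hypothetical stand-in for the descriptor table of the 10,000 molecules, and it must contain the same five features, in the same order, as the training table.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic labelled training set (placeholder values only).
X_train = np.vstack([
    rng.normal(1.0, 1.0, size=(100, 5)),   # pretend analogues
    rng.normal(-1.0, 1.0, size=(100, 5)),  # pretend non-analogues
])
y_train = np.array([1] * 100 + [0] * 100)

# Synthetic stand-in for the 10,000 molecules with unknown status.
X_unlabelled = rng.normal(0.0, 1.5, size=(10_000, 5))

# Fit on ALL labelled molecules, then predict the unlabelled ones.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

predicted = model.predict(X_unlabelled)         # 1 = predicted analogue
margin = model.decision_function(X_unlabelled)  # signed distance to the boundary

print("predicted analogues:", int(predicted.sum()))
print("largest margin:", float(margin.max()))
```

The values in `margin` can be used to rank the molecules, so that the most confidently predicted analogues are checked first.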
Thanks! So would I have to manually specify the group of each observation?
You should already have this information. The groups are: analogues and non-analogues. If you don't have these labels for a set of molecules, then you don't have a training set.