Hello all,
My background is in computer science and data analytics. I have learned machine learning and am trying to solve a biological problem using a support vector machine (SVM). I have a data set of amino acid sequences. In the human body there are 20 standard amino acids, and each sequence is made up of them. I have computed the composition of these sequences: the count of each individual amino acid, expressed as a percentage of the sequence. Now I have to build an SVM model using these compositions as features. Can anyone give me some idea of how to build the SVM model?
Thanks in advance
The sequences look like this:
>HMPREF9352_0002 rod shape-determining protein MreD [Streptococcus gallolyticus subsp. gallolyticus TX20005]
MIKVKFYKNKYFLLLLLFLLMLIDGQLSFLASSIFSYHLKVSSHLLLLAVLYFYHDKNKY
FMFISSLVLGGIFDIYYLNRIGLVIFLLPILVIFTSKISKNFFVSNFQTLIFYIIVLFLF
EIVGELGAILLGMTTMSMTYFIAYCFAPTLIYNILMYLIFQKVFKKVFLES
From the amino acid sequence above, I computed the amino acid composition.
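The composition step described above can be sketched in a few lines of Python. This is only an illustration, not the poster's actual code: the helper names `read_fasta` and `aa_composition` are made up for this example, and residues outside the 20 standard amino acids are assumed to be ignored when computing percentages.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def read_fasta(text):
    """Return {header: sequence} from FASTA-formatted text (hypothetical helper)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

def aa_composition(sequence):
    """Percentage of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Assumption: non-standard symbols (e.g. X) are left out of the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

fasta = """>HMPREF9352_0002 rod shape-determining protein MreD
MIKVKFYKNKYFLLLLLFLLMLIDGQLSFLASSIFSYHLKVSSHLLLLAVLYFYHDKNKY
FMFISSLVLGGIFDIYYLNRIGLVIFLLPILVIFTSKISKNFFVSNFQTLIFYIIVLFLF
EIVGELGAILLGMTTMSMTYFIAYCFAPTLIYNILMYLIFQKVFKKVFLES
"""

for header, seq in read_fasta(fasta).items():
    features = aa_composition(seq)
    print(header.split()[0])
    print(dict(zip(AMINO_ACIDS, (round(p, 1) for p in features))))
```

Each sequence then becomes one 20-number feature vector, which is the form most classifiers expect as input.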
Could you please give a better outline of the problem you are working on?
Before building an SVM model, the most important thing to know is whether an SVM is suited to the problem or whether other machine learning techniques would be more useful. Moreover, in any machine learning problem, it is very important to first get an idea of what function we are trying to learn, what the data set is like, what the limitations of the data at hand are, and which features would be best, given the nature of the data set and the function being predicted :)
It would be helpful if you could answer the following questions: What biological problem are you trying to solve? What data do you have? What outcome are you trying to predict using SVM?
Hello sir,
Thank you for replying.
I have a number of amino acid sequences, and from them I computed the composition, that is, the count of each amino acid as a percentage.
Right now I want to build an SVM model.
For example, I have 6 sequences and have computed their composition. Of those 6 input sequences, I want to take 3 as positive and 3 as negative. The training function does not decide which are positive and which are negative; the labels are given. The composition itself serves as the feature for the SVM.
You still haven't told us what the question you're trying to address is. Are you trying to classify proteins (into two or more classes, if so which ones?) or are you trying to predict some property from the sequence?
If you have only 6 sequences, machine learning approaches really do not apply. Is this just a subset of a much larger dataset?
Hello sir,
What I want to do is this: we have 6 sequences, and let's say that 3 of these are positive cases while the other 3 are negative. Based on the amino acid composition of these sequences, we need to build a model using SVM. This model would be used for predicting the class of new sequences, e.g. if we have a new sequence we should be able to predict whether it belongs to the positive or the negative class.
The sequences look like the one shown above, and from them I computed the amino acid composition.
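To make the setup above concrete, here is a minimal from-scratch sketch of a linear soft-margin SVM trained on composition features. Everything in it is an assumption made for illustration: the six sequences are invented toy data (not real proteins), the function names are hypothetical, and the training loop is plain batch subgradient descent on the hinge loss rather than the solver a real SVM library would use. For real work an established SVM package would be the better choice.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq):
    """Fraction of each standard amino acid in seq (a 20-element feature vector)."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

def decision(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=5000):
    """Linear soft-margin SVM fitted by batch subgradient descent on
    lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))). Labels are +1 / -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1.0:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, seq):
    """Return +1 or -1 for a raw amino acid sequence."""
    return 1 if decision(w, b, composition(seq)) >= 0 else -1

# Made-up toy sequences, NOT real data: 3 "positive" and 3 "negative".
positives = ["LLIVFLLAVIFLKL", "IVLLFFAVLLIVGS", "FLLVIALLIVFFTL"]
negatives = ["KKEDRKEEDKRKLS", "DEKKRREDEKKDGA", "RKEEDDKRKEDETL"]
sequences = positives + negatives
labels = [1, 1, 1, -1, -1, -1]

X = [composition(s) for s in sequences]
w, b = train_linear_svm(X, labels)

train_predictions = [predict(w, b, s) for s in sequences]
print("training predictions:", train_predictions)
print("new sequence:", predict(w, b, "MLLVIFALLVVIFL"))
```

The labels are supplied by hand, as the post describes; the training step only finds the separating weights. With six examples, a clean fit on the training set says nothing about how the model would behave on unseen sequences.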
It sounds like a homework problem: 6 sequences, 3 positive and 3 negative, is a textbook ML question. Real problems would have an odd number of each.
This is obviously an assignment.
Do you know what you are talking about? Machine learning never applies to such a ridiculously small amount of data. I think you've misunderstood the question or your intention.
Homework problems are often toy-sized to help the student see all the moving parts and handle all decisions with pencil and paper. When teaching matrix multiply, you don't give a 100x100, you give a 3x3.
The problem really is that it feels like pujapatel doesn't know where to start, how to represent her data, basic computer science stuff.
If you're doing this to learn how to build an SVM, you should look up a tutorial for e1071 on Google. If this is not for an assignment or just for learning, then I suggest you don't use SVM. If I assume your sequences are at least 50 amino acids long, and there are 20 amino acids, a conjecture used by some machine learning groups says that you will need AT LEAST 50 * log2(20) samples, which comes to roughly 216, so well over 200. And this is only to get a representative sample of your sample space. So I suggest you drop the idea of an SVM, or even a neural network, in that case. In biological terms, you need more samples to make a robust prediction :)
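Evaluating the rule of thumb quoted above directly, with the same assumed numbers (sequence length 50, alphabet of 20 amino acids), gives the following. The variable names are just for illustration, and the formula is the conjecture as stated in this thread, not an established bound.

```python
import math

sequence_length = 50  # assumed minimum sequence length from the post
alphabet_size = 20    # number of standard amino acids

# Rule of thumb from the thread: length * log2(alphabet size).
min_samples = sequence_length * math.log2(alphabet_size)
print(round(min_samples, 1))  # 216.1
```

Either way, the estimate is a couple of hundred sequences at minimum, far more than the 6 available.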