Greetings,
I have a question that I am still researching, but in the meantime I would like to gather some feedback on it. Would you care to speculate or hypothesize?
Situation:
I would like to classify a set of proteins as either belonging to a group or not, using machine learning techniques. Pretty straightforward so far. I have downloaded proteins from UniProt, for example protein-X vs. not protein-X. As one would expect, the results include many fragments (length < 50 AA) alongside the full protein sequences.
Question:
Would you be inclined to remove (or not remove) the protein-X fragments (length < 50 AA) from the superset of proteins? Do the protein-X fragments represent the category of proteins being investigated or not?
I would appreciate your insights,
The straightforward answer is that there’s no straightforward answer. Biology isn’t a fan of rules, generally. So what goes for one protein family (X) might not hold for protein family Y. Short fragments could be meaningless noise/junk for one type of protein (especially if it’s on the larger side perhaps), but could be meaningful in other contexts.
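Since it may depend on the family, one option is to keep both versions of the data set and compare results with and without the short sequences. Below is a minimal sketch of that split, assuming the download is a plain FASTA file and using the 50 AA cutoff from the question. The function names and the toy sequences are made up for illustration and are not from any particular library.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 50  # cutoff from the question: fragments are shorter than 50 AA

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Record], min_length: int = MIN_LENGTH
) -> Tuple[List[Record], List[Record]]:
    """Separate records into (full_length, fragments) by sequence length."""
    full_length: List[Record] = []
    fragments: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            full_length.append((header, seq))
        else:
            fragments.append((header, seq))
    return full_length, fragments


# Toy input (hypothetical sequences): one 60 AA entry and one 20 AA entry.
example_fasta = (
    ">seq_full\n" + "ACDEFGHIKL" * 6 + "\n"
    ">seq_frag\n" + "ACDEFGHIKL" * 2 + "\n"
)

full_length, fragments = split_by_length(parse_fasta(example_fasta))
print(len(full_length), "full-length;", len(fragments), "fragment(s)")
```

Training once on `full_length` alone and once on `full_length + fragments` would show whether the fragments help or hurt for this particular family, rather than deciding up front.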
Why machine learning? Why not use profile HMMs?
I am using 10 machine learning techniques that were discussed in this paper. My goal is to compare and contrast the algorithms using a standardized data set. I am interested in finding which method(s) work best and in which situations. At this point, what the proteins are seems incidental.
What kind of classification are you thinking about? Does your set consist of full-length proteins (irrespective of size)? Proteins can have multiple domains, and if your classification is at that level then you may be able to use the entire set.
I intend to look at several different classification methods: KNN, SVM, decision trees, random forests, etc. I have not chosen a final candidate yet. This work is exploratory so far. I plan to use the caret package in R to test several algorithms. The caret package, if you are not familiar with it, provides a consistent interface for running many models quickly, almost at once.
What are you trying to classify proteins based on? "belonging to a group" doesn't really mean anything.
Not being an ML person, I am thinking of the resulting classification in biological terms. Are you interested in classifying the proteins at the domain, structure, or function level? If that makes sense in ML terms.
@genomax I think you have hit upon my main question. I am curious whether the protein fragments carry the domain information that the full-length protein has, and how misleading adding the fragments will be when determining the classifications. Will the fragments give misleading classifications? I am also concerned that I will not be able to tease out which attributes (secondary, tertiary, quaternary) are dominant. Or is this a unique and interesting question?
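One way I could check whether the fragments are misleading is to report accuracy separately for fragments and full-length sequences, instead of a single overall number. Here is a rough sketch of that bookkeeping, assuming each prediction is stored as (sequence length, true label, predicted label). The function name and the toy predictions are invented for illustration and are not real results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Prediction = Tuple[int, str, str]  # (sequence_length, true_label, predicted_label)


def accuracy_by_group(
    predictions: Iterable[Prediction], min_length: int = 50
) -> Dict[str, float]:
    """Return accuracy for 'fragment' (< min_length) and 'full_length' groups."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for length, truth, predicted in predictions:
        group = "fragment" if length < min_length else "full_length"
        total[group] += 1
        if truth == predicted:
            correct[group] += 1
    # Only groups that actually occur in the input appear in the result.
    return {group: correct[group] / total[group] for group in total}


# Made-up predictions, purely to exercise the function.
toy_predictions = [
    (320, "X", "X"),
    (410, "not_X", "not_X"),
    (275, "not_X", "X"),
    (35, "X", "not_X"),
    (48, "X", "X"),
    (22, "not_X", "X"),
]

per_group = accuracy_by_group(toy_predictions)
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

If the fragment group scores consistently worse across the algorithms, that would suggest the fragments lack the information the classifiers rely on. The same grouping could be applied to whatever metric the caret runs report.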
Could you please explain this statement:
What did you download from Uniprot, and how is this already classified in a binary manner?