Forum: Care to speculate? Are Protein Fragments or Entire Protein Sequences useful when classifying via Machine Learning techniques?
5.8 years ago
mcc ▴ 80

Greetings,

I have a question that I am still researching, but in the meantime I would like to gather feedback on it. Would you care to speculate or hypothesize?

Situation:

I would like to classify a set of proteins as either belonging to a group or not using machine learning techniques. Pretty straightforward so far. I have downloaded proteins from Uniprot, for example, protein-X vs Not protein-X. As one would expect, the results contain many fragments (length < 50 AA) alongside the full protein sequences.

Question:

Would you be inclined to remove (or not remove) the protein-X fragments (length < 50 AA) from the superset of proteins? Do the protein-X fragments represent the category of proteins being investigated or not?
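
To try both options side by side, the split itself is easy to script. Below is a minimal Python sketch that separates a FASTA download into fragments and longer sequences at the 50 AA cut-off mentioned above. The function names (`read_fasta`, `split_by_length`) and the toy records are hypothetical, and using length alone as the rule is an assumption; any explicit fragment annotation in the source database is ignored here.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_by_length(records, min_len=50):
    """Separate records into (full_length, fragments) using a length cut-off."""
    full_length, fragments = [], []
    for header, seq in records:
        target = fragments if len(seq) < min_len else full_length
        target.append((header, seq))
    return full_length, fragments

# Toy input: made-up sequences, not real UniProt entries.
toy_fasta = StringIO("\n".join([
    ">toy_full",
    "M" + "A" * 59,
    ">toy_fragment",
    "MKTAYIAKQR",
]) + "\n")

full_length, fragments = split_by_length(read_fasta(toy_fasta))
print(len(full_length), "sequences kept;", len(fragments), "fragments set aside")
```

Keeping the fragments in a separate list, rather than discarding them, makes it possible to train with and without them and compare.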

I would appreciate your insights,

protein Machine-Learning classification • 1.9k views

The straightforward answer is that there’s no straightforward answer. Biology isn’t a fan of rules, generally. So what goes for one protein family (X) might not hold for protein family Y. Short fragments could be meaningless noise/junk for one type of protein (especially if it’s on the larger side perhaps), but could be meaningful in other contexts.

Why machine learning? Why not use profile HMMs?

I am using 10 machine learning techniques that were discussed in this paper. My goal is to compare and contrast the algorithms using a standardized data set. I am interested in finding which method(s) work best and in what situations. At this point, what the proteins are seems incidental.

What kind of classification are you thinking about? Is your set made up of full-length proteins (irrespective of size)? Proteins can have multiple domains, and if your classification is at that level then you may be able to use the entire set.

I intend to look at several different classification methods: KNN, SVM, Decision Trees, Random Forests, etc. I have not chosen a final candidate yet. This work is exploratory so far. I plan to use the caret package in R to test several algorithms. The caret package, if you are not familiar with it, provides a consistent interface for running many models quickly, almost at once.
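
Most of these classifiers expect a fixed-length numeric vector, so sequences of different lengths have to be turned into features before any model can be fitted. As a rough illustration (in Python with only the standard library, not R/caret), the sketch below computes amino-acid composition and runs a plain k-nearest-neighbour vote. The choice of composition features, the helper names, and the toy sequences are all assumptions made for illustration; none of them come from this thread.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino-acid frequency vector of a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]

def knn_predict(train, query_seq, k=3):
    """Majority vote among the k training sequences closest in composition."""
    q = composition(query_seq)
    by_distance = sorted(
        train, key=lambda item: math.dist(composition(item[0]), q)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training data: lysine-rich "X" versus leucine-rich "not-X".
train = [
    ("KKAKKSKKGK" * 6, "X"),
    ("KAKKRKKSKK" * 6, "X"),
    ("KKGKKAKRKK" * 6, "X"),
    ("LLALLVLLIL" * 6, "not-X"),
    ("LVLLALLILL" * 6, "not-X"),
    ("ILLLVLALLL" * 6, "not-X"),
]

full_length_query = "KKSKKAKKGK" * 6
fragment_query = "LLKL"  # four residues give only a coarse composition estimate
print(knn_predict(train, full_length_query))
print(knn_predict(train, fragment_query))
```

With features like these, a short fragment is represented by very few residues, which is one concrete way the fragments could behave differently from full-length sequences in any of the methods listed.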

What are you trying to classify proteins based on? "belonging to a group" doesn't really mean anything.

Not being an ML person, I am thinking of the resulting classification in biological terms. Are you interested in classifying the proteins at the domain, structure, or function level? If that makes sense in ML terms.

@genomax I think you have hit upon my main question. I am curious whether the protein fragments carry the domain information that the full-length protein has, and how misleading the addition of the fragments will be in determining the classifications. Will the fragments give misleading classifications? I am also concerned that I will not be able to tease out which attributes (secondary, tertiary, quaternary) will be dominant. Or is this a unique and interesting question?
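
One way to check whether the fragments mislead a classifier is to score predictions separately for fragments and for longer sequences, instead of reporting a single pooled accuracy. A minimal Python sketch, assuming predictions are already available as (sequence length, true label, predicted label) tuples; the function name, the reuse of the 50 AA cut-off, and the toy results are illustrative only.

```python
from collections import defaultdict

def accuracy_by_length(results, min_len=50):
    """Return accuracy per length class for (length, truth, prediction) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for length, truth, prediction in results:
        group = "fragment" if length < min_len else "full_length"
        totals[group] += 1
        hits[group] += int(truth == prediction)
    return {group: hits[group] / totals[group] for group in totals}

# Made-up results purely to exercise the function; not real measurements.
toy_results = [
    (312, "X", "X"),
    (280, "not-X", "not-X"),
    (450, "X", "X"),
    (35, "X", "not-X"),
    (22, "not-X", "not-X"),
]

for group, acc in accuracy_by_length(toy_results).items():
    print(f"{group}: {acc:.2f}")
```

A large gap between the two groups on real data would suggest the fragments behave like a different population, whichever algorithm produced the predictions.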

Could you please explain this statement:

I have downloaded proteins from Uniprot, for example, protein-X vs Not protein-X.

What did you download from Uniprot, and how is this already classified in a binary manner?
