Best feature vector representation of Protein model?
3
0
Entering edit mode
9.9 years ago
Random • 0

I have a number of protein models of varying lengths in PDB format and I'm trying to do machine learning on them and predict their energy. I have the energy values of each of the protein models.

The problem is that machine learning algorithms obviously require a fixed length vector representation. The problem is that all my protein models have different lengths.

Does anyone know of a protein vector representation?

machine learning • 3.8k views
ADD COMMENT
1
Entering edit mode
9.9 years ago

Maybe, you can calculate the protein similarity matrix firstly, and the apply the kernel-based methods (such as kernel svm) to prediction the energy. Here, there are many method to get the protein similarity such as smith-waterman local aligment scores or blast bits scores. By applying such method, the length of protein may not have influence on prediction (Applicability Domain). For example, you have 200 proteins, and then you will get a 200 * 200 similarity matrix which will be used to build machine learning model to predict the corresponding energy values.

Hope this helps.

ADD COMMENT
1
Entering edit mode
8.3 years ago
ricardo • 0

You may want to consider using features based on reside cluster classes (http://www.sciencedirect.com/science/article/pii/S147692711530092X)

These are 26 features based on residue contacts and primary sequence contiguity

There is an easy to follow iPython notebook showing how to use this for structural classification : https://github.com/RicardoCorralC/rccPyDataLondon2015

Easiest way to get these features from a PDB file is by using a web service:

curl -X POST -F file=@1HIV.pdb 'http://neuralprotein.io:5000/api/v1/pdb/fold_space_vector/?distance_cutoff=5.0'

where 1HIV.pdb may be any PDB file

Hope this help.

ADD COMMENT
0
Entering edit mode
9.9 years ago
linus ▴ 360

Are your sequences related to each other?

If yes:

How about an alignment of them. Afterwards you would have equal length vectors, which you could for example encode very easy with a 20 bit vector for each AA position, or you could use some BLOSSUM representation or you pick a set of interesting attributes from http://www.genome.jp/aaindex/

If not:

You say you want to predict their energy. I may be wrong, but is the length not a very crucial part of the energy calculation (depending of course which energy you calculate). So maybe you could create vector instead of representing the AAs, you could calculate properties of your proteins, like number of helices or something else. (But to be honest I do not think this will yield in good predictions)

ADD COMMENT
0
Entering edit mode

Hi Linus; thanks for the response. The sequences are actually not related to each other. Using properties of the proteins is tough because it would give bad predictions. I am interested in using the distances of the atoms in the protein model; Is there a standard way to represent a protein model as a feature vector considering the atomic distances?

ADD REPLY

Login before adding your answer.

Traffic: 2229 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6