Question

Best feature vector representation of a protein model?

0

10.0 years ago

Random • 0

I have a number of protein models of varying lengths in PDB format, and I'm trying to apply machine learning to them to predict their energy. I have the energy value of each protein model.

The problem is that machine learning algorithms generally require a fixed-length vector representation, but my protein models all have different lengths.

Does anyone know of a fixed-length vector representation for protein models?

machine learning • 3.8k views

updated 2.8 years ago by Ram 44k • written 10.0 years ago by Random • 0

Ram · Answer 1 · 2014-12-14

You could first calculate a protein similarity matrix and then apply a kernel-based method (such as a kernel SVM) to predict the energy. There are many ways to measure protein similarity, such as Smith-Waterman local alignment scores or BLAST bit scores. With this approach, the length of the proteins need not influence the prediction (Applicability Domain). For example, if you have 200 proteins, you get a 200 × 200 similarity matrix, which is then used to build a machine learning model that predicts the corresponding energy values.
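As a rough illustration of the similarity-matrix idea, here is a minimal pure-Python sketch that scores every pair of sequences with a simplified Smith-Waterman local alignment and collects the scores in an N × N matrix. The scoring scheme (match = 2, mismatch = -1, linear gap = -2) and the function names are assumptions made for this example; a real analysis would more likely use a substitution matrix such as BLOSUM62 with affine gaps, via an established alignment tool.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b.

    Simplified scoring (an assumption for this sketch): fixed
    match/mismatch values and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_matrix(seqs):
    """Symmetric N x N matrix of pairwise local alignment scores."""
    n = len(seqs)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = smith_waterman_score(seqs[i], seqs[j])
            sim[i][j] = sim[j][i] = score
    return sim


# Toy sequences of different lengths (made up for illustration).
seqs = ["ACDEFGHIK", "CDEFG", "MKVLAAGIV"]
sim = similarity_matrix(seqs)
```

The matrix has the same shape no matter how long the individual proteins are, which is the point of the approach. One caveat: raw alignment scores are not guaranteed to form a positive semidefinite kernel, so some normalisation or transformation may be needed before passing the matrix to a kernel method with a precomputed kernel.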

Hope this helps.

score 1 · Answer 2 · 2016-07-26

You may want to consider using features based on residue cluster classes (http://www.sciencedirect.com/science/article/pii/S147692711530092X).

These are 26 features based on residue contacts and primary sequence contiguity.

There is an easy-to-follow IPython notebook showing how to use them for structural classification: https://github.com/RicardoCorralC/rccPyDataLondon2015

The easiest way to get these features from a PDB file is to use a web service:

curl -X POST -F file=@1HIV.pdb 'http://neuralprotein.io:5000/api/v1/pdb/fold_space_vector/?distance_cutoff=5.0'

where 1HIV.pdb may be any PDB file.

Hope this helps.

Ram · Answer 3 · 2014-12-13

0

10.0 years ago

linus ▴ 360

Are your sequences related to each other?

If yes:

How about aligning them? Afterwards you would have equal-length vectors, which you could, for example, encode very easily with a 20-bit vector for each amino acid position. Alternatively, you could use a BLOSUM-based representation or pick a set of interesting attributes from http://www.genome.jp/aaindex/
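A minimal sketch of the 20-bit-per-position encoding could look like the following. It assumes the sequences are already rows of a multiple sequence alignment (so they all have the same length), and it encodes gaps and non-standard residues as all zeros, which is one possible convention rather than a fixed rule. The toy alignment and the function name are made up for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_aligned(aligned_seq):
    """Encode one aligned sequence as a flat 0/1 vector.

    Each alignment column contributes 20 entries. Gaps ('-') and any
    non-standard symbols become 20 zeros (an assumption of this sketch).
    """
    vec = []
    for ch in aligned_seq.upper():
        column = [0] * len(AMINO_ACIDS)
        if ch in INDEX:
            column[INDEX[ch]] = 1
        vec.extend(column)
    return vec


# Toy alignment: every row has the same number of columns.
alignment = ["AC-DE", "ACWDE", "A--DE"]
features = [one_hot_aligned(row) for row in alignment]
```

Every row of `features` has length 20 × (number of alignment columns), so the result can be fed directly to a standard learner. Swapping the 0/1 column for a BLOSUM row or a few AAindex property values would follow the same pattern.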

If not:

You say you want to predict their energy. I may be wrong, but isn't the length a crucial part of the energy calculation (depending, of course, on which energy you calculate)? So instead of representing the amino acids, you could build the vector from calculated properties of your proteins, such as the number of helices. (To be honest, though, I do not think this will yield good predictions.)

updated 2.8 years ago by Ram 44k • written 10.0 years ago by linus ▴ 360

0

Hi Linus, thanks for the response. The sequences are actually not related to each other. Using properties of the proteins is tough because it would give bad predictions. I am interested in using the distances between the atoms in the protein model. Is there a standard way to represent a protein model as a feature vector based on atomic distances?

10.0 years ago by Random • 0