Question

Best Representation of Protein Sequences

7.5 years ago

khaeuk ▴ 100

Hello,

I am currently working on a project that applies neural networks to protein sequences, but I am stuck on how best to represent the sequences. I will be getting protein sequence data from PDB, and each sample will have a different sequence length. Ideally, I would like to represent all the protein sequences at a fixed size so that I can pass them into the neural network.

I found some packages that compute numerical representations of features (descriptors), but each descriptor has a different dimension. For example, amino acid composition has dimension 20, dipeptide composition has dimension 400, and autocorrelations have dimension 240.
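To make the descriptor dimensions concrete, here is a minimal sketch of the first two, written in plain Python rather than with any particular package. The function names and the choice to skip non-standard residues are assumptions for illustration only. Each descriptor has a fixed size regardless of sequence length, so they can simply be concatenated (20 + 400 = 420 here).

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Fraction of each standard amino acid in seq (length-20 list)."""
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (length-400 list)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing non-standard residues (X, B, ...) are skipped.
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def fixed_length_features(seq):
    """Concatenate both descriptors into one length-420 vector."""
    return aa_composition(seq) + dipeptide_composition(seq)
```

For example, `fixed_length_features("MKT")` and `fixed_length_features("MKTAYIAKQR" * 10)` both return lists of length 420, which is what a fixed-size input layer needs.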

I did think about aligning the sequences to get a fixed length, but then I'm not sure how to represent insertions/deletions for each descriptor. In any case, what are some good ways to represent protein sequences?

Thank you so much!

R Python NeuralNetworks MachineLearning Protein • 1.6k views

updated 7.5 years ago by Asaf 10k • written 7.5 years ago by khaeuk ▴ 100

score 0 · Answer 1 · 2018-02-12

From my limited knowledge of NNs, I think that you shouldn't represent 3D structures using features. The great benefit of NNs is that the machine can generate the features itself. I think you would want the AAs themselves plus the angles between AAs as input, so you'll have a 20+ dimensional vector for each position. You do have a problem with unequal lengths, but I think you can overcome this within the layers of the network, using a function that takes input of varying size and outputs a fixed-size vector. It all depends on what you're trying to predict.
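One simple function of that kind is global pooling over the sequence positions. The sketch below is only an illustration under my own assumptions: it one-hot encodes each residue (20 channels, with unknown residues left as all zeros) and then takes the mean and max over positions, giving a 40-dimensional vector for any sequence length. In a real network the per-residue vectors would usually pass through learned layers (convolutional or recurrent, say) before pooling, and extra per-residue columns such as angles could be appended to each row. The function names are hypothetical.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode seq as a list of length-20 rows, one row per residue."""
    rows = []
    for residue in seq.upper():
        row = [0.0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        # Assumption: non-standard residues stay as an all-zero row.
        if idx is not None:
            row[idx] = 1.0
        rows.append(row)
    return rows


def global_pool(rows):
    """Collapse a variable number of rows into one fixed-size vector.

    Returns the per-channel mean followed by the per-channel max,
    so the output length is 2 * (number of channels).
    """
    if not rows:
        raise ValueError("cannot pool an empty sequence")
    n = len(rows)
    columns = list(zip(*rows))  # one tuple per channel
    mean_part = [sum(col) / n for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part
```

Both `global_pool(one_hot("MKT"))` and `global_pool(one_hot("MKTAYIAKQR" * 5))` have length 40, so sequences of different lengths end up in the same input space. Pooling does throw away residue order, which is why learned layers normally come before it.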