Hello,
I am currently working on a project that applies neural networks to protein sequences, but I am stuck on the best way to represent the sequences. I will be getting protein sequence data from PDB, and each sample will have a different sequence length. Ideally, I would like to represent all the protein sequences at a fixed size so that I can pass them into the neural network.
I found some packages that compute numerical representations of sequence features (descriptors), but each descriptor has a different dimension. For example, amino acid composition has 20 dimensions, dipeptide composition has 400, and the autocorrelation descriptors have 240.
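To make the descriptor idea concrete, here is a minimal sketch of the two simplest ones, written by hand rather than taken from any particular package. The function names (`aa_composition`, `dipeptide_composition`, `featurize`) are my own, and I am assuming that non-standard residue codes such as X are simply skipped. It shows that each descriptor has a fixed length regardless of sequence length, so descriptors of different dimensions can be concatenated into one fixed-size vector.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aa_composition(seq):
    """Fraction of each standard amino acid in seq (always length 20)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # assumption: non-standard symbols are ignored
    return [c / total if total else 0.0 for c in counts]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (always length 400)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def featurize(seq):
    """Concatenate both descriptors into one vector of length 20 + 400."""
    return aa_composition(seq) + dipeptide_composition(seq)
```

With this, `featurize` returns a vector of length 420 for any input sequence, short or long, at the cost of discarding most of the positional information.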
I did think about aligning the sequences to get a fixed length, but then I am unsure how to represent insertions/deletions for each descriptor. In any case, I would like to know: what are some good ways to represent protein sequences?
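For the alignment route, the only approach I can think of is a per-position one-hot encoding in which the gap is treated as a 21st symbol. Below is a rough sketch under that assumption; the helper name `one_hot_aligned` and the choice to pad short sequences with gaps and truncate long ones are mine, not something from a package. It gives a fixed-size matrix, but it does not tell me how gaps should affect the composition-style descriptors above.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard residues plus '-' for a gap
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_aligned(aligned_seq, length):
    """Encode an aligned sequence as a `length` x 21 matrix (list of lists).

    Assumptions: sequences shorter than `length` are padded with gaps,
    longer ones are truncated, and unknown symbols become all-zero rows.
    """
    padded = aligned_seq.upper()[:length].ljust(length, "-")
    matrix = []
    for ch in padded:
        row = [0] * len(ALPHABET)
        if ch in INDEX:
            row[INDEX[ch]] = 1
        matrix.append(row)
    return matrix
```

For example, `one_hot_aligned("MK-V", 6)` produces a 6 x 21 matrix in which the third row and the two padding rows are marked in the gap column.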
Thank you so much!