Hi,
I am currently working on a project to build a neural network that takes as input an amino acid sequence (a protein fragment) with a fixed length of 34. The goal is to predict whether or not the input sequence belongs to a certain class of repeat (TPR). Long story short:
My problem is how to encode the sequence so that it is a proper input for the network. I thought about encoding each amino acid as a vector of 20 bits (one per amino acid), with a '1' at the position representing the current amino acid and '0' in the other 19 positions. Concatenating these vectors gives a vector of length 20 * 34 = 680, which is quite big.
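A minimal sketch of this encoding in plain Python, assuming the 20 standard amino acids as one-letter codes in an arbitrary but fixed order. The names `AMINO_ACIDS`, `AA_TO_POSITION` and `one_hot_encode` are illustrative only, and the example fragment is made up to show the shape, not a real TPR sequence:

```python
# The 20 standard amino acids as one-letter codes (the ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_POSITION = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return a flat list of len(sequence) * 20 zeros and ones."""
    encoded = []
    for aa in sequence:
        bits = [0] * len(AMINO_ACIDS)
        bits[AA_TO_POSITION[aa]] = 1  # raises KeyError for non-standard residues
        encoded.extend(bits)
    return encoded

# Hypothetical 34-residue fragment, used only to show the resulting size.
fragment = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQ"
vector = one_hot_encode(fragment)
print(len(vector))  # 680
print(sum(vector))  # 34, exactly one '1' per residue
```

Keeping the result as 34 rows of 20 instead of flattening it would probably be more convenient if the network has convolutional layers, but that depends on the architecture.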
So does anybody here have any experience with representing an amino acid sequence so that it can be used as input for a neural network?
Thank you!
Your one-hot encoding is commonly used, but you could also try to use physical/chemical properties (look up AAINDEX) to represent the amino acids.
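As a rough sketch of the property-based route, the snippet below maps each residue to its Kyte-Doolittle hydropathy value, one of the scales catalogued in AAindex. Picking this single scale and the name `encode_hydropathy` are assumptions for illustration; in practice you would probably combine several properties and normalize them before feeding them to the network:

```python
# Kyte-Doolittle hydropathy scale (one value per standard amino acid).
# Using only this scale is an illustrative assumption, not a recommendation.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode_hydropathy(sequence):
    """Return one hydropathy value per residue (length equals len(sequence))."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence]

print(encode_hydropathy("AIR"))  # [1.8, 4.5, -4.5]
```

With one property per residue, a 34-residue fragment becomes a vector of length 34 instead of 680; with k properties it becomes 34 * k.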
Thank you. I'll take some properties from AAINDEX along with the one-hot encoding and see what the results will be.
Hi! Were you able to gather some experience with this issue? I am also about to try both options to see which performs better, but I suspect there might be other encoding schemes that are more efficient.
This expectation arises, perhaps naively, because in my case of 12 AAs the information content is theoretically about 52 bits (= log2(20**12)), but one-hot encoding yields 240 bits, giving a very sparse matrix, which in turn raises doubts about the efficiency of the convolution in later steps.
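The arithmetic behind those numbers, as a small sketch. It assumes all 20 amino acids are equally likely and independent at every position, which real proteins do not satisfy, so the first figure is only an upper bound; the function names are illustrative:

```python
import math

def information_bits(length, alphabet_size=20):
    """Bits needed to index every possible sequence: log2(alphabet_size ** length)."""
    return length * math.log2(alphabet_size)

def one_hot_bits(length, alphabet_size=20):
    """Bits used by a one-hot encoding: one bit per (position, amino acid) pair."""
    return length * alphabet_size

print(round(information_bits(12), 1))  # 51.9
print(one_hot_bits(12))                # 240
```

For the 34-residue fragments from the original question, the same functions give roughly 147 bits versus 680.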
Currently reading this: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00073