I have downloaded the RS126 dataset, and when I extracted it I got a bunch of files with a .pdb extension. I don't know how to open these files. I want to give them as input to a neural network in MATLAB for protein secondary structure prediction. I found in some papers that the sequence must be encoded into matrix format before it is given as input. Could anyone please let me know how to do this in MATLAB?
I would also like to know whether I need to pre-process the data before giving it to the network and, if so, how. For example, should I convert it to matrix form and give that matrix as input to the neural network? How can I do that?
You can start with a simple model of secondary structure made of 3 types of elements: alpha helices, beta sheets and everything else. Represent the secondary structure as a string that is aligned with the peptide sequence, one character per residue, for example ------AAAAA-----BBBBB- where A marks helix, B marks sheet and - marks everything else.
Both alpha helices and beta sheets are local structures that depend on the local sequence, so you do not need a very complicated model. You can start by predicting alpha helices, since they are largely determined by short patterns in the sequence (learning about secondary structures and folding will help you here).
If you want to use neural networks, you can give as input one set of peptide subsequences that form alpha helices and another set of subsequences that do not form alpha helices, and train your network to distinguish between the two. The leftover sequences can be used to test your network. Then you can add another block to your network to distinguish between beta sheets and everything else that is not an alpha helix. You can improve your network that way, module by module. You may want to first remove the dependence on subsequence length by starting with subsequences that all have the same length.
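The idea of feeding fixed-length subsequences to a network, together with the matrix encoding asked about in the question, can be sketched as a sliding window with one-hot encoding. This is only an illustration in Python, not MATLAB, although the same loops port directly to MATLAB arrays. The function names, the window length of 13, the padding symbol and the class order (A, B, -) are all assumptions made for this example; they do not come from the RS126 papers.

```python
# Minimal sketch: turn a sequence plus its A/B/- structure string into
# fixed-length, one-hot encoded windows (one training example per residue).
# All names and the default window size are illustrative assumptions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SLOTS = len(AMINO_ACIDS) + 1                  # extra slot: padding or unknown
CLASSES = "AB-"                               # helix, sheet, everything else


def one_hot_residue(aa):
    """Return a 21-element 0/1 list with a single 1 marking the residue."""
    vec = [0] * SLOTS
    vec[AA_INDEX.get(aa, SLOTS - 1)] = 1      # unknown symbols use the last slot
    return vec


def encode_windows(sequence, structure, window=13):
    """Encode every residue as a flattened window of one-hot vectors.

    Returns (inputs, targets): inputs[i] has window * 21 entries and
    describes the residues centred on position i; targets[i] is the index
    of that residue's class in CLASSES.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    padded = "*" * half + sequence + "*" * half   # '*' pads the chain ends
    inputs, targets = [], []
    for i, state in enumerate(structure):
        row = []
        for aa in padded[i:i + window]:
            row.extend(one_hot_residue(aa))
        inputs.append(row)
        targets.append(CLASSES.index(state))
    return inputs, targets


# Usage with a made-up 10-residue peptide (hypothetical data, not from RS126).
inputs, targets = encode_windows("MKTAYIAKQR", "--AAAAA---", window=5)
```

Each row of inputs is one training example, so stacking the rows gives the input matrix and targets gives the label vector. Because every window has the same length, the network sees inputs of constant size regardless of how long each protein is.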
I'm just wondering why you need to do this. Is it just because you can, or to educate yourself (which is fine)? Secondary structure prediction is pretty good these days already...
Also, I can't help but think that if these questions are already causing you issues, how do you plan to implement something as complicated as a neural network?
I don't mean to sound harsh; these just seem like very early, simple stumbling blocks to be having.
.pdb files should be plain text. You can open them in any text editor.
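Since the files are plain text, they can also be read line by line in a program. The sketch below is in Python and assumes the files follow the standard fixed-column PDB layout, with ATOM, HELIX and SHEET records; the files in a particular RS126 download may be laid out differently, so check one in a text editor first. The function name and the A/B/- labels are illustrative choices, and insertion codes and alternate locations are ignored for simplicity.

```python
# Minimal sketch: pull a one-letter sequence and a three-state structure
# string out of PDB-format text. Assumes standard PDB column positions.

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_pdb_text(pdb_text, chain_id="A"):
    """Return (sequence, structure) strings for one chain.

    structure uses 'A' for residues inside a HELIX record, 'B' for
    residues inside a SHEET record and '-' for everything else.
    """
    residues = []      # (residue number, one-letter code), in file order
    seen = set()       # residue numbers already collected
    ranges = []        # (label, first residue number, last residue number)
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ATOM" and len(line) >= 26:
            # One alpha-carbon (CA) atom per residue identifies the residue.
            if line[12:16].strip() == "CA" and line[21] == chain_id:
                number = int(line[22:26])
                if number not in seen:
                    seen.add(number)
                    name = line[17:20].strip()
                    residues.append((number, THREE_TO_ONE.get(name, "X")))
        elif record == "HELIX" and len(line) >= 37 and line[19] == chain_id:
            ranges.append(("A", int(line[21:25]), int(line[33:37])))
        elif record == "SHEET" and len(line) >= 37 and line[21] == chain_id:
            ranges.append(("B", int(line[22:26]), int(line[33:37])))
        elif record == "ENDMDL":
            break      # keep only the first model of multi-model files
    sequence = "".join(code for _, code in residues)
    labels = []
    for number, _ in residues:
        label = "-"
        for state, first, last in ranges:
            if first <= number <= last:
                label = state
        labels.append(label)
    return sequence, "".join(labels)
```

To use it on a file, read the file's contents into a string and call parse_pdb_text on that string with the chain of interest. The two strings it returns have the same length, one character per residue, which is the aligned sequence and structure representation needed before any numeric encoding.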