Hi everyone. Here are my questions. I have come up with a machine learning-based method for predicting protein secondary structure. I evaluated my method on the publicly available RS126 dataset. However, as it is a little old, I decided to evaluate my method on a few more recent datasets as well. I read recent articles and noticed that most of them have employed the CASP13, CASP12 and CASP11 datasets. I downloaded them from predictioncenter.org. There are many files included, but as far as I understand, what I need is the sequence file (the amino acid chains). What I don't understand is why there is no secondary structure class label for the residues of the corresponding sequences. Can anyone explain why? And does anyone know any other popular, publicly available and recent datasets for evaluating protein secondary structure prediction? Thanks heaps.
There are a lot of secondary structure prediction tools, but what they give you is itself a prediction, not a ground-truth label.
Like this one: http://download.igb.uci.edu/Bioinformatics-2014-Magnan.pdf
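Why not use proteins from the PDB (Protein Data Bank)?
Each entry enumerates the secondary structure elements of the determined structure.
For example, HELIX records mark helices (typically alpha-helices) and SHEET records mark beta-strands... But there are a lot of points of view on how to assign secondary structure.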
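To make the PDB suggestion concrete, here is a minimal sketch of turning HELIX and SHEET records into per-residue labels. It assumes the legacy fixed-column PDB text format, collapses every helix class into a single "H" label, ignores insertion codes, and labels everything else "C". The two records in `SAMPLE` and the function names are made up for illustration, they do not come from a real entry.

```python
# Hypothetical HELIX/SHEET records in legacy fixed-column PDB format.
SAMPLE = """\
HELIX    1  H1 ALA A    2  LEU A    6  1
SHEET    1   A 2 VAL A   9  ILE A  11  0
"""

def parse_ss_records(pdb_text):
    """Return (label, chain, start, end) tuples from HELIX/SHEET records."""
    segments = []
    for line in pdb_text.splitlines():
        if line.startswith("HELIX "):
            # HELIX: chain ID in column 20, start residue in 22-25, end in 34-37.
            segments.append(("H", line[19], int(line[21:25]), int(line[33:37])))
        elif line.startswith("SHEET "):
            # SHEET: chain ID in column 22, start residue in 23-26, end in 34-37.
            segments.append(("E", line[21], int(line[22:26]), int(line[33:37])))
    return segments

def label_residues(segments, chain, first, last):
    """Build a 3-state string (H, E, C) for residues first..last of one chain."""
    labels = ["C"] * (last - first + 1)
    for label, seg_chain, start, end in segments:
        if seg_chain != chain:
            continue
        # Clip each segment to the requested residue range.
        for resnum in range(max(start, first), min(end, last) + 1):
            labels[resnum - first] = label
    return "".join(labels)

segments = parse_ss_records(SAMPLE)
print(label_residues(segments, "A", 1, 12))  # CHHHHHCCEEEC
```

Real entries need more care than this (missing residues, insertion codes, several chains), so treat it only as a starting point.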
For example, this is available on GitHub:
Standardized data set for machine learning of protein structure
https://github.com/aqlaboratory/proteinnet
It may be more useful to you.
Concerning your question about nucleotides instead of proteins...
If you know the genetic code for a particular species, you can easily
translate a nucleotide sequence into a protein sequence, but not back, because of the redundancy of the code,
right? And the genetic code itself may vary slightly between organisms.
Nucleotide sequences look more reliable.
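
The redundancy point can be shown in a few lines. This is a rough sketch that assumes the standard genetic code (NCBI translation table 1); other organisms or organelles may need a different table. The two DNA strings and the names `CODON_TABLE` and `translate` are made up for the example.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a coding DNA sequence in frame 0, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # also accept RNA input
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

# Two different nucleotide sequences give the same protein.
print(translate("ATGCTTTCTTAA"))  # MLS
print(translate("ATGCTGAGCTGA"))  # MLS

# Leucine alone is encoded by six codons, so back-translation is ambiguous.
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "L"))
```

Because several codons map to the same amino acid, the protein sequence alone cannot tell you which nucleotide sequence it came from.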