I'm currently doing research to develop a deep learning method that predicts whether two proteins have similar functions, given the features of the two proteins.
To train this model, I need a suitable dataset. The requirements for the dataset are:
- it contains enough protein pairs, each with a reasonable number of pairwise features
- every protein in the dataset has known GO terms (I will use the GO terms to calculate the semantic similarity of two proteins, which serves as the label to train the model)
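
To make the labelling idea concrete, here is a minimal sketch of how a pair label could be derived from GO terms. It uses plain Jaccard overlap of the two GO term sets as a deliberately simple stand-in; this is an assumption for illustration, not an ontology-aware semantic similarity measure, and the function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(terms_a, terms_b):
    """Overlap of two GO term collections, between 0.0 and 1.0.

    Simplified stand-in for a semantic similarity score: it ignores the
    structure of the GO graph and only counts shared term IDs.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        # No annotation on either side: treat as no evidence of similarity.
        return 0.0
    return len(a & b) / len(a | b)
```

A real label would more likely come from a measure that takes the GO hierarchy into account, but the input (one set of GO terms per protein) would be the same.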
Is there any dataset that meets these requirements?
So far, I have only found one dataset, here: http://mine5.ics.uci.edu:1026/gain.html
It was generated from Lindahl's dataset and includes pairwise features, but it has no GO term annotation.
Total number of unique proteins: 976
Total number of query-template pairs: 951600
Some of the protein names look like this:
1chl-d1chl
1tnfa-d1tnfa
1eac-d1eaf
1gdha-d1gdha2
2avia-d2avia
3pgm-d3pgm
1brnl-d1brnl
1pgga-d1prha2
..
I don't know what the name format means; it looks like a combination of PDB and SCOP identifiers.
If I use this dataset, how can I find the GO terms for each protein in it?
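
If that guess about the format is right, the names could be split as in the sketch below. The interpretation is an assumption, not something the dataset documents: the part before the hyphen is read as a 4-character PDB ID plus an optional chain letter, and the part after it as `d` followed by a 4-character PDB ID and an optional chain/domain suffix, as in a SCOP domain identifier. The names `NAME_RE` and `parse_name` are hypothetical.

```python
import re

# Assumed layout: <pdb id><optional chain>-d<pdb id><optional chain/domain suffix>
NAME_RE = re.compile(
    r"^(?P<pdb>[0-9][a-z0-9]{3})(?P<chain>[a-z0-9]?)"
    r"-d(?P<scop_pdb>[0-9][a-z0-9]{3})(?P<scop_suffix>[a-z0-9_]*)$"
)

def parse_name(name):
    """Split a dataset name into its assumed PDB and SCOP-like parts.

    Returns a dict of the four fields, or None if the name does not
    fit the assumed layout.
    """
    match = NAME_RE.match(name.strip().lower())
    return match.groupdict() if match else None

for example in ["1chl-d1chl", "1gdha-d1gdha2", "1pgga-d1prha2"]:
    print(example, parse_name(example))
```

Note that in names such as 1eac-d1eaf and 1pgga-d1prha2 the two halves carry different PDB IDs, so the two parts should be kept separate rather than assumed to refer to the same entry.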
Thanks!
Did you try to find orthologs of your query proteins? If you find orthologs, try looking up their GO terms in the GOA database.
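
As a rough sketch of the lookup step, GOA annotations are distributed as tab-separated GAF files, in which column 2 holds the database object ID (for example a UniProt accession) and column 5 holds the GO ID. The snippet below assumes you already have accessions for your proteins or their orthologs and have downloaded a GAF file; the function name `load_go_terms` and the file name in the usage note are hypothetical.

```python
from collections import defaultdict

def load_go_terms(gaf_lines):
    """Map each object ID (GAF column 2) to its set of GO IDs (column 5).

    Comment lines start with '!'. Annotations whose qualifier (column 4)
    contains NOT are skipped, since they state that the term does not apply.
    """
    go_terms = defaultdict(set)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # malformed or truncated line
        if "NOT" in cols[3].split("|"):
            continue
        go_terms[cols[1]].add(cols[4])
    return dict(go_terms)
```

Usage would be along the lines of `with open("annotations.gaf") as fh: terms = load_go_terms(fh)`, followed by `terms.get(accession, set())` for each protein of interest.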