Hi everybody,
I have about 2000 protein structures that need to be classified based on their structural and sequence similarity. I need to sort them into several groups and then pick one representative for each group. Can you suggest a reliable and useful tool for that? I also need to record the RMSD values for each. I would appreciate any help. Thank you!
Thank you so much! I checked the sites for existing classifications but could not find any. Since I am a real newcomer to this field, what would you suggest?
In my opinion, this is not a project for a beginner to do without supervision, especially on such a large number of proteins.
As to the classification, let's say that PDB structure 4pze is one of those you are interested in. Simply go to the ECOD site listed above, choose "search by PDB ID", and enter 4pze. You will get this output: http://prodata.swmed.edu/ecod/complete/search?kw=4pze&type=pdbid
If you do the same exercise in the CATH database, this will be the output:
http://www.cathdb.info/search?q=4pze
That particular protein has two domains, as you will see from the classification. I don't know which one would be interesting to you; you will have to figure that out yourself. ECOD is updated weekly, so you may want to stick with that database. CATH is also updated fairly frequently, though I think not weekly. The SCOP database is updated least often, so you may want to skip it.
Now, you don't want to copy and paste a PDB code 2000 times into the search box, but all these databases have downloadable files that can be parsed locally for matches to your group of proteins. Once you find your domain of interest, it is pretty straightforward to match your IDs to the existing classification, though it may not be so for you if you have never done it. My point still stands: it would be much faster and more accurate to use the classification from existing databases, than to classify on your own based on RMSD comparisons.
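To give a rough idea of what parsing locally can look like, here is a minimal Python sketch. It assumes the downloaded classification file is tab-delimited, has header lines starting with "#", and stores the PDB ID in one known column. The function name match_domains, the column index, and the sample rows are all made up for illustration, so check the actual layout of the file you download before relying on any of it.

```python
import csv
import io
from collections import defaultdict


def match_domains(lines, wanted_ids, pdb_col=4, comment_prefix="#"):
    """Collect rows of a tab-delimited classification table by PDB ID.

    pdb_col is an ASSUMED column position; adjust it to the real file.
    """
    wanted = {pdb_id.lower() for pdb_id in wanted_ids}
    hits = defaultdict(list)
    # Skip blank lines and header/comment lines before parsing.
    data_lines = (
        ln for ln in lines
        if ln.strip() and not ln.startswith(comment_prefix)
    )
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) <= pdb_col:
            continue  # malformed or short row
        pdb_id = row[pdb_col].lower()
        if pdb_id in wanted:
            hits[pdb_id].append(row)
    return dict(hits)


# Hypothetical rows, NOT real database content.
sample = io.StringIO(
    "#uid\tdomain_id\trep\tf_id\tpdb\tchain\n"
    "1\tdomA\tx\t1.1.1.1\t4pze\tA\n"
    "2\tdomB\tx\t2.1.1.1\t4pze\tA\n"
    "3\tdomC\tx\t3.1.1.1\t1abc\tB\n"
)
result = match_domains(sample, ["4PZE", "9zzz"])
for pdb_id, rows in result.items():
    print(pdb_id, [row[1] for row in rows])
```

With a real file you would pass an open file handle instead of the sample, and your 2000 IDs instead of the two shown. An ID with several rows has several domains, which is where you still have to decide which domain you care about.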