I would like to understand how the PDB handles domains. My understanding is that the internal fold of a domain is independent of the inter-domain contacts that may occur in the final 3D protein structure. If that is the case, why are domains that recur across a variety of proteins repeated in every PDB file in which they occur, rather than stored once in some kind of non-redundant representation and referenced? Is there a way to obtain a non-redundant representation of the three-dimensional structures of protein domains in terms of internal coordinates?
I agree, I think understanding CATH and how it breaks down and clusters these units is the key. The only trouble is, I was hoping that wherever a domain occurs it will always have the same, or close to the same, torsion angles. Actually, I would be satisfied if all intra-domain contacts were always the same across molecular complexes. I'm not sure whether that is the case. I will test it on a few examples from the PDB, but even if I find they are equal my general theoretical knowledge will still be lacking, since that may not always hold.
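The test described above (are the intra-domain contacts the same in two copies of a domain?) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the coordinates are made-up Cα positions rather than real PDB data, the 8 Å cutoff and the minimum sequence separation of 3 are common but arbitrary choices, and `contact_set` and `contact_overlap` are hypothetical helper names. It also assumes the two copies have already been put on a common residue numbering.

```python
import math

def contact_set(ca_coords, cutoff=8.0, min_separation=3):
    """Return the set of residue index pairs (i, j) whose C-alpha atoms
    lie within `cutoff` angstroms, ignoring near neighbours in sequence."""
    contacts = set()
    for i in range(len(ca_coords)):
        for j in range(i + min_separation, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts

def contact_overlap(contacts_a, contacts_b):
    """Jaccard index of two contact sets: 1.0 means identical contacts."""
    union = contacts_a | contacts_b
    if not union:
        return 1.0
    return len(contacts_a & contacts_b) / len(union)

# Hypothetical C-alpha traces for the "same" domain in two different entries.
copy_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
# In the second copy the last residue has swung away from the rest.
copy_b = copy_a[:5] + [(0.0, 12.0, 0.0)]

overlap = contact_overlap(contact_set(copy_a), contact_set(copy_b))
print(f"contact overlap: {overlap:.2f}")
```

An overlap well below 1.0 for two copies of a domain would suggest that the intra-domain contacts are not conserved between those two entries.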
For Cα atoms the torsion angles will be somewhat better preserved, but overall the structures of a domain are not the same even for the same protein in the PDB, since there will be differences in sequence, in ligands, and in the software used to build the 3D protein model from the electron density map.
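To check how well Cα torsion angles are preserved between two copies of a domain, one option is to compute the pseudo-torsion defined by each run of four consecutive Cα atoms and compare them position by position. The sketch below is an illustration only: the function names are hypothetical, the input is assumed to be a plain list of (x, y, z) tuples that you have already extracted from a structure file, and the dihedral is computed with the usual atan2 formulation.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def ca_pseudo_torsions(ca_coords):
    """One pseudo-torsion per window of four consecutive C-alpha atoms."""
    return [dihedral(*ca_coords[i:i + 4]) for i in range(len(ca_coords) - 3)]

def angle_difference(a, b):
    """Absolute difference between two angles in degrees, with wrap-around."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def torsion_deviations(ca_a, ca_b):
    """Per-position pseudo-torsion differences for two equal-length traces."""
    return [angle_difference(x, y)
            for x, y in zip(ca_pseudo_torsions(ca_a), ca_pseudo_torsions(ca_b))]
```

Calling `torsion_deviations(trace_a, trace_b)` on two aligned Cα traces of the same domain gives a list of angular differences; large values mark the places where the two copies diverge.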
That is very interesting. Then from what I can tell, if one were to mine PDB structures to predict contacts or conformations, the best level at which to do this is the chain, since domains within chains will have different conformations depending on the chain (sequence) they are part of.
But I fear it may not be as simple as that, since ligands are not part of the peptide chains per se and nonetheless have an effect on the conformations of the chains they contact? Hmmm.
There are many papers out there using ML and statistical techniques to predict conformations from sequence and known structural data, but I have not seen these considerations addressed, and perhaps the mining of structural information is not as simple as it may seem at first. It seems that what I have learned on this thread is that a sequence without its context is not meaningful, and that Anfinsen's principle applies only to the full biological unit and not at the domain or even the chain level.
You asked if the structures are different. They are different in the PDB, which is a repository of molecular models from crystallography and NMR experiments. PDB structures have changes in sequence, extra ligands added, and sometimes heavy atoms added to help solve the crystal structure.
You are now switching to questions about protein structure in a living organism, where the sequence is "original" and there are chaperones and other machinery to guide folding. There are no extra ligands or heavy atoms added, because a living organism has no intention of crystallizing the protein.
When you want to predict the function of a protein, its folding, or its interactions with other proteins or drugs, you want to do it for a "living organism", and the PDB structure is only a starting point. In the end, the structure of a protein there is shaking, because it is at a relatively high temperature compared to the crystallized state. Moreover, some parts of the protein do not have a stable structure at all and are always moving around =)
Head exploding now. Your answer is very informative, and the exploding is not due to misunderstanding but to understanding that this is more complicated than I initially thought for a CS/math person to address in a vacuum. When CASP puts out a structure target for folks to try to predict because it will soon be crystallized, I was under the impression that they were predicting the protein conformation with respect to the newly submitted PDB entry. Perhaps that is the case, and what you are saying is that its biological relevance is limited. Or else it is the best one can do in terms of structure prediction, because conformations in solution are not so easily obtained, and hence this sort of prediction of the crystallized state is only a guide to the molecular biologist in his investigation of the real structure and function of the entire complex. Sorry to ask such an extended question, but I think this is very close to a final summation of my misunderstanding on this point. Thanks again for your help, it is very much appreciated.
Could you please clarify your question?
Rather than clarify my question, I'm going to look at the ICM-Browser and do as you instruct below. If that is not free I'll try it with PyMOL. I haven't used either yet. I think you are right that I've gone as far as I can with juggling the concepts in the abstract, and I need to put my head back down and see what I find in the direction you indicated.
As an aside, I'm not sure what you mean by 'differences in sequence'. I thought a sequence is what defines a protein. I'm assuming you mean differences in the overall sequence a 'domain' is found in.
I meant that protein sequences in the PDB are usually truncated or mutated, or have insertions and deletions, compared to what you would normally see in the living organism you study (for example in RefSeq).
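As a rough way to see such differences for a given entry, one could compare the sequence of the crystallized construct with the reference sequence and list where they disagree. The sketch below uses Python's standard `difflib.SequenceMatcher`, which is a generic string matcher and not a biological alignment method, so treat it only as a quick first look. Both sequences are invented for illustration and `sequence_differences` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def sequence_differences(reference, construct):
    """List (tag, reference_segment, construct_segment) for every region
    where the construct differs from the reference sequence.
    Tags are 'delete', 'insert' or 'replace' as reported by difflib."""
    matcher = SequenceMatcher(None, reference, construct, autojunk=False)
    return [(tag, reference[i1:i2], construct[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Invented example: the construct is truncated at both ends and
# carries a single point mutation (Q -> A).
reference = "MKTAYIAKQRQISFVKSHFSRQ"
construct = "TAYIAKQRAISFVKSH"

for tag, ref_seg, con_seg in sequence_differences(reference, construct):
    print(tag, repr(ref_seg), "->", repr(con_seg))
```

For real entries a proper pairwise alignment with a substitution matrix would be a better choice than plain string matching.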