Dear all,
I came across a problem that I thought would be easier to solve. What I want to do is accessing the PDB and download structures associated with a certain protein family. I then want to align those structures. So far so easy. BUT of course for some members of the protein family there is several structures, of which I want only one. It doesn't matter which one, but it has to be a unique set of members of the protein family. I am not posting my code here because this is not where I need help. I would need advise to a approach that is feasible and biologically acceptable, if that makes sense.
My approach so far:
- Query PDB for structures associated with protein family. This gives me a list of PDB IDs
- Download Fasta files associated with PDB IDs
- Construct MSA using Clustalo
- calculate pairwise sequence identity for each pair of sequences in the alignment --> yields a numpy array with 0 --> 1 for sequence identity between pairs
- filter array by cutoff set to 0.95 (yes some members of the family have very high sequence similarity) and delete everything higher than 0.95 sequence identity
- extract indices from array to use as unique set
Here is the problem: I significantly reduce the number of sequences, but I end up with some duplicates, because sometimes the sequences of the PDB entries of the same protein have different length etc., leading to lower sequence identity. Meaning I miss those.
If anyone has a smart idea to filter a list of PDBs to obtain one unique structure for every protein, please don't hesitate to post it here. Any input is highly welcome!
Couple of thoughts that may be in some way useful: