Hello everyone. I'm still kind of new to the field, but for a project I have been trying to get two things.
1. I need a list of proteins (PDB IDs) that all have sequence identity in a specific range (e.g. 20-30%) to one other protein. That is, I will have one protein and a list of other proteins that are 20-30% identical to it. So far I have been using psi-cd-hit to get clusters at over 20% identity and then removing from these clusters all proteins with over 30% identity. Unfortunately, this gives me very small clusters (~10 proteins). So I was thinking of running a PSI-BLAST search with one protein against pdb_nr and taking from the results all hits whose % identity is in the range and that also cover, for example, >80% of the query (and maybe have less than 30% gaps?).
2. I need a list of proteins that have sequence identity in a specific range (e.g. 20-30%) to each other. That is, every pair of proteins in the group has 20-30% identity.
The difference between 1 and 2 is that in 1 I don't care what % identity the proteins in the list have to each other (as long as it isn't 100%, I guess, since those would be duplicates).
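To make the second requirement concrete, here is a minimal sketch of how such a group could be grown once all pairwise identities are known. It assumes the identities have already been computed by some alignment tool and stored in a dictionary; the function name, the dictionary layout, and the toy values are all hypothetical. It is a simple greedy pass, so the group it returns depends on the input order and is not guaranteed to be the largest possible one (finding the largest is a maximum clique problem).

```python
def greedy_pairwise_group(ids, pair_identity, low=20.0, high=30.0):
    """Grow a group in which every pair has identity within [low, high].

    pair_identity maps frozenset({a, b}) -> percent identity.
    Pairs missing from the dictionary are treated as out of range.
    """
    group = []
    for candidate in ids:
        # Accept the candidate only if it is in range against every current member.
        if all(
            low <= pair_identity.get(frozenset((candidate, member)), -1.0) <= high
            for member in group
        ):
            group.append(candidate)
    return group


# Made-up identities for four hypothetical proteins A-D.
pair_identity = {
    frozenset(("A", "B")): 25.0,
    frozenset(("A", "C")): 22.0,
    frozenset(("B", "C")): 45.0,  # too similar, so B and C cannot share a group
    frozenset(("A", "D")): 28.0,
    frozenset(("B", "D")): 21.0,
    frozenset(("C", "D")): 24.0,
}

group = greedy_pairwise_group(["A", "B", "C", "D"], pair_identity)
print(group)
```

For requirement 1 no such check is needed, since only the identity of each protein to the single query matters.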
I wanted to know if there are any tools for finding these proteins. Any help would be appreciated.
Thanks in advance
Is the 20-30% identity rule you have imposed to be measured across the entire length of the query or only across the returned alignment? This is important, as it determines whether a local or a global alignment tool is appropriate.
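As a rough illustration of why the denominator matters, the sketch below computes percent identity for the same hit in both ways. The function names and the numbers are made up for the example and do not come from any particular tool.

```python
def identity_over_alignment(matches, alignment_length):
    """Percent identity relative to the aligned region only (local view)."""
    return 100.0 * matches / alignment_length


def identity_over_query(matches, query_length):
    """Percent identity relative to the full query length."""
    return 100.0 * matches / query_length


# Hypothetical hit: 50 identical positions in a 200-column local alignment,
# for a query that is 400 residues long.
local_identity = identity_over_alignment(50, 200)
full_identity = identity_over_query(50, 400)
print(local_identity, full_identity)
```

With these made-up numbers the hit falls inside the 20-30% range when measured over the alignment, but well below it when measured over the whole query.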
Hm, considering that I want sub-regions of the protein sequence, and not just the sequence as a whole, to still have 20-30% identity to the query, I would say local alignment, but a local alignment that covers a large percentage of both proteins. It also shouldn't contain too many gaps, otherwise parts of the proteins won't have any sequence identity. I guess it is similar to how the PDB clusters its proteins with BLASTClust.
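A minimal sketch of that kind of filter, assuming the search results were written as BLAST+ tabular output with the custom column list `-outfmt "6 qseqid sseqid pident length gaps qstart qend qlen"`. The thresholds, the function name, and the example rows are illustrative assumptions, not real search results. It only checks coverage of the query; covering a large part of the subject as well would need extra columns such as `sstart`, `send` and `slen`.

```python
import csv
import io

# Column order must match the -outfmt string used for the search (assumed here).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "gaps", "qstart", "qend", "qlen"]


def filter_hits(handle, min_ident=20.0, max_ident=30.0, min_qcov=80.0, max_gap_pct=30.0):
    """Return subject IDs whose hit passes the identity, coverage and gap filters."""
    kept = []
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        pident = float(row["pident"])
        # Query coverage of this single alignment, as a percentage of query length.
        qcov = 100.0 * (int(row["qend"]) - int(row["qstart"]) + 1) / int(row["qlen"])
        # Gap columns as a percentage of the alignment length.
        gap_pct = 100.0 * int(row["gaps"]) / int(row["length"])
        if min_ident <= pident <= max_ident and qcov >= min_qcov and gap_pct <= max_gap_pct:
            kept.append(row["sseqid"])
    return kept


# Made-up rows with hypothetical subject IDs.
example = (
    "query\thit1_A\t25.0\t200\t10\t1\t190\t200\n"  # in range, 95% coverage, 5% gaps
    "query\thit2_A\t45.0\t200\t0\t1\t200\t200\n"   # identity too high
    "query\thit3_A\t22.0\t80\t2\t1\t78\t200\n"     # covers only 39% of the query
)
hits = filter_hits(io.StringIO(example))
print(hits)
```

Each row is treated as an independent alignment here, so a subject with several separate alignments to the query is judged on each one individually rather than on their combined coverage.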