Question

Proteins In Sequence Identity Range

2

13.1 years ago

Stef ▴ 50

Hello everyone. I'm still kind of new to the field, but for a project I have been trying to get two things.

1. I need a list of proteins (PDB IDs) that all have sequence identity in a specific range (e.g. 20-30%) to one other protein. That is, I will have one protein and a list of other proteins that have 20-30% identity to it. Until now I have been using psi-cd-hit to get clusters at over 20% identity and then removing from these clusters all proteins with over 30% identity. Unfortunately, this gives me very small clusters (~10 proteins). So I was thinking of running a PSI-BLAST scan with one protein against pdb_nr and taking from the results all hits whose % identity is in the range and that also cover, for example, >80% of the query (and maybe have less than 30% gaps?).
2. I need a list of proteins that have sequence identity in a specific range (e.g. 20-30%) to each other. That is, all proteins in the group have pairwise 20-30% identity to each other.

The difference between 1 and 2 is that in 1 I don't care what percent identity the proteins in the list have to each other (as long as it isn't 100%, since those would be duplicates).

I wanted to know if there are any tools for finding these proteins. Any help would be appreciated.

Thanks in advance

sequence database • 3.7k views

ADD COMMENT • link updated 13.1 years ago by Bilouweb ★ 1.1k • written 13.1 years ago by Stef ▴ 50

0

Is the 20-30% identity rule you have imposed to be measured across the entire length of the query or only across the returned alignment? This is important as it dictates use of a local or global alignment tool.

ADD REPLY • link 13.1 years ago by Larry_Parnell 16k

0

Hm, considering that I want different subparts of the protein sequence to still have 20-30% identity to the query, and not just the sequence as a whole, I would say local alignment. But it should be a local alignment that covers a large percentage of both proteins. It also shouldn't contain too many gaps, otherwise parts of the proteins won't have any sequence identity. I guess it would be similar to how the PDB clusters its proteins with BLASTClust.

ADD REPLY • link 13.1 years ago by Stef ▴ 50

score 1 · Answer 1 · 2011-11-10

For part 1 I would suggest parsing the PSI-BLAST output rather than using CD-HIT. In my experience CD-HIT doesn't find anywhere near as many similarities as PSI-BLAST, especially for similarity thresholds below about 40%. I've tried UCLUST as well, but it doesn't seem to work much better than CD-HIT for the percentages you're interested in.
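A minimal sketch of that parsing step, assuming the search was run with BLAST+ tabular output (`-outfmt 6`) in its default column order and that the query length is known. The function name `filter_hits`, the thresholds, and the example hits are hypothetical and only for illustration. Query coverage is estimated from the aligned query span of the first HSP reported for each subject, which is a simplification; with several PSI-BLAST iterations in one file, only the first occurrence of each subject is used.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, query_length, min_ident=20.0, max_ident=30.0, min_cov=0.8):
    """Return {subject id: (percent identity, query coverage)} for hits in range.

    Only the first HSP seen for each subject is considered.
    """
    kept = {}
    seen = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comments, and non-tabular lines such as
        # the "Search has CONVERGED!" message.
        if len(row) < len(COLUMNS) or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        subject = hit["sseqid"]
        if subject in seen:
            continue
        seen.add(subject)
        pident = float(hit["pident"])
        coverage = (int(hit["qend"]) - int(hit["qstart"]) + 1) / query_length
        if min_ident <= pident <= max_ident and coverage >= min_cov:
            kept[subject] = (pident, coverage)
    return kept


# Made-up hits for a hypothetical 100-residue query.
example = "\n".join("\t".join(fields) for fields in [
    ["query", "1ABC_A", "25.0", "90", "66", "2", "5", "94", "3", "92", "1e-10", "80.1"],
    ["query", "2DEF_B", "45.0", "95", "52", "0", "1", "95", "1", "95", "1e-30", "150"],
    ["query", "3GHI_A", "22.0", "40", "31", "0", "10", "49", "7", "46", "0.001", "35.2"],
    ["query", "1ABC_A", "28.0", "25", "18", "0", "60", "84", "100", "124", "0.5", "22.0"],
    ["Search has CONVERGED!"],
])

hits = filter_hits(io.StringIO(example), query_length=100)
print(hits)
```

Here only `1ABC_A` survives: `2DEF_B` is above 30% identity and `3GHI_A` covers only 40% of the query. A gap filter could be added the same way, but the default `gapopen` column counts gap openings rather than gapped positions, so it would need an extra output column.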

Simon

score 0 · Answer 2 · 2011-10-14

Your approach to part 1 seems valid to me. A different query may produce a cluster of far fewer or many more than the ten you've seen for your particular query. I know of no specific tools that can take a query as input and yield a set of proteins within a user-defined range of percent identity. That said, I'd check to see if this can be accomplished with Galaxy.

Part 2 is, of course, essentially a grid of all the 20-30% hits from query 1 against all those same hits. My feeling is this will fail to satisfy your criterion that all proteins in the group have identity to each other within that narrow range.
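One way to check such a grid, sketched under the assumption that the pairwise percent identities have already been computed (for example by all-against-all alignment of the part 1 hits): treat each protein as a node, connect two proteins when their identity falls in the range, and look for groups in which every pair is connected (cliques). The identities below are made up, and the function names are hypothetical. The search is a plain Bron-Kerbosch enumeration, which is exponential in the worst case and so only suitable for small sets.

```python
def in_range_graph(identities, lo=20.0, hi=30.0):
    """Build an adjacency map linking proteins whose identity is within [lo, hi].

    Pairs missing from `identities` are treated as not in range.
    """
    graph = {}
    for (a, b), pident in identities.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if lo <= pident <= hi:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


# Made-up pairwise percent identities for four hypothetical proteins.
identities = {
    ("A", "B"): 25.0, ("A", "C"): 22.0, ("B", "C"): 28.0,
    ("A", "D"): 24.0, ("B", "D"): 55.0, ("C", "D"): 10.0,
}

groups = maximal_cliques(in_range_graph(identities))
largest = max(groups, key=len)
print(sorted(largest))
```

In this toy example `D` is in range of `A` but not of `B` or `C`, so the largest group with all pairs in range is A, B, C. With real data the largest such group may well be small, which is consistent with the concern above.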

Good luck!

score 0 · Answer 3 · 2011-11-10

You will not find a tool that directly gives you the solution to your problem. But a little bit of scripting and some basic tools will help.

The PDB is a small sequence database (compared to UniProt or others). So I think it is a good idea to run PSI-BLAST on a bigger database to get a bigger set of sequences.

I often run PSI-BLAST against the nr database (on the NCBI website) and then find clusters with UCLUST. I am quite satisfied with this procedure.

Nevertheless, using PSI-BLAST does not guarantee a bigger set of sequences. You just search through a bigger database. Now, it depends on what you want to do with the sequences. Obviously, 10 sequences are not sufficient to produce strong statistics ;)