Question

Tool for finding unique sets of proteins

0

10.5 years ago

Woa ★ 2.9k

I have two large sets of proteins (on the order of 100K) and want to find the proteins that are unique to each set.

Is there a tool that can do this quickly?

Thanks

set sequence • 2.2k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Woa ★ 2.9k

score 0 · Answer 1 · 2014-06-01

0

10.5 years ago

Prakki Rama ★ 2.7k

Try running BLAST. If a FILE_A sequence matches a FILE_B sequence at 100% identity from one end to the other, the two are exact matches. You can ignore the sequences that matched; any protein that finds no hit in the other file must be unique.

ADD COMMENT • link 10.5 years ago by Prakki Rama ★ 2.7k
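
The filtering step described in this answer can be sketched in Python. This is a minimal sketch, not the answerer's own script: it assumes BLAST was run with the custom tabular format `-outfmt "6 qseqid sseqid pident length qlen slen"` (a column choice made here for illustration, not something the thread specifies), and the function name `classify_queries` is hypothetical. A query counts as an exact match only when identity reaches the threshold and the alignment spans the full length of both sequences.

```python
from typing import Iterable, List, Tuple


def classify_queries(
    query_ids: Iterable[str],
    blast_lines: Iterable[str],
    min_identity: float = 100.0,
) -> Tuple[List[str], List[str]]:
    """Return (queries with no hit at all, queries with no full-length hit).

    Assumes tab-separated columns: qseqid sseqid pident length qlen slen.
    """
    any_hit = set()
    full_hit = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, length, qlen, slen = line.split("\t")[:6]
        any_hit.add(qseqid)
        # End-to-end: the alignment covers all of the query and all of the subject.
        end_to_end = int(length) == int(qlen) == int(slen)
        if end_to_end and float(pident) >= min_identity:
            full_hit.add(qseqid)
    ids = list(query_ids)
    no_hit = [q for q in ids if q not in any_hit]
    no_full_hit = [q for q in ids if q not in full_hit]
    return no_hit, no_full_hit
```

With the default `min_identity=100.0`, the second list holds the FILE_A proteins that have no exact counterpart in FILE_B. Lowering the threshold gives a looser definition of "unique" from the same BLAST table.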

Ram · Answer 2 · 2014-06-01

0

10.5 years ago

Adrian ▴ 700

BLAST would be doing much more work than is necessary to solve the problem you've posed.

The most popular tools for clustering sequences to find the unique ones are probably CD-HIT and USEARCH.

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Adrian ▴ 700

0

The reason I would not use clustering tools is that they cluster based on input parameters and report as unique whatever fails to meet the criteria. It is hard to choose a similarity cutoff, especially when one does not know how similar the other organism is. BLAST, in contrast, computes the similarity and tabulates the results, so we can pick the sequences that had no hit as unique to that particular file. If desired, the user can still apply cutoffs to the BLAST output file and select the hits of interest. Nonetheless, I would be happy to hear your reasons for choosing clustering techniques.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Prakki Rama ★ 2.7k

0

It depends on what one wants. If, as the question states, one wants to find the unique proteins in each set, then the problem is one of exact clustering. That is going to be very much faster than running the NxN BLAST. I agree that if the questions are more subtle, having the NxN BLAST results to play with could be useful.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by Adrian ▴ 700
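
For the exact-match case discussed in this reply, no alignment is needed at all: each sequence string can be used as a dictionary key and the two key sets compared. The following is a minimal sketch under stated assumptions, not a substitute for CD-HIT or USEARCH. The function names are hypothetical, and it assumes that two proteins count as "the same" when their sequences are identical after upper-casing and removing a trailing `*` stop symbol.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def index_by_sequence(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each distinct normalized sequence to the IDs that carry it."""
    index: Dict[str, List[str]] = {}
    for record_id, seq in parse_fasta(lines):
        key = seq.upper().rstrip("*")  # assumed normalization, see lead-in
        index.setdefault(key, []).append(record_id)
    return index


def unique_to_each(lines_a: Iterable[str], lines_b: Iterable[str]):
    """Return (sequences only in A, sequences only in B), each mapped to its IDs."""
    index_a = index_by_sequence(lines_a)
    index_b = index_by_sequence(lines_b)
    only_a = {s: ids for s, ids in index_a.items() if s not in index_b}
    only_b = {s: ids for s, ids in index_b.items() if s not in index_a}
    return only_a, only_b
```

Open file handles can be passed directly, for example `unique_to_each(open("FILE_A"), open("FILE_B"))`. Each file is read once and every lookup is a hash-table probe, so the work grows roughly linearly with total sequence length rather than with the number of sequence pairs, which is why this scales to sets of around 100K proteins where an all-against-all comparison would not.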

0

Yes, it depends on what one wants. Thanks.

ADD REPLY • link 10.5 years ago by Prakki Rama ★ 2.7k