11.6 years ago
Pappu
I want to remove sequences that share >90% sequence identity with another sequence, keeping the longer one of each pair. Is there a tool for that?
Say A and B have 90% identity and B is longer; B and C have 90% identity and C is longer. Do you want to remove both A and B?
Exactly, I want to remove all the subsets of sequences with >90% identity.
I was not clear: in the example above, A and C do not have 90% identity. Because B has been thrown away, you might think A should be kept, since it is not within 90% identity of any other chosen sequence. Do you still want to remove A? If you do, that is single-linkage clustering, or equivalently finding the connected components of a graph. You can find the algorithm on Wikipedia and in many other places. It is pretty simple and should be achievable in <50 lines of Perl.
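To make the connected-components idea concrete, here is a minimal sketch in Python (rather than Perl), under some assumptions: the pairs above the identity threshold have already been computed by some all-vs-all comparison, and the sequence lengths are known. The function name `longest_per_cluster` and its two inputs are hypothetical, chosen for illustration only. It uses a small union-find to merge linked sequences, then keeps the longest member of each component.

```python
from collections import defaultdict


def longest_per_cluster(lengths, similar_pairs):
    """Keep the longest sequence of each single-linkage cluster.

    lengths:       dict mapping sequence id -> sequence length (assumed input)
    similar_pairs: iterable of (id_a, id_b) pairs whose identity exceeds the
                   threshold, e.g. >90% (assumed to be precomputed elsewhere)
    """
    # Every sequence starts as its own cluster.
    parent = {seq_id: seq_id for seq_id in lengths}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each similar pair is an edge in the graph; merge the two clusters.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group the ids by cluster root (one group per connected component).
    clusters = defaultdict(list)
    for seq_id in lengths:
        clusters[find(seq_id)].append(seq_id)

    # Pick the longest member per cluster; ties are broken by id so the
    # result is deterministic.
    return sorted(
        max(members, key=lambda s: (lengths[s], s))
        for members in clusters.values()
    )


# The example from the thread: A~B and B~C are linked, A~C is not.
# D is unrelated to everything else.
lengths = {"A": 100, "B": 120, "C": 150, "D": 80}
pairs = [("A", "B"), ("B", "C")]
print(longest_per_cluster(lengths, pairs))  # prints ['C', 'D']
```

Here A is removed even though it is not within 90% identity of C, because A and C are connected through B. If A should be kept in that situation, a greedy approach (sort by length, keep a sequence only if it is not similar to any already kept sequence) would be needed instead.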