8.9 years ago
r.follador
This is probably more of a CS question than a biology one:
Given a set of m protein sequences, I want to select n candidates from this set (n is a given number) that maximize the diversity.
I would probably start with a distance matrix (made with clustalo). I then want to choose my n candidates such that the total sum of the distances from each candidate to every other candidate is maximized.
The goal is to get a subset of the m protein sequences that is still more or less representative of the diversity of the full set.
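To make the objective concrete, here is a minimal sketch of what I have in mind. It assumes the distance matrix has already been parsed into a symmetric list of lists called `dist`, and it reads "every other candidate" as "every other selected candidate". The function names `subset_score` and `greedy_max_sum` are hypothetical, made up for illustration. The greedy selection is only a simple heuristic and is not guaranteed to find the optimal subset.

```python
from itertools import combinations


def subset_score(dist, subset):
    """Sum of pairwise distances between the members of `subset`."""
    return sum(dist[i][j] for i, j in combinations(subset, 2))


def greedy_max_sum(dist, n):
    """Greedy heuristic for picking n diverse items (not guaranteed optimal).

    Seeds the selection with the most distant pair, then repeatedly adds the
    item whose summed distance to the already chosen items is largest.
    """
    m = len(dist)
    if not 2 <= n <= m:
        raise ValueError("n must satisfy 2 <= n <= number of sequences")

    # Seed: the pair of sequences with the largest distance.
    first, second = max(combinations(range(m), 2),
                        key=lambda pair: dist[pair[0]][pair[1]])
    chosen = [first, second]

    while len(chosen) < n:
        remaining = [i for i in range(m) if i not in chosen]
        # Candidate that adds the most to the total pairwise distance.
        best = max(remaining, key=lambda i: sum(dist[i][j] for j in chosen))
        chosen.append(best)

    return sorted(chosen)


# Toy 4x4 distance matrix (made-up values, not real clustalo output).
toy_dist = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
selected = greedy_max_sum(toy_dist, 3)
print(selected, subset_score(toy_dist, selected))
```

For a small m, the result can be checked against brute force by scoring every combination of size n with `subset_score`, but that does not scale to large sets.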
What approach would you suggest?