I have a large number of nucleotide sequences where many are quite similar to each other and some quite different. I'd like to select a smaller representative set of sequences to use in my analysis that should capture as much of the spread of the full set as possible.
Currently I'm handling this by tabulating pairwise comparisons and selecting a subset of sequences where all sequences in the complete set are no more than X% away from one or more of those in my subset. I do that by starting with an arbitrary sequence, excluding everything close to it, and repeating the whole process with the farthest sequence away as the next start sequence. It roughly works but changing the starting point can give more or less efficient subsets. The first part of my question is, is there a smarter way to go about this to try to minimize the number of representatives needed for a particular identity threshold?
# What I'm using in R
# idents is data frame of query/ref/identity columns with all comparisons
# start is a query to start with
pick_reps <- function(idents, start, thresh=0.99) {
excludes <- subset(idents, ref == start & identity >= thresh)$query
chunk <- subset(idents, ! query %in% excludes)
if (nrow(chunk)) {
farthest <- chunk$query[chunk$identity == min(chunk$identity)][1]
return(c(start, pick_reps(chunk, farthest, thresh)))
} else {
return(start)
}
}
Where that idents table is something like:
query ref identity
00000014 00000014 1.0
00000014 00000018 0.9703264
00000014 00000031 0.9732938
...
The second part of my question might be, should I do something else entirely? In case this is an XY problem sort of situation: My end goal is a phylogenetic tree of related antibody sequences, but the IgPhyML algorithm is computationally intensive, and when I go from hundreds to thousands of input sequences it bogs down tremendously. Since many of the sequences are very close to each other I can speed things up by leaving many of those out. IgPhyML's suggested methods to improve performance are to either exclude sequences by depth (but I'd rather not, since low-abundance sequences could still be crucial to see how a lineage developed) or run multithreaded (but that doesn't apply to building individual trees like I have here). So I'm trying to find the best way to run this faster while still getting as realistic an impression of the lineage as possible.