I have a multiple sequence alignment of protein sequences from mmseqs. I wonder if there are any methods that would allow me to randomly sample from that space, with the constraint that the sample stays somewhat representative of some property (e.g. sequence homology in the database). I could of course cluster the sequences, but that wouldn't give me exact control over the number of sequences I get from the sampling.
I can think of many ways to do this manually: pairwise alignments followed by clustering with k-means, computing embeddings and clustering those, or maybe computing a phylogenetic tree and sampling from it.
Is there some "gold standard" method? I haven't been able to find one.
I had a somewhat similar question last year, though I never came up with a solution I was totally satisfied with. Maybe that could give you ideas, though. What's your end goal with the sampling approach?
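For what it's worth, one way to keep clustering and still get an exact sample size is stratified sampling: cluster first, then split the requested number of draws across clusters in proportion to their size and sample at random within each cluster. This is not a gold standard, just a minimal sketch. The function name `stratified_sample` and the `cluster_of` mapping (sequence ID to cluster label) are hypothetical; the labels are assumed to come from whatever clustering you already have, for example the representative/member table written by an mmseqs clustering run.

```python
import random
from collections import defaultdict


def stratified_sample(cluster_of, k, seed=0):
    """Draw exactly k sequence IDs, giving each cluster a share of the
    draws proportional to its size (largest-remainder rounding).

    cluster_of: dict mapping sequence ID -> cluster label (assumed input).
    """
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    n = len(cluster_of)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of sequences")
    if k == 0:
        return []

    # Integer arithmetic: the ideal quota is k * size / n, so the floor is
    # (k * size) // n and the fractional part is ((k * size) % n) / n.
    alloc = {c: (k * len(ids)) // n for c, ids in members.items()}
    remainder = {c: (k * len(ids)) % n for c, ids in members.items()}

    # Hand the draws still missing to the clusters with the largest remainders.
    leftover = k - sum(alloc.values())
    for c in sorted(members, key=remainder.get, reverse=True)[:leftover]:
        alloc[c] += 1

    # Sample without replacement inside each cluster.
    rng = random.Random(seed)
    sample = []
    for c, ids in members.items():
        sample.extend(rng.sample(sorted(ids), alloc[c]))
    return sample


# Toy example: 6 sequences in cluster A, 3 in B, 1 in C; ask for 5.
toy = {f"a{i}": "A" for i in range(6)}
toy.update({f"b{i}": "B" for i in range(3)})
toy["c0"] = "C"
picked = stratified_sample(toy, k=5)
print(picked)
```

With proportional allocation, very small clusters can end up with zero draws, so the sample mirrors the database composition rather than maximising diversity. If you care more about covering rare families, you could instead give every cluster at least one draw and distribute the rest proportionally.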