Diverse sampling from MSA / phylogenetic tree?

18 months ago

Nick ▴ 40

I have a multiple sequence alignment of protein sequences from mmseqs. I wonder whether there are any methods that would let me randomly sample from that space, with the constraint that the sample is somewhat representative of some property (e.g. sequence homology in the database). I could of course cluster the sequences, but that wouldn't give me exact control over the number of sequences I get from the sampling.

I can think of many ways to do this manually: pairwise alignments followed by k-means clustering, computing embeddings and clustering them, or perhaps computing a phylogenetic tree and sampling from it.
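To make the "exact number of sequences" requirement concrete, here is a minimal sketch of one manual option: greedy farthest-point (max-min) sampling on pairwise distances computed directly from the alignment. This is an illustration rather than an established standard. The function names `p_distance` and `farthest_point_sample` and the toy `msa` dictionary are hypothetical, and the distance used (fraction of differing columns) is an assumption that could be swapped for any other distance, such as one derived from embeddings or a tree.

```python
import random


def p_distance(a, b):
    """Fraction of differing columns between two aligned sequences.

    Columns where both sequences have a gap are ignored (an assumption
    made for this sketch; other gap treatments are equally valid).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else 0.0


def farthest_point_sample(seqs, k, seed=0):
    """Return exactly k sequence ids chosen by greedy max-min sampling.

    seqs: dict mapping id -> aligned sequence (hypothetical input format).
    The first pick is random; each later pick is the sequence whose
    distance to its nearest already-chosen sequence is largest.
    """
    ids = list(seqs)
    if not 0 < k <= len(ids):
        raise ValueError("k must be between 1 and the number of sequences")
    rng = random.Random(seed)
    first = rng.choice(ids)
    chosen = [first]
    # Distance from every unchosen sequence to its nearest chosen one.
    min_dist = {i: p_distance(seqs[i], seqs[first]) for i in ids if i != first}
    while len(chosen) < k:
        nxt = max(min_dist, key=min_dist.get)
        chosen.append(nxt)
        del min_dist[nxt]
        for i in min_dist:
            min_dist[i] = min(min_dist[i], p_distance(seqs[i], seqs[nxt]))
    return chosen


# Toy alignment with three loose groups (a*, b*, c*), made up for illustration.
msa = {
    "a1": "MKT-LLVA",
    "a2": "MKT-LLVG",
    "b1": "MRSQLIVA",
    "b2": "MRSQLIVG",
    "c1": "GHTEWWPA",
}
picked = farthest_point_sample(msa, k=3, seed=0)
print(picked)
```

Note the trade-off: this returns exactly `k` sequences and spreads them across the alignment, but it favors outliers and does not reflect how densely each region is populated. It also computes on the order of n·k distances, so it may need a faster distance or prefiltering for large alignments.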

Is there some "gold standard" method? I haven't been able to find one.

clustering proteins sampling alignment msa • 584 views

updated 18 months ago by Jesse ▴ 850 • written 18 months ago by Nick ▴ 40

I had a somewhat similar question last year, though I never came up with a solution I was totally satisfied with. Maybe that could give you ideas, though. What's your end goal with the sampling approach?

18 months ago by Jesse ▴ 850
