I posted this on Cross Validated and couldn't get an answer, so I'm hoping someone here might have an idea.
I'm looking at amino acid distributions in multiply aligned sequences. We've sorted the patients into two groups (or 'unassignable') based on clinical parameters, and we're looking for regions where the sequence distributions differ between the groups.
There are ~600 sequences in my dataset. Group-1 has ~200 sequences and Group-2 has ~40.
My current method is a permutation test. I take all the letters from the entire set of sequences and shuffle them, then put the first ~200 into Group-1 and the next ~40 into Group-2. I calculate the letter distribution in each shuffled group and the Euclidean distance between the two distributions. After ~10,000 shuffles I find the fraction of shuffles that give a distance larger than the observed distance.
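To make the procedure concrete, here is a minimal sketch of what I described, assuming the test is run on one alignment column at a time and that each group's letters at that column are passed in as a string. The function name `permutation_test` and its arguments are just for illustration. Two details are my own choices rather than part of the description above: ties are counted with `>=`, and 1 is added to the numerator and denominator so the estimated p-value is never exactly zero.

```python
import numpy as np


def letter_distribution(codes, n_symbols):
    """Relative frequency of each integer-coded letter."""
    counts = np.bincount(codes, minlength=n_symbols)
    return counts / counts.sum()


def euclidean(p, q):
    """Euclidean distance between two frequency vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


def permutation_test(group1, group2, unassigned="", n_perm=10_000, seed=0):
    """Permutation test on the letters of one alignment column.

    group1, group2, unassigned: strings (or lists) of single letters,
    one letter per sequence. Returns (observed_distance, p_value).
    """
    n1, n2 = len(group1), len(group2)
    pool = list(group1) + list(group2) + list(unassigned)

    # Map letters to integer codes so counting is a cheap bincount.
    symbols, codes = np.unique(pool, return_inverse=True)
    k = len(symbols)

    def distance(c):
        # First n1 letters play the role of Group-1, the next n2 of Group-2.
        return euclidean(
            letter_distribution(c[:n1], k),
            letter_distribution(c[n1:n1 + n2], k),
        )

    observed = distance(codes)

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        if distance(rng.permutation(codes)) >= observed:
            hits += 1

    # Add-one correction (an assumption, see the note above).
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value
```

Calling `permutation_test(g1_letters, g2_letters, other_letters)` for each column would give one distance and one p-value per position. Since many positions are tested, some multiple-testing correction would presumably be needed on top of this.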
Obviously this is not my ideal method. I don't think the Euclidean distance is the best choice of statistic, but I couldn't think of a better one. Any ideas on that front would be welcome too.
Thanks! That worked perfectly!