Question

Likelihood Test For Dividing An Amino Acid Distribution Into Two Separate Groups

12.7 years ago

Will 4.6k

I posted this on Cross Validated and couldn't get an answer, so I'm hoping someone here might have an idea:

I'm looking at amino acid distributions in multiple-aligned sequences. We've assigned the patients to one of two groups (or 'unassignable') based on clinical parameters, and we're looking for regions where the sequence distributions differ between the two groups.

There are ~600 sequences in my dataset: Group-1 has ~200 sequences and Group-2 has ~40.

My current method is a permutation test. I take all the letters from the entire set of sequences and shuffle them, then put the first ~200 into Group-1 and the next ~40 into Group-2. I calculate the distribution in each group and the Euclidean distance between the two distributions. After ~10,000 shufflings, I take the fraction of shuffles that gave a distance larger than the observed distance as the probability of seeing such a distance by chance.
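For concreteness, the procedure above might look roughly like the following sketch. It assumes the letters come from a single alignment column and that the unassignable sequences stay in the shuffled pool; the function names, the `>=` comparison, and the +1 correction in the p-value are illustrative choices, not part of the original description.

```python
import math
import random
from collections import Counter

def freq_vector(letters, alphabet):
    """Relative frequency of each alphabet letter in `letters`."""
    counts = Counter(letters)
    n = len(letters)
    return [counts[a] / n for a in alphabet]

def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def column_permutation_pvalue(group1, group2, unassigned=(), n_perm=10000, seed=0):
    """Shuffle the pooled letters and re-split them into groups of the original sizes."""
    rng = random.Random(seed)
    pool = list(group1) + list(group2) + list(unassigned)
    alphabet = sorted(set(pool))
    n1, n2 = len(group1), len(group2)
    observed = euclidean(freq_vector(group1, alphabet),
                         freq_vector(group2, alphabet))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        d = euclidean(freq_vector(pool[:n1], alphabet),
                      freq_vector(pool[n1:n1 + n2], alphabet))
        if d >= observed:
            hits += 1
    # The +1 in numerator and denominator (an assumption here) avoids a p-value of exactly 0.
    return observed, (hits + 1) / (n_perm + 1)
```

Here `group1` and `group2` would be the letters observed at one column in each group, for example `column_permutation_pvalue("AAAV", "VV", unassigned="AV")`.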

Obviously this is not my ideal method. I don't think the Euclidean distance is the best choice, but I couldn't think of a better one. Any ideas on that front would be welcome too.

sequence statistics • 2.6k views

updated 12.7 years ago by matted 7.8k • written 12.7 years ago by Will 4.6k

score 1 · Answer 1 · 2012-08-22

How about a likelihood ratio test where you assume the amino acid counts are multinomially distributed?

You have two hypotheses: the two groups share the same set of parameters, or the two groups have two separate sets of parameters. You estimate both sets of parameters by maximum likelihood, which for the multinomial distribution is just counting (each amino acid's estimated probability is its observed frequency). Then you compute the ratio of the likelihoods of the data under the two hypotheses.
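As a rough sketch of that test for a single alignment column: with the maximum-likelihood estimates plugged in, twice the log of the likelihood ratio reduces to 2 * sum(O * ln(O / E)), where O is an observed count and E is the count expected under the shared-parameter hypothesis. The function name `multinomial_lrt` and the input layout (one string or list of letters per group) are assumptions made for illustration.

```python
import math
from collections import Counter

def multinomial_lrt(group1, group2):
    """Log-likelihood ratio statistic for 'shared' vs 'separate' multinomial parameters.

    group1, group2: the letters observed at one column in each group.
    Returns (statistic, degrees_of_freedom).
    """
    c1, c2 = Counter(group1), Counter(group2)
    n1, n2 = len(group1), len(group2)
    pooled = c1 + c2              # counts under the shared-parameter hypothesis
    n = n1 + n2
    log_ratio = 0.0
    for counts, n_g in ((c1, n1), (c2, n2)):
        for letter, obs in counts.items():     # only letters with obs > 0 appear
            expected = n_g * pooled[letter] / n
            log_ratio += obs * math.log(obs / expected)
    # Separate model: 2 * (k - 1) free parameters; shared model: k - 1.
    df = len(pooled) - 1
    return 2.0 * log_ratio, df
```

For large samples the statistic is approximately chi-square distributed with the returned degrees of freedom, but with ~40 sequences in the smaller group and 20 possible amino acids many expected counts will be small, so that approximation may be poor.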

You can also do some permutations to establish a more reliable significance threshold that accounts for the correlation structure in your data. You could shuffle the group labels, keeping the individual sequences unchanged, and compute the test statistic in each case. Then you can compare the actual test statistic to this empirical null distribution.
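A minimal sketch of the label-shuffling idea, under some assumptions: `sequences` holds only the aligned sequences that were assigned to a group, `labels` holds 1 or 2 for each of them, gap characters are counted like any other letter, and the same shuffled labels are reused for every column so that whole sequences stay intact. All names are hypothetical.

```python
import math
import random
from collections import Counter

def g_statistic(letters1, letters2):
    """2 * log-likelihood ratio for two groups of letters (multinomial model)."""
    c1, c2 = Counter(letters1), Counter(letters2)
    n1, n2 = len(letters1), len(letters2)
    pooled, n = c1 + c2, n1 + n2
    total = 0.0
    for counts, n_g in ((c1, n1), (c2, n2)):
        for letter, obs in counts.items():
            total += obs * math.log(obs * n / (n_g * pooled[letter]))
    return 2.0 * total

def column_stats(sequences, labels):
    """Test statistic at every alignment column for a given labelling."""
    g1 = [s for s, lab in zip(sequences, labels) if lab == 1]
    g2 = [s for s, lab in zip(sequences, labels) if lab == 2]
    n_cols = len(sequences[0])
    return [g_statistic([s[j] for s in g1], [s[j] for s in g2])
            for j in range(n_cols)]

def label_permutation_pvalues(sequences, labels, n_perm=1000, seed=0):
    """Empirical per-column p-values from shuffling group labels across sequences."""
    rng = random.Random(seed)
    observed = column_stats(sequences, labels)
    hits = [0] * len(observed)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)      # sequences are untouched; only labels move
        for j, stat in enumerate(column_stats(sequences, shuffled)):
            if stat >= observed[j] - 1e-12:   # small tolerance for float rounding
                hits[j] += 1
    return [(h + 1) / (n_perm + 1) for h in hits]
```

Because each shuffle relabels entire sequences, any correlation between columns is preserved in the null distribution, which is the point of shuffling labels rather than individual letters.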