Extremely large difference between the two sample sizes in a Mann–Whitney U test
9.8 years ago
billzt ▴ 20

I need to compare the derived allele frequency spectrum of the mutations I am studying with that of synonymous SNPs. I have only 14 studied mutations, whereas there are up to 1 million genome-wide synonymous SNPs, so the two sample sizes are very different. Applying the Mann-Whitney U test directly to them gives, unsurprisingly, no significant difference. How can I tell whether the non-significant result is due to the small sample size or because the two samples truly do not differ?

sample-size SNP Mann-Whitney-U-test • 6.4k views
9.8 years ago

You could perform permutation-style resampling: select 14 SNPs at random from the 1 million, say 10,000 times, and build a histogram of the mean or median allele frequency of each random set. The number of random sets whose mean or median is at least as large as that of your 14 SNPs, divided by 10,000, is the (one-sided) P-value.
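The resampling idea can be sketched in R roughly as follows. This is only a sketch: `background_daf` and `studied_daf` are hypothetical names, and both vectors are filled with simulated stand-in values that should be replaced by the real derived allele frequencies. The median is used as the summary statistic, and the `+ 1` correction is an added convention so that the empirical P-value is never exactly zero.

```r
# Hypothetical stand-in data; replace with the real derived allele frequencies.
set.seed(42)
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
studied_daf    <- rbeta(14, 1, 2)     # stand-in for the 14 studied mutations

n_draws  <- 10000
observed <- median(studied_daf)

# Median DAF of 14 SNPs drawn at random from the background, repeated n_draws times
null_medians <- replicate(
  n_draws,
  median(sample(background_daf, size = length(studied_daf)))
)

# One-sided empirical P-values; the +1 keeps them strictly above zero
p_upper <- (sum(null_medians >= observed) + 1) / (n_draws + 1)
p_lower <- (sum(null_medians <= observed) + 1) / (n_draws + 1)

# Two-sided version, capped at 1
p_two_sided <- min(1, 2 * min(p_upper, p_lower))

c(upper = p_upper, lower = p_lower, two_sided = p_two_sided)
```

Calling `hist(null_medians, breaks = 50)` followed by `abline(v = observed, col = "red")` shows where the observed median falls within the resampled distribution.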

You can also build the allele frequency distribution for all the SNPs and see where the allele frequencies of your 14 SNPs fall relative to it.
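One possible way to locate the 14 values within the background distribution is the empirical cumulative distribution function. The sketch below assumes the same hypothetical vectors `background_daf` and `studied_daf`, again filled with simulated stand-in values rather than real data.

```r
# Hypothetical stand-in data; replace with the real derived allele frequencies.
set.seed(7)
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
studied_daf    <- rbeta(14, 1, 2)     # stand-in for the 14 studied mutations

# Empirical CDF of the background allele frequencies
background_cdf <- ecdf(background_daf)

# Fraction of background SNPs with a DAF at or below each studied mutation
percentiles <- background_cdf(studied_daf)
round(sort(percentiles), 3)

# Reference quantiles of the background for comparison
quantile(background_daf, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
```

If most of the 14 percentiles cluster near 0 or near 1, the studied mutations sit in a tail of the background distribution; if they are spread across the whole range, they look like typical background SNPs.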

Visualizing allele frequency distributions could give you some insight on what is happening in your data.


Thank you, I'll try it.

9.8 years ago

"Therefore these two sample size are largely different and directly applied Mann-Whitney U test to them is of course no significant difference."

I don't think the lack of a significant difference is due to the difference in sample sizes; why should that be the case? (In fact, I don't see the point of resampling SNPs from the large set.) Rather, the small dataset reduces power so much that the difference you see is not significant.

"How do I know that the non-significant result is due to small sample size or due to that these two samples are truly no difference?"

These are two sides of the same coin. The difference you observe is not significant because the sample size is not large enough. With huge sample sizes, even tiny differences would produce very small p-values; in that case the question would be "Is this difference biologically meaningful?"

This is to illustrate the point. Produce two sets differing by a small amount. The difference is highly significant because the sample sizes are large. If you downsample one set, the difference is no longer significant:

set.seed(1)
set1 <- rbeta(n = 10000, 10, 10)

set.seed(2)
set2 <- rbeta(n = 10000, 10, 9.5)

The difference between set1 and set2 is significant even though it is small:

mean(set1); mean(set2)
# [1] 0.5006102
# [1] 0.5112549
wilcox.test(set1, set2)
# p-value = 1.023e-11

# Now reduce one set to 14 obs:
set.seed(3)
wilcox.test(sample(set1, size = 14), set2)
# p-value = 0.4413
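
A single downsample gives only one p-value. To put a rough number on the loss of power, the downsampling can be repeated many times and the fraction of significant results counted. The following is a sketch that reuses the simulated sets from above; `n_rep` and the 0.05 threshold are arbitrary choices made for illustration.

```r
# Same simulated sets as in the example above
set.seed(1)
set1 <- rbeta(n = 10000, 10, 10)
set.seed(2)
set2 <- rbeta(n = 10000, 10, 9.5)

# Repeat the downsampling to 14 observations many times
set.seed(3)
n_rep <- 1000
p_values <- replicate(
  n_rep,
  wilcox.test(sample(set1, size = 14), set2)$p.value
)

# Fraction of downsamples that reach p < 0.05: an estimate of power at n = 14
power_at_14 <- mean(p_values < 0.05)
power_at_14
```

With an effect this small, `power_at_14` is expected to be low, meaning that most sets of 14 observations would miss a difference that the full sets detect easily.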
9.8 years ago
Asaf 10k

You can select a small random set of SNPs from the million and run them against your test set.


Just be sure to repeat this random selection to see how variable it is. Set up a loop to repeat 100 times and make a histogram of the results. Then set it to 10,000 and go to lunch.
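
A loop of this kind might look like the sketch below. The vectors `background_daf` and `studied_daf` are hypothetical names filled with simulated stand-in values, and `subset_size` is an arbitrary choice, since the thread does not say how small the random set should be.

```r
# Hypothetical stand-in data; replace with the real derived allele frequencies.
set.seed(11)
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
studied_daf    <- rbeta(14, 1, 2)     # stand-in for the 14 studied mutations

n_rep       <- 100   # start small, then raise to 10000
subset_size <- 1000  # arbitrary size of each random background subset

p_values <- numeric(n_rep)
for (i in seq_len(n_rep)) {
  # Draw a fresh random subset of the background and test it against the 14
  background_subset <- sample(background_daf, size = subset_size)
  p_values[i] <- wilcox.test(studied_daf, background_subset)$p.value
}

# How much the result varies from one random subset to the next
summary(p_values)
```

Running `hist(p_values, breaks = 20)` afterwards gives the histogram of results; a wide spread indicates that any single random subset is not a reliable basis for a conclusion.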


Thank you. I'll try it

