On the significance of the overlap between lists of genes
6.8 years ago
elb ▴ 260

Hi guys, I have a fairly conceptual question about the significance of the overlap between two lists of genes. I have list1 and list2, each containing 100 genes. The overlap is 13 genes, and the universe is 17,611 genes (the full set of genes from which the two lists of 100 were derived). If I use Fisher's exact test to calculate the p-value of the overlap, I get p < 1.351e-14. That is strongly significant, and I understand why, given the size of the universe. But relative to the lists of 100 genes, the overlap is quite low: 13%. Should I ultimately consider the overlap significant or not?

Thank you in advance

fisher test overlap significance • 4.0k views

You can also simply draw two random sets of 100 genes from the universe of 17,611 genes and check how often you observe an overlap of 13 or more. This gives you a sense of whether an overlap of 13 genes could plausibly arise by chance.

For example, draw 1000 pairs of random sets of 100 genes and count how often the overlap is 13 or more genes.

Fisher's test does something similar, but drawing random sets gives you a more intuitive sense of the result.
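
The random-draw check described above could be sketched roughly as follows. This is only an illustration: the gene universe is represented by integer indices rather than real gene identifiers, and the function name `simulate_overlaps` and the fixed seed are assumptions made for the example.

```python
import random

UNIVERSE_SIZE = 17_611    # size of the gene universe from the question
LIST_SIZE = 100           # size of each gene list
OBSERVED_OVERLAP = 13     # overlap reported in the question
N_TRIALS = 1000           # number of random pairs of lists to draw


def simulate_overlaps(n_trials, universe_size, list_size, seed=0):
    """Return the overlap size of each random pair of gene lists."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    genes = range(universe_size)   # genes stand in as integer indices
    overlaps = []
    for _ in range(n_trials):
        list_a = set(rng.sample(genes, list_size))
        list_b = set(rng.sample(genes, list_size))
        overlaps.append(len(list_a & list_b))
    return overlaps


overlaps = simulate_overlaps(N_TRIALS, UNIVERSE_SIZE, LIST_SIZE)
hits = sum(o >= OBSERVED_OVERLAP for o in overlaps)
print(f"largest random overlap: {max(overlaps)}")
print(f"trials with overlap >= {OBSERVED_OVERLAP}: {hits} of {N_TRIALS}")
```

Note that with 1000 trials the smallest non-zero empirical frequency is 1/1000, so seeing no random overlap of 13 or more only tells you the p-value is below roughly 0.001. It cannot resolve a value as small as the one Fisher's test reports.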

6.8 years ago
e.rempel ★ 1.1k

Hi,

as you said, the p-value of Fisher's exact test is highly significant. The reason for this, as you also said, is the size of the gene universe.

I have seen people give some thought to the universe: can any random gene from these 17,611 genes end up in either of your lists? Sometimes that is not the case, e.g. because some genes are silenced. But even if you were able to restrict the gene universe to 2000 genes (a very strong restriction, I admit), Fisher's test would still be significant.

Thus, I would say the overlap is significant and there is some connection between your lists.
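
To see how the universe size enters the calculation, here is a minimal sketch of the one-sided test as a hypergeometric tail probability, which is what a one-sided Fisher's exact test on the 2x2 overlap table computes. The function name `overlap_tail_probability` is made up for this example, and the exact value may differ from the p-value quoted in the question depending on which test variant (one-sided or two-sided) was used there.

```python
from math import comb


def overlap_tail_probability(universe, size_a, size_b, overlap):
    """P(X >= overlap) when list B is drawn at random from the universe.

    X is the number of genes from list A that land in list B, which
    follows a hypergeometric distribution.
    """
    total = comb(universe, size_b)   # all possible choices of list B
    upper = min(size_a, size_b)      # largest possible overlap
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total        # exact integers, converted once


# Full universe from the question and the restricted universe of 2000 genes
for universe in (17_611, 2_000):
    expected = 100 * 100 / universe  # mean overlap under random draws
    p = overlap_tail_probability(universe, 100, 100, 13)
    print(f"universe={universe}: expected overlap={expected:.2f}, p={p:.3g}")
```

The expected overlap is 100 * 100 / 17,611, i.e. about 0.57 genes, so an overlap of 13 is far above what random lists would share even though it is only 13% of each list. With a universe of 2000 genes the expected overlap rises to 5.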

6.8 years ago
Martombo ★ 3.1k

As already pointed out by e.rempel, you need to make sure that your comparison is not biased by other variables such as expression level. One way to achieve this is to apply a strong initial filter on expression, which removes part of your universe, and to use a statistical test that doesn't favour highly expressed genes. A test of this kind only tells you that there is some overlap between the two lists; it doesn't quantify how similar they are. Where do these two lists come from? If your aim is to measure how similar two comparisons / states are, you could correlate the effect sizes you get (e.g. fold-changes if they are transcriptional profiles).
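
As a rough sketch of the effect-size idea, one could compute a Pearson correlation over the genes measured in both comparisons. Everything here is hypothetical: the gene names and log2 fold-change values are made-up toy numbers, not data from this thread, and the `pearson` helper is written out only to keep the example self-contained.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical log2 fold-changes per gene from two comparisons (toy values)
lfc_a = {"geneA": 2.1, "geneB": -1.3, "geneC": 0.4, "geneD": 1.7, "geneE": -0.2}
lfc_b = {"geneA": 1.8, "geneB": -0.9, "geneC": 0.1, "geneD": 2.2, "geneF": 0.5}

# Only genes measured in both comparisons can be correlated
shared = sorted(lfc_a.keys() & lfc_b.keys())
r = pearson([lfc_a[g] for g in shared], [lfc_b[g] for g in shared])
print(f"{len(shared)} shared genes, Pearson r = {r:.3f}")
```

Unlike the overlap test, this uses every shared gene rather than a thresholded list, so it gives a graded measure of similarity between the two comparisons.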
