Question

On the significance of the overlap between lists of genes


7.1 years ago

elb ▴ 260

Hi guys, I have a rather conceptual question about the significance of the overlap between two lists of genes. I have list1 and list2, each containing 100 genes. The overlap is 13 genes and the universe is 17,611 genes (the full set of genes from which the two lists of 100 were derived). If I run Fisher's exact test to calculate the p-value of the overlap, I get p < 1.351e-14. That is strongly significant, and I understand why given the size of the universe. But relative to the lists of 100 genes, the overlap is quite low: 13%. In the end, should I consider the overlap significant or not?

Thank you in advance

fisher test overlap significance • 4.3k views

updated 7.1 years ago by e.rempel ★ 1.1k • written 7.1 years ago by elb ▴ 260


You can also simply draw two random sets of 100 genes from the universe of 17,611 and check how often you observe an overlap of 13 or more. This gives you a sense of whether an overlap of 13 genes could have arisen by chance.

You can draw 1000 random pairs of 100-gene sets and count how often the overlap is 13 or more genes.

Fisher's test is doing something similar, but drawing random sets gives you a better feel for it.
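A minimal sketch of that simulation, using only the Python standard library. The function name, the seed and the integer stand-ins for gene IDs are my own choices for illustration, not something from the original analysis.

```python
import random

def simulate_overlap(universe_size=17611, list_size=100, observed=13,
                     n_trials=1000, seed=0):
    """Draw pairs of random gene sets and record the size of each overlap."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    universe = range(universe_size)    # integers stand in for gene IDs
    overlaps = []
    for _ in range(n_trials):
        list1 = set(rng.sample(universe, list_size))
        list2 = set(rng.sample(universe, list_size))
        overlaps.append(len(list1 & list2))
    # Number of trials whose overlap is at least as large as the observed one
    hits = sum(1 for o in overlaps if o >= observed)
    return hits, overlaps

hits, overlaps = simulate_overlap()
mean_overlap = sum(overlaps) / len(overlaps)
# Add-one correction so the empirical p-value is never exactly zero
empirical_p = (hits + 1) / (len(overlaps) + 1)

print("mean overlap by chance:", mean_overlap)
print("largest overlap seen:", max(overlaps))
print("trials with overlap >= 13:", hits)
print("empirical p-value (upper bound):", empirical_p)
```

Keep in mind that 1000 trials can only tell you that p is below roughly 1/1000. The simulation cannot resolve a p-value as small as the one Fisher's test reports, but it does show how small the overlap between random lists typically is.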

7.1 years ago by GouthamAtla 12k

score 1 · Answer 1 · 2018-03-06

Hi,

as you said, the p-value of Fisher's exact test is highly significant. The reason for this, as you have also said, is the size of the gene universe.

I have seen some people give careful thought to the universe: can any random gene from these 17,611 genes end up in either of your lists? Sometimes that is not the case, e.g. because some genes are silenced. But even if you were able to restrict the gene universe to 2000 genes (a very strong restriction, I admit), Fisher's test would still be significant.
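To make the effect of the universe size concrete, here is a small sketch that computes the one-sided overlap p-value directly from the hypergeometric distribution, which is what a one-sided Fisher's exact test on the 2x2 table amounts to. The function name is hypothetical, and I am assuming the one-sided ("overlap of 13 or more") version of the test.

```python
from math import comb

def overlap_pvalue(universe_size, size_a, size_b, observed):
    """P(overlap >= observed) when size_b genes are drawn at random from a
    universe that contains size_a 'marked' genes (hypergeometric upper tail)."""
    total = comb(universe_size, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe_size - size_a, size_b - k)
        for k in range(observed, upper + 1)
    )
    return tail / total  # exact integers, converted to float only at the end

for universe_size in (17611, 2000):
    # Overlap expected by chance if the two lists were unrelated
    expected = 100 * 100 / universe_size
    p = overlap_pvalue(universe_size, 100, 100, 13)
    print(f"universe={universe_size}: expected overlap={expected:.2f}, "
          f"fold enrichment={13 / expected:.1f}, p={p:.3g}")

p_full = overlap_pvalue(17611, 100, 100, 13)
p_small = overlap_pvalue(2000, 100, 100, 13)
expected_full = 100 * 100 / 17611
```

With the full universe, the overlap expected by chance is well below one gene, so 13 shared genes is a large enrichment even though it is only 13% of each list. With the universe cut to 2000 genes, the expected overlap rises to 5 and the p-value grows by many orders of magnitude, yet it stays below the usual 0.05 threshold.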

Thus, I would say the overlap is significant and there is some connection between your lists.

score 1 · Answer 2 · 2018-03-06

As already pointed out by e.rempel, you need to make sure that your comparison is not biased by other variables such as expression level. One way to achieve this is to apply a strong initial filter on expression, which removes a number of genes from your universe, and to use a statistical test that doesn't favour highly expressed genes. This kind of test then only tells you that there is some overlap between the two lists; it doesn't actually quantify it. Where do these two lists come from? If your aim is to measure how similar two comparisons / states are, you could correlate the effect sizes you get (e.g. fold-changes if they are transcriptional profiles).
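As a rough sketch of the effect-size idea: take the genes measured in both comparisons and correlate their log2 fold-changes. The gene names and values below are made-up toy numbers purely for illustration, and the plain Pearson correlation is just one possible choice (a rank correlation would work along the same lines).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical log2 fold-changes from two comparisons (toy values, not real data)
lfc_a = {"gene1": 2.0, "gene2": -1.0, "gene3": 0.5, "gene4": 1.5, "gene5": -0.5}
lfc_b = {"gene1": 1.8, "gene2": -0.7, "gene3": 0.1, "gene4": 1.1,
         "gene5": -0.9, "gene6": 0.3}

# Only genes present in both comparisons can be compared
shared = sorted(set(lfc_a) & set(lfc_b))
r = pearson([lfc_a[g] for g in shared], [lfc_b[g] for g in shared])

print("genes compared:", len(shared))
print("correlation of fold-changes:", round(r, 3))
```

Unlike the overlap test, this uses all shared genes rather than two thresholded lists, so it gives a measure of how similar the two comparisons are overall.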