How To Determine The Statistical Significance Of Overlap (Intersect) Between Three Sets
11.7 years ago
bsmith030465 ▴ 240

I have three overlapping sets and I want to find the probability of observing a larger intersection for 'A intersect B intersect C' than the one I have (in the example below, the probability of finding more than 135 elements common to sets A, B and C). For a two-set problem, I guess I would do a Fisher or chi-square test. Here is what I have attempted so far:

### Prepare a 3 way contingency table:
mytable <- array(c(135,116,385,6256,
                    48,97,274,9555),
                  dim = c(2,2,2),
                  dimnames = list(
                    Is_C = c('Yes','No'),
                    Is_B = c('Yes','No'),
                    Is_A = c('Yes','No')))

## test (the array defined above is named mytable)
mantelhaen.test(mytable, exact = TRUE, alternative = "greater")

Is this the right test (along with the current parameters) to determine what I want, or is there a more appropriate test for this?
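Because array() fills cells with the first dimension (Is_C) varying fastest, it can help to confirm that the eight counts landed where intended before running any test. The following is a minimal sketch using only base R on the table from the question; the names set_sizes, total and triple are introduced here for illustration.

```r
# Rebuild the 3-way table exactly as in the question.
mytable <- array(c(135, 116, 385, 6256,
                   48, 97, 274, 9555),
                 dim = c(2, 2, 2),
                 dimnames = list(
                   Is_C = c('Yes', 'No'),
                   Is_B = c('Yes', 'No'),
                   Is_A = c('Yes', 'No')))

# Flattened view, one row per (Is_C, Is_B) combination.
print(ftable(mytable))

# Size of each set: sum every cell where that set's indicator is 'Yes'.
set_sizes <- c(A = sum(mytable[, , 'Yes']),
               B = sum(mytable[, 'Yes', ]),
               C = sum(mytable['Yes', , ]))

total  <- sum(mytable)                  # size of the gene universe
triple <- mytable['Yes', 'Yes', 'Yes']  # genes in A, B and C

print(set_sizes)
print(c(total = total, triple = triple))
```

If the set sizes or the total printed here do not match the known sizes of A, B and C, the cell order passed to array() needs to be rearranged.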

statistics r • 14k views
I was going to suggest you post this also at cross-validated, but then I saw this! Glad biostars are more responsive...

I'm interested to hear what others say as to whether mantelhaen is the right test here. Don't forget that if your sets are genomic intervals, the standard methods are less likely to apply because of the non-randomness of the genome. For example, if all 3 of your datasets are likely to occur in gene bodies, then that is the real relationship, but it will make them appear to be co-occurring if you consider the entire genome as the background.

Each set consists of a group of genes, and I'm trying to see if the overlap is significant. All the sets are drawn from the full complement of genes across the genome (~17k). Does that answer your question?

Can you tell us if you are looking for genomic overlap?

11.7 years ago
brentp 24k

I think you probably want the multivariate version of the hypergeometric. You can find an implementation and documentation on that for R here:

http://rss.acs.unt.edu/Rdoc/library/BiasedUrn/html/BiasedUrn-3-Multivariate.html

For a strawman case, if we assume that there is no bias, I'm not sure if the above models will apply.

That may well be. Can you elaborate? Bias is a loaded term.

11.7 years ago

The approaches described in this report - Annotation Enrichment Analysis: An Alternative Method for Evaluating the Functional Properties of Gene Sets - may be useful to you, depending on exactly what you're comparing and where you want to take the results. The report is available here. Although the authors discuss an approach to gene set enrichment using GO terms when not all genes are equally annotated, the approach could be applied to other labels of the entities for which you are looking for overlap/enrichment.

That looks like an interesting paper. In the abstract, they say that their method "is able to predict biologically meaningful results that are obscured by the many false-positive enrichment scores that occur in FET (Fisher's Exact Test)...." I wonder if simply applying an FDR correction to the FET p-values would correct for some of this. I've done this in the past, but a quick search to find some support for this idea turns up this related paper with a potentially useful Perl package (from the same paper) for doing these calculations.

FDR (which we often employ) and FET may be adequate. We have not yet done what is described in the paper to which I linked, but intend to. It is an interesting approach indeed.
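For readers unfamiliar with the FET-plus-FDR combination mentioned here, a minimal base R sketch follows. It is only an illustration under stated assumptions: the three pairwise overlaps from the table in the question stand in for the many gene-set tests one would normally correct, Benjamini-Hochberg is assumed as the FDR method, and the names pairs, pvals and padj are hypothetical.

```r
# 3-way table from the question (dimension order: Is_C, Is_B, Is_A).
mytable <- array(c(135, 116, 385, 6256,
                   48, 97, 274, 9555),
                 dim = c(2, 2, 2),
                 dimnames = list(
                   Is_C = c('Yes', 'No'),
                   Is_B = c('Yes', 'No'),
                   Is_A = c('Yes', 'No')))

# Each entry names the two dimensions to keep; the third is summed out,
# which leaves a 2x2 table for one pair of sets.
pairs <- list(B_vs_A = c(2, 3), C_vs_A = c(1, 3), C_vs_B = c(1, 2))

# One-sided Fisher's exact test per pair (is the overlap larger than chance?).
pvals <- sapply(pairs, function(keep) {
  tab <- apply(mytable, keep, sum)
  fisher.test(tab, alternative = "greater")$p.value
})

# Benjamini-Hochberg adjustment across the family of tests.
padj <- p.adjust(pvals, method = "BH")

print(cbind(raw = pvals, BH = padj))
```

With only three tests the adjustment changes little; it matters when hundreds or thousands of gene sets are tested at once, as in GO enrichment.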

Interesting paper - will go into it a little later.

At the moment, I'm just trying to get a 'strawman' probability. If we assume independence and no bias (i.e. assume that there are ~17k numbered balls in an urn), what is the probability of finding more than 135 balls that are common to all three draws?

Although blatantly incorrect from a biological/genetic point of view, this is just one number that I'll be presenting...

Thanks for the replies!
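One way to get that strawman number, sketched in base R: if the three sets are treated as independent random draws of fixed size from the same urn, the size of 'A intersect B' is hypergeometric, and given that size the triple overlap is again hypergeometric, so the tail probability is a sum over the possible sizes of 'A intersect B'. The set sizes below are the margins of the table in the question; the function name p_triple_gt is hypothetical, and this null model is an assumption, not something the thread settled on.

```r
# Margins of the 3-way table in the question.
N  <- 16866   # all genes: sum of the eight cells
nA <- 6892    # 135 + 116 + 385 + 6256
nB <- 396     # 135 + 116 + 48 + 97
nC <- 842     # 135 + 385 + 48 + 274

# P(triple overlap > x) when A, B, C are independent uniform random
# subsets of sizes nA, nB, nC drawn from the same N items.
p_triple_gt <- function(x, N, nA, nB, nC) {
  ab   <- 0:min(nA, nB)                            # possible sizes of A n B
  p_ab <- dhyper(ab, m = nA, n = N - nA, k = nB)   # P(|A n B| = ab)
  # Given |A n B| = ab, the overlap with C is hypergeometric again.
  p_gt <- phyper(x, m = ab, n = N - ab, k = nC, lower.tail = FALSE)
  sum(p_ab * p_gt)
}

# Expected triple overlap under the same null (about 8 genes).
expected <- nA * nB * nC / N^2

# Probability of more than 135 genes in common by chance.
p_strawman <- p_triple_gt(135, N, nA, nB, nC)

print(expected)
print(p_strawman)
```

The expected overlap under this null is roughly 8 genes against the 135 observed, so the tail probability is very small; it says nothing about the biological biases raised earlier in the thread.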
