Question

Calculating statistical significance of overlap between two groups of genes - adapting hypergeometric test for my situation?

2

Entering edit mode

8.2 years ago

Rubal ▴ 350

Hello!

I have several lists of genes from a test I have run on genomes of different organisms. I have calculated the percentage of sharing between these lists in a pairwise fashion. I want to test whether the pairwise sharing is greater than expected by chance. The hypergeometric test appears to be the standard approach to do this. However, the versions of the test I have seen require you to input as background the number of genes in the genome. As this varies between each of lists (each is from a different species) I do not think I can implement the standard versions of the test available online.

I have been trying to generate distributions of expected sharing using resampling from my actual background genelists as I think this is the most appropriate solution (see http://stats.stackexchange.com/questions/232627/want-to-calculate-significance-of-pairwise-sharing-between-lists-standard-hyper ) but calculating so many pairwise comparisons is currently beyond my coding ability. To me resampling lists of the same size as those I observe from my test from the appropriate genomic background for each species then calculating the overlap from these subsamplings is the best possible approach to test whether the overlaps I observe in the real data are significant.

Any advice of tools I might be able to use, comments on the validity of my approach, or help with the code would be very appreciated.

Thanks for your assistance.

r genome genes overlap significance • 6.9k views

ADD COMMENT • link updated 8.2 years ago by Nicolas Rosewick 11k • written 8.2 years ago by Rubal ▴ 350

score 1 · Answer 1 · 2016-08-31

1

Entering edit mode

8.2 years ago

Nicolas Rosewick 11k

In R it's quiet straghtforward in fact. Something like this should work (not tested)

geneListA # vector of genes from organism A
geneListB # vectof of genes from organism B
N <- 1000 # simulation iteration

gA # list of genes of interest from organism A
gB # list of genes of interest from organism B

inter.r <- length(intersect(gA,gB)) # number of gene in common 

getSimIntersect <- function(gA,gB,geneListA,geneListB){
  gA.sim <- sample(geneListA,length(gA))
  gB.sim <- sample(geneListB,length(gB))
  return(length(intersect(gA.sim,gB.sim)))
}

inter.sim <- replicate(N,getSimIntersect(gA,gB,geneListA,geneListB))

p <- sum(inter.sim>=inter.r)/N

ADD COMMENT • link 8.2 years ago by Nicolas Rosewick 11k

0

Entering edit mode

thanks so much, I'll play around with this.

ADD REPLY • link 8.2 years ago by Rubal ▴ 350

0

Entering edit mode

One issue I have is that for some lists the sampling overlap is never as high as the observed overlap, so P-values are reported as 0. I would still like a way to assess whether some lists have a lot more sharing compared to random sampling than others. For example if in some cases up to 50% of genes are overlapping in the resampling compared to 60% in the real data, that is different than if a maximun of are 2% observed as overlapping in the resampling, but in both cases p-values are reported as 0. Is it straightforward to modify the code to output something like the mean percentage sharing between lists from across the resamplings? Then one could assess the magnitude of difference between the sampling and the observed data in cases where p-values are <0.001. Many thanks

ADD REPLY • link 8.2 years ago by Rubal ▴ 350

0

Entering edit mode

p value is not = 0 . If you do N=100,000 iterations the only thing you can say is that p < 1e-06 if you do not see any simulated intersection bigger than the real one. Nevertheless you can report the mean simulated intersection also.

ADD REPLY • link 8.2 years ago by Nicolas Rosewick 11k

0

Entering edit mode

I completely agree, just meant that R reports it as 0 but ofcourse you can only say it's smaller than your number of iterations.

ADD REPLY • link 8.2 years ago by Rubal ▴ 350

0

Entering edit mode

exactly. Maybe you could restrict your geneListA and B by taking into account for expression (only take expressed genes for example) in order to be more close to the reality. But that depends on your original question you want to answer.

ADD REPLY • link 8.2 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Good thought, though this is for genomic analyses of DNA not RNA, so I have to include the entire genomic background as in theory any gene could appear. I think all sets are exposed to the same underlying biases that might results in a base level of sharing that is above the amount seen from random sampling. Therefore I think it's important to quantify which lists show an even higher level of sharing than the others even if all of them are technically significantly more than observed by chance.

ADD REPLY • link 8.2 years ago by Rubal ▴ 350

0

Entering edit mode

just to confirm, I think for the line starting 'gBsim <-' you mean to sample from geneListB not geneListA correct? Also , for the subsampling section I get the error: Error in FUN(X[[i]], ...) : unused argument (X[[i]]) Trying to troubleshoot it now

ADD REPLY • link 8.2 years ago by Rubal ▴ 350

0

Entering edit mode

do you have any idea what might be causing this error? I checked that all the input files are present and correct.

ADD REPLY • link 8.2 years ago by Rubal ▴ 350

0

Entering edit mode

I modify the code to use replicate instead of sapply. It should work now

ADD REPLY • link 8.2 years ago by Nicolas Rosewick 11k

0

Entering edit mode

great this works. Thanks a lot! I have 10 lists and want to do all pairwise comparisons between them. I would plan to reasign geneListA ,geneListB, gA and gB each time and rerun this. Is there a much more efficient way? Don't worry if not, I'm happy to do it this way. Thanks again

ADD REPLY • link 8.2 years ago by Rubal ▴ 350