Good day,
I have 2 lists of genes that I wish to compare in terms of
- overlap between the genes themselves
- overlap between their respective pathways after ORA using innateDB
Gene list A consists of approximately 500 differentially expressed genes (DEGs) from a microarray analysis with over 17000 probes. Gene list B is a list of 200 susceptibility genes drawn from various GWAS. I have converted all genes in both lists to their Ensembl IDs so that I can compare like with like.
Firstly, to find the overlap between the genes themselves, I used Venny and found that 75% of the GWAS genes are represented on the microarray platform. Venny also showed me that 15 genes overlap between the DEG and GWAS gene lists, but I understand that I also need to perform a statistical test to check that this overlap is not just due to chance. However, I am unsure which test I should perform and which program I should use.
Unfortunately, I only have slight experience with MINITAB and SPSS and no experience with R.
Secondly, I have done a pathway over-representation analysis (ORA) using innateDB on both the DEG and GWAS gene lists. The DEG list gave ~150 over-represented pathways while the GWAS list gave ~100. Using the unique pathway identifier, I again compared the 2 pathway lists (1 from each gene list) and found that ~50 overlapped. Again, a statistical test should be done, but would this be the same test as in the first instance?
Thank you for all your help!
Tl;dr: I have overlapped 2 gene lists and found that they overlap in terms of genes and pathways but do not know what statistical test should be used to give evidence that these results are not due to chance.
A hypergeometric distribution test can be used to test whether the overlap is larger than expected by chance. See here: Probability of gene list overlap and here: http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c.
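For reference, the quantity such a test reports can be written as an upper-tail probability. The symbols here are my own notation, not something taken from the linked posts: N is the number of genes in the background (for example, everything on the platform), K is the size of one list, n is the size of the other list (restricted to the background), and k is the observed overlap.

```latex
P(X \ge k) \;=\; \sum_{i=k}^{\min(n,\,K)} \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

A small value means an overlap of k or more genes would be unlikely if the two lists were unrelated draws from the same background.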
Hello and thank you for your reply!
I read your recommendations about hypergeometric distribution testing and found them extremely helpful!
Is it therefore correct to say that since my situation is such that I have
However, only 75% (i.e. 150) of these GWAS genes are represented on the platform, and the overlap of gene lists A and B is 15 genes, so my hypergeometric distribution test should be looking for:
The probability of getting 15 or more white balls in a sample of size 150 from an urn with 500 white balls and 16500 black balls, assuming H0.
Therefore, if I used R it would want my input to be:
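For the urn described above, a minimal sketch of the calculation in plain Python (standard library only, rather than R) might look like the following. The function name and argument names are my own, and treating the ~17000 probes as 17000 distinct genes is an assumption carried over from the description above. In R, I believe the analogous call would be phyper(14, 500, 16500, 150, lower.tail = FALSE), because with lower.tail = FALSE phyper returns P(X > q), so q has to be 14 to get "15 or more".

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n balls are drawn without replacement from an urn
    of N balls, K of which are white. (Hypothetical helper name.)"""
    start = max(k, 0)
    stop = min(n, K)
    # Count the draws containing i white balls, for every i from k upwards.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(start, stop + 1))
    # Python divides arbitrarily large integers exactly, so no overflow here.
    return favourable / comb(N, n)


# Numbers from this thread: 17000 on the platform (assumed to be the
# background), 500 DEGs (white balls), 150 GWAS genes on the platform
# (sample size), 15 genes in the overlap.
p_value = hypergeom_upper_tail(k=15, N=17000, K=500, n=150)
print(p_value)
```

By chance alone one would expect roughly 150 × 500 / 17000 ≈ 4.4 overlapping genes, so the printed probability tells you how surprising 15 or more is against that baseline.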
Also, I was wondering if I could use a hypergeometric distribution test to test for the overlap between the 2 lists of pathways derived from pathway over-representation analysis (ORA) based on the 2 gene lists? If so, what would the N be? Would it be the total number of possible pathways you could get from the ORA?
Thank you for your time!
Yes, you have formulated it correctly. I have never used R's phyper, so I am not sure which parameter corresponds to what, but I am sure you have done it correctly. Theoretically you can apply the same test to pathways, but there may be some concerns. For example, only a small number of genes from a group may be over-represented in the pathways. Let's assume only 20 of the 150 GWAS genes are part of pathways; these may not represent the whole group. I am not sure whether you can still apply the test directly.
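On the question of what N would be for pathways: one way to see how much that choice matters is to run the same tail calculation for a few candidate background sizes. This is only a sketch. The background sizes 500, 1000 and 2000 are hypothetical placeholders, not innateDB figures, and the counts 150, 100 and 50 are the approximate numbers quoted in the question. It also does nothing about the concern above: pathways share genes, so they are not independent "balls", and the resulting p-values should be read with caution.

```python
from math import comb


def overlap_p_value(overlap, universe, size_a, size_b):
    """P(overlap or more shared items) if list B were a random draw of
    size_b items from `universe`, size_a of which are in list A.
    (Hypothetical helper name.)"""
    stop = min(size_a, size_b)
    favourable = sum(
        comb(size_a, i) * comb(universe - size_a, size_b - i)
        for i in range(overlap, stop + 1)
    )
    return favourable / comb(universe, size_b)


# Hypothetical background sizes: the real N would be the number of
# pathways that could have come out of the ORA, which is not given here.
candidate_universes = (500, 1000, 2000)

results = {
    universe: overlap_p_value(overlap=50, universe=universe, size_a=150, size_b=100)
    for universe in candidate_universes
}

for universe, p in results.items():
    expected = 150 * 100 / universe  # overlap expected by chance alone
    print(f"N={universe}: expected overlap ~{expected:.1f}, p={p:.3g}")
```

The smaller the assumed background, the larger the overlap expected by chance, so the conclusion can hinge on which pathways are counted as "possible".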