I have a dataset of gene-phenotype associations in this format. I am looking at combinations of phenotypes and the genes shared between them. I would like to use a statistical test to show whether the overlap in genes between two phenotypes is statistically significant, using a p-value or a similar measure.
For example:
22 genes are associated with Phenotype1
205 genes are associated with Phenotype2
9 genes are common to the two phenotypes
I want to assess whether the number of genes common to the two phenotypes is statistically significant or just a random observation.
I have phenotype information for 4035 genes; I assume that the human genome contains 42,071 genes.
How would you address this problem (preferably in R), and which statistical test would you recommend, and why?
PS. Edit on Oct 17, 2011: I posted this question at stats.stackexchange.com.
@Khader: That's the number of current entries in the gene database for Homo sapiens, which includes pseudogenes (e.g. LOC100736412), Neanderthal mitochondrial genes (trnL) and hypothetical proteins (e.g. DKFZP564C152). Just FYI, since those classes of genes would not typically be used to generate the phenotype-genotype gene lists and might inflate your number of comparisons.
This is a great question, very pertinent. Sure, you can assume that the genome is 42071 genes, but were all of them tested? You may need to lower that number, because not all genes are represented on genotyping and gene expression platforms. That may be a reason to use whole genome sequencing to identify variants and their associations, as well as RNA-Seq for gene expression.
Thanks Larry. Good point, but here I used 42071 genes because my phenotypes also include diseases. The gene-disease relationships were determined using biochemical experiments, not derived from array-based or sequencing-based experimental platforms.
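To make the point about the background size concrete, here is a minimal sketch of one common way to score such an overlap: the upper tail of a hypergeometric distribution (equivalent to a one-sided Fisher's exact test). This test is an assumption for illustration, not something settled in this thread, and it further assumes that every gene in the chosen background is equally likely to be annotated. The function name `overlap_p_value` is hypothetical. The numbers are the ones from the question, evaluated once with all 42071 genes as the background and once with only the 4035 genes that have phenotype information.

```python
from math import comb


def overlap_p_value(overlap, size_a, size_b, universe):
    """Return P(X >= overlap) under a hypergeometric null.

    size_a genes are drawn without replacement from `universe` genes,
    of which size_b are associated with the second phenotype.
    """
    total = comb(universe, size_a)
    largest = min(size_a, size_b)  # the overlap cannot exceed either set
    favourable = sum(
        comb(size_b, i) * comb(universe - size_b, size_a - i)
        for i in range(overlap, largest + 1)
    )
    return favourable / total


# Values from the question: 22 and 205 genes, 9 shared.
p_all_genes = overlap_p_value(9, 22, 205, 42071)  # background: all NCBI entries
p_annotated = overlap_p_value(9, 22, 205, 4035)   # background: annotated genes only

print(f"background 42071: p = {p_all_genes:.3g}")
print(f"background  4035: p = {p_annotated:.3g}")
```

The larger background yields the smaller p-value, because the expected overlap by chance shrinks as the background grows, so an inflated gene count makes the overlap look more significant than it may be. In R, the same upper tail should be obtainable with `phyper(8, 205, 42071 - 205, 22, lower.tail = FALSE)`.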
Although the definition of a gene is slippery, the conventional number for "how many protein-coding genes are in the human genome?" is about 25,000. Where did you get 42,071?
@David: The number is from NCBI (see: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606&lvl=3&lin=f&keep=1&srchmode=1&unlock). 25K may indicate reviewed proteins in the human proteome: http://www.uniprot.org/uniprot/?query=organism:9606+keyword:181
Yes David, thanks for your pointers. I agree that using the entire set of genes from NCBI may affect my analysis. In my dataset, I have associations with LOC* entries and hypothetical ones, but not trnL. I will check this and refine it further.
Please note that I cross-posted this question here: stats.stackexchange.com/questions/17132/statistical-significance-of-genes-associated-with-multiple-phenotypes