Measure Significance Of Genes Associated With Multiple Phenotypes
13.2 years ago

I have a dataset of gene-phenotype associations. I am looking at combinations of phenotypes and the genes shared between them. I would like to use a statistical test to show whether the set of genes shared between two phenotypes is statistically significant, using a p-value or a similar measure.

For example:

22 genes are associated with Phenotype1 
205 genes are associated with Phenotype2 
9 genes are common between two phenotypes

I want to assess whether the number of genes common to the two phenotypes is statistically significant or just a chance observation.

I have phenotype information for 4035 genes; I assume that the human genome contains 42,071 genes.

How would you address this problem (preferably in R)? What statistical test would you recommend, and why?

PS. Edit on Oct 17, 2011: I also posted this question at stats.stackexchange.com.


@Khader: That's the number of current entries in the gene database for Homo sapiens, which includes pseudogenes (e.g. LOC100736412), Neanderthal mitochondrial genes (e.g. trnL) and hypothetical proteins (e.g. DKFZP564C152). Just FYI: those classes of genes would not typically be used to generate phenotype-genotype gene lists, so including them might inflate your number of comparisons.


This is a great question and very pertinent. Sure, you can assume that the genome has 42,071 genes, but were all of them tested? You may need to lower that number, because not all genes are represented on genotyping and gene expression platforms. That is one argument for using whole-genome sequencing to identify variants and their associations, and RNA-Seq for gene expression.


Thanks, Larry. Good point, but I used 42,071 genes here because my phenotypes also include diseases. The gene-disease relationships were determined using biochemical experiments, not array-based or sequence-based experimental platforms.


Although the definition of a gene is slippery, the conventional number for "how many protein-coding genes are in the human genome?" is about 25,000. Where did you get 42,071?


Yes, David, thanks for your pointers. I agree that using the entire set of genes from NCBI may affect my analysis. In my dataset, I have associations with LOC* and hypothetical genes, but not with trnL. I will check this and refine the gene set further.


Please note that I cross-posted this question here: stats.stackexchange.com/questions/17132/statistical-significance-of-genes-associated-with-multiple-phenotypes

13.2 years ago

For our situation, which is quite similar to yours, we use a Z-score. This is described by Doniger, Conklin, et al. and gives a measure of the significance of the overlap between two sets. Generally, a Z-score of 1.96 indicates positive enrichment at a two-sided p-value of roughly 0.05, while a Z-score of -1.96 indicates negative enrichment (less overlap than expected), also at p of about 0.05. As the absolute value of Z increases in either direction, significance increases.
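
As a rough illustration, here is a minimal Python sketch of such a Z-score using the numbers from the question. It assumes the null model is the hypergeometric distribution (drawing the 22 Phenotype1 genes without replacement from the background) and standardizes the observed overlap by that distribution's mean and variance. I believe this matches the general form used in the paper, but treat that as an assumption and check the manuscript. The function name `overlap_z_score` and its parameter names are made up for this example.

```python
import math

def overlap_z_score(overlap, size_a, size_b, background):
    """Standardize an observed overlap under a hypergeometric null.

    overlap    -- genes shared by both sets
    size_a     -- genes in set A (treated as the sample drawn)
    size_b     -- genes in set B (treated as the "marked" genes)
    background -- total genes that could have been in either set
    """
    p = size_b / background                 # chance a random gene is in set B
    expected = size_a * p                   # expected overlap by chance
    # Finite-population correction, because genes are drawn without replacement.
    variance = size_a * p * (1 - p) * (1 - (size_a - 1) / (background - 1))
    return (overlap - expected) / math.sqrt(variance)

# Numbers from the question; the choice of background is the debated assumption.
z_annotated = overlap_z_score(9, 22, 205, 4035)    # genes with phenotype data
z_genome = overlap_z_score(9, 22, 205, 42071)      # all gene database entries

print(f"Z with 4035-gene background:  {z_annotated:.2f}")
print(f"Z with 42071-gene background: {z_genome:.2f}")
```

The larger background makes the same overlap look far more extreme, because the expected chance overlap shrinks as the background grows. That is why the choice of background discussed in the comments on the question matters.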


A Z-score of 1.96 comes from a normal distribution (or a standard normal variate). How do you validate the assumptions for normal distribution?
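
One way to probe that assumption, sketched here under the assumption that the overlap follows a hypergeometric distribution, is to compare the normal-approximation tail with the exact hypergeometric tail for the numbers in the question. The function names below are hypothetical and only the Python standard library is used. In R, the same exact tail should be available as `phyper(8, 22, 4013, 205, lower.tail = FALSE)`.

```python
import math
from math import comb

def hypergeom_upper_tail(overlap, size_a, size_b, background):
    """Exact P(X >= overlap) when size_a genes are drawn without replacement
    from a background that contains size_b marked genes."""
    total = comb(background, size_a)
    upper = min(size_a, size_b)
    hits = sum(
        comb(size_b, k) * comb(background - size_b, size_a - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total

def normal_upper_tail(overlap, size_a, size_b, background):
    """One-sided tail from the normal approximation to the same null."""
    p = size_b / background
    expected = size_a * p
    variance = size_a * p * (1 - p) * (1 - (size_a - 1) / (background - 1))
    z = (overlap - expected) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Numbers from the question, using the 4035 annotated genes as background.
p_exact = hypergeom_upper_tail(9, 22, 205, 4035)
p_normal = normal_upper_tail(9, 22, 205, 4035)

print(f"exact hypergeometric tail: {p_exact:.3g}")
print(f"normal approximation tail: {p_normal:.3g}")
```

With an expected overlap of only about one gene, the count distribution is strongly skewed, so the two tails can differ by orders of magnitude. In that regime the exact tail is probably the safer number to report.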


Thanks a lot for this, will check the manuscript.


Thanks, Adrian!
