I begin with a matrix of genes (~30000 rows) vs cases (columns). I have binary - TRUE or FALSE - data in each cell indicating whether there is a copy number aberration spanning the gene for that case. Next, I sum each row to get the total number of cases from the sample set that have a copy number aberration in each gene. I will call this c for each gene. Next, I retrieve a list of genes from http://cbio.mskcc.org/CancerGenes/Select.action (say the Entrez Query: Stability list, which yields 1023 genes). How can I test whether c (the total number of cases with a copy number aberration for a given gene) among the Entrez Query: Stability list is significantly greater than c among the gene population in general? Furthermore, is there some way of calculating some uncertainty measure of my result based on the fact that not all genes actually underpinning stability are in the Entrez Query: Stability list? I am using R.
NB. My attempt (probably wrong and overly complicated):
- Bootstrap with replacement 100000 times to get 100000 samples each of size 1023 from the 30000 genes. The 1023 Stability genes are excluded from the population from which the genes for each sample are drawn.
- Use
var.test
across c for each sample vs the 1023 Stability gene list. If the test does not produce a significant value, that sample can be used in step 3. Samples that produce a significant value are discarded. This is done to satisfy the assumption of the Wilcoxon test that the variances between the two test samples are similar. - Do a Kolmogorov–Smirnov test (
ks.test
) for each sample brought forward vs the 1023 Stability gene list. If the test does not produce a significant value, that sample can be used in step 4. Samples that produce a significant value are discarded. This is done to satisfy the assumption of the Wilcoxon test that the distributions between the two test samples are similar. - Do a Wilcoxon test (
wilcox.test
) for each sample brought forward vs the 1023 Stability gene list. If the mean (or median?) p-value for these tests is significant, I can say that c is significantly greater for Stability genes than it is for non-Stability genes.
+1 about the confounding effect of length. Do you think perhaps this could be mitigated somewhat by having the cumulative length (in bps) of the genes in the 1023 Stability gene list be equal to the cumulative length of the genes in the sampled gene list, give or take 1%?
That's comforting. You could try a ranksum or KS test to check whether there are significant differences between the two distributions.