I have a gene expression data set that can be split into genes that are significantly differentially expressed (DE) after an experimental condition and genes that are not. I'm interested in knowing whether the significantly DE genes occur around a subset of chromosomal locations of interest more often than the non-significantly DE genes do. I know I have to consider gene length and inter-gene gap length; at the moment, however, I just want a first-pass test to see whether significantly DE genes occur near these locations more frequently than non-significantly DE genes.
I have spent some time trying to figure out how to do this via a chi-square permutation test. I've written R code that creates bins of, say, 100 kb upstream and downstream from the locations, and builds a frequency table of the significantly DE and non-significantly DE genes sorted into these bins. Because there are far more non-significantly DE genes than significantly DE genes (about 1,000 versus about 100), it was suggested to me that I randomly sample 100 non-significantly DE genes and run the chi-square test 1,000 times, drawing a new random sample of the non-significantly DE genes on each iteration. This is a bit of a deviation from the chi-square permutation test I am used to, which randomly shuffles one variable of the contingency table to create a null distribution. Basically, I was told I should test only a subset of my data and build the null from random sampling of the complete set.
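To make the comparison concrete, here is a rough sketch of the label-shuffling version I am used to, written in Python rather than R and using only the standard library. The function names (`is_near`, `chi_square_2x2`, `permutation_test`) are made up for illustration, and it assumes a single chromosome, one position per gene, and a simple near/not-near split instead of multiple bins. It ignores gene length and gap length entirely.

```python
import random


def is_near(gene_pos, loci, window=100_000):
    """True if the gene position lies within `window` bp of any locus.

    Assumption: all positions are on the same chromosome.
    """
    return any(abs(gene_pos - locus) <= window for locus in loci)


def chi_square_2x2(near, is_de):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table of near/not-near versus DE/not-DE."""
    n = len(near)
    a = sum(1 for x, d in zip(near, is_de) if x and d)          # near, DE
    b = sum(1 for x, d in zip(near, is_de) if not x and d)      # far, DE
    c = sum(1 for x, d in zip(near, is_de) if x and not d)      # near, not DE
    e = n - a - b - c                                           # far, not DE
    denom = (a + b) * (c + e) * (a + c) * (b + e)
    if denom == 0:
        # A row or column is empty, so the statistic is undefined;
        # treat it as no evidence of association.
        return 0.0
    return n * (a * e - b * c) ** 2 / denom


def permutation_test(near, is_de, n_perm=1000, seed=0):
    """Shuffle the DE labels to build a null distribution of the statistic.

    Shuffling keeps the number of DE and non-DE genes fixed, so the
    unequal group sizes are preserved in every permutation.
    Returns (observed statistic, permutation p-value).
    """
    rng = random.Random(seed)
    observed = chi_square_2x2(near, is_de)
    labels = list(is_de)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if chi_square_2x2(near, labels) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data only: random positions, not real genes.
    rng = random.Random(1)
    loci = [2_000_000, 5_000_000, 8_000_000]
    positions = [rng.randrange(10_000_000) for _ in range(1100)]
    is_de = [i < 100 for i in range(1100)]  # first 100 genes flagged as DE
    near = [is_near(p, loci) for p in positions]
    stat, p = permutation_test(near, is_de)
    print(f"chi-square = {stat:.3f}, permutation p = {p:.3f}")
```

This version uses all of the genes at once, which is the part that differs from the subsampling scheme that was suggested to me.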
I have many questions (based on lots of failed R code), but they all stem from this one main question - is this approach an appropriate test for this aim?
Forgive me if this is very basic. I am very, very new to genomics and have little background in the area.
How about this:
I'd just put together a gene neighbour network in Cytoscape and use jActiveModules if I wanted a rough-and-ready answer.
Are you sure your candidate gene sets aren't correlated through some other technical artifact (shared probes / sequence identity)?