After finding differentially expressed genes (or methylation sites) that pass some Benjamini-Hochberg FDR cutoff, I often shuffle the data, re-run the pipeline, and see how many (spurious) differentially expressed genes pass that cutoff.
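For concreteness, the check looks roughly like the sketch below. It is only an illustration under stated assumptions: `bh_reject` and `count_hits_after_shuffle` are hypothetical names, `pvalue_fn` is a placeholder for whatever the real pipeline does to get one p-value per gene, and the data are assumed to be NumPy arrays with one row per individual.

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared against alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[:k + 1]] = True       # reject everything up to that rank
    return reject

def count_hits_after_shuffle(expr, clinical, pvalue_fn,
                             n_perm=100, alpha=0.05, seed=0):
    """Shuffle individuals, re-run the pipeline, count genes passing BH.

    expr      : (n_samples, n_genes) array
    clinical  : (n_samples, n_covariates) array
    pvalue_fn : placeholder callable, pvalue_fn(expr, clinical) -> p-values
    """
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_perm):
        perm = rng.permutation(expr.shape[0])
        pvals = pvalue_fn(expr, clinical[perm])  # clinical rows reassigned
        counts.append(int(bh_reject(pvals, alpha).sum()))
    return counts
```

The returned list has one entry per shuffle, so its distribution can be compared with the number of hits found on the unshuffled data.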
Say we have a model like:
expression ~ disease + age + gender + race
Generally, I shuffle the entire clinical data set, so that each individual is associated with a random expression vector. Is there any advantage to instead shuffling just the single covariate of interest? In the case above, I'd shuffle only the disease covariate instead of shuffling all the individuals.
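To make the two schemes explicit, here is a minimal sketch of what I mean by each. The function names are made up for illustration, and the clinical data are assumed to be a numeric NumPy array with one row per individual and one column per covariate (disease in column 0).

```python
import numpy as np

def shuffle_all_rows(clinical, rng):
    """Scheme 1: permute whole clinical rows.

    Disease, age, gender and race move together, so their relationships
    to each other are kept, but all of them are detached from expression.
    """
    return clinical[rng.permutation(clinical.shape[0])]

def shuffle_one_column(clinical, col, rng):
    """Scheme 2: permute a single covariate column.

    The other covariates stay attached to each individual's expression
    vector; only the chosen column is reassigned. By construction this
    also breaks any link between that column and the other covariates.
    """
    out = clinical.copy()
    out[:, col] = rng.permutation(out[:, col])
    return out

# Toy usage: 5 individuals, columns = disease, age, gender, race.
rng = np.random.default_rng(0)
clinical = np.arange(20, dtype=float).reshape(5, 4)
all_shuffled = shuffle_all_rows(clinical, rng)
disease_shuffled = shuffle_one_column(clinical, 0, rng)
```

In both cases the expression matrix itself is left untouched; only the clinical table passed to the model changes.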
Any literature on this?
Thanks for the ideas. I searched on stats.stackexchange and found some things to look at, e.g.: http://en.wikipedia.org/wiki/Resampling_(statistics). (I do see a difference between BH and permutation-based correction of p-values.)