I have a set of large genomic regions (~200, ~50-500kb) for which we hypothesize that there is higher RNA expression than expected by chance. The expression pertains to small RNAs, but let's make it broader and just think about RNA-seq coverage. An similar experiment would be if one had a set of ChIP-Seq peaks for a chromatin factor and wanted to test if genes overlapping those peaks have a higher changed of being (more) expressed.
My question is how best to test this hypothesis?
What I have done so far
I have estimated the coverage in those regions of interest, and as a control I used shuffleBed
to obtain 100 sets random regions (matched by size and chromosome), and also obtained the coverage for those regions. I could now average the coverage for each for the random sampling of each of these regions, and perform a statistical test (possibly non-parametric, but it depends on distribution of the sampling). Would be a good / ok / wrong way to go about it and are there better ways?
For me this problem sounds a bit vague. Density of genes is different across the genome. If you compare 50kb region, containing one gene, with a 50kb region without any genes (and I'd say the latter is more probable), you will get the conclusion that your region is "more expressed" than the background - while it will mean essentially nothing. I'd say you need to reformulate your question. If you choose regions, containing the same length of transcriptome inside, you'd be better, but not much - the question is essentially complex. Side note: you don't need to apply any test if you perform permutations, you get a p-value "by definition".
You are right. However, in my particular case, I am not looking at genes, but rather piRNA expression. These very peculiar small RNAs are not expressed in defined units like genes are (except in some species where the primary piRNAs are indeed produced from expression units). I will also test for overlap of those regions with the locations of transposable elements, which piRNAs target and are produced by, but this is a much simpler comparison: do two sets of genomic regions overlap significantly.
Then it is correct =) use a permutation approach - http://danielnee.com/2015/01/random-permutation-tests/ - and you'll be fine. You may probably use the approach similar to https://www.nature.com/articles/nature14221/figures/1 , but with the expression values instead of mutational density