I have a data frame that represents the frequency of gene features in the background population, and several sub populations. For each sub-population, I want to do an enrichment analysis in R with correction for multiple comparisons, to identify significantly enriched gene features in each sub population (separately), relative to the background frequency for that gene feature. (i.e. are there gene features that are significantly over-represented in population 1 compared to the background? How about pop 2? etc.)
The intended result being a new data frame, with all gene features as the fist column, and three additional columns representing the enrichment p-values for each sub-population.
It seems like a hypergeometric \ fishers test would be appropriate here, but I'm unsure of how to implement it across this structure where frequencies are already computed and how to account for multiple comparisons.
In the example below each row represents a gene feature of interest (named in the "Gene_feature" col), where "Freq_background" represents the frequency of that gene feature across a whole population, and each data column (Freq_1, Freq_2, Freq_3...) represents the frequency of the gene feature in a sub-population of interest.
#generate example data
make_names <- function(n = 500) {
do.call(paste0, replicate(4, sample(LETTERS, n, TRUE), FALSE))
}
Gene_feature<- make_names(10)
Freq_background<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
Freq_pop1<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
Freq_pop2<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
Freq_pop3<- replicate(10, runif(n=1, min=1e-12, max=.9999999999))
df<- as.data.frame(cbind(Gene_feature, Freq_background, Freq_pop1, Freq_pop2, Freq_pop3))
head(df,3)
| Gene_feature | Freq_background | Freq_1 | Freq_2 | Freq_3 |
| :----------- | :-------------: | :-------------: | :-------------: | :-------------: |
|PTPA |0.613762008430914|0.791384656368987|0.978364152503174|0.466154247010611|
|FOKS |0.517032054201702|0.873722912889272|0.866081856570345|0.385693898852212|
|AQPI |0.708481218362451|0.216534708838529|0.901803070356394|0.976621562051737|
.. Intended output:
| Gene_feature | Freq_1_pval | Freq_2_pval | Freq_3_pval |
| :----------- | :----------: | :----------: | :----------: |
|PTPA |0.009 |0.001 |0.087 |
|FOKS |0.599 |0.079 |0.999 |
|AQPI |0.870 |0.998 |0.790 |
..
check out SKAT-O. If I understand your question, SKAT-O is a pretty strong implementation for the type of test you describe.
essentially you'd be running the "burden" test on genomic elements instead of single variants... https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3415556/