9 months ago
Apex92
I have a data frame that looks like below.
Category,Total_Genes,UUAGGG_motif
Background,22591,18190
SetA,122,102
SetB,198,182
SetC,90,82
I have counted the number of genes containing the motif in each category. Now I want to calculate three p-values (SetA vs Background, SetB vs Background, and SetC vs Background) to see in which of the three categories the motif is enriched, taking the size of each category into account.
I came up with this approach in R - is this the correct way? Thank you in advance.
library(hypeR)  # not required for phyper(), which is in base R's stats package
# Number of background genes
N <- 22591
# Number of background genes with motif
K <- 18190
# Set A
n_A <- 122
k_A <- 102
# Set B
n_B <- 198
k_B <- 182
# Set C
n_C <- 90
k_C <- 82
# Perform hypergeometric test for Set A
p_value_A <- 1 - phyper(k_A - 1, K, N - K, n_A, lower.tail = TRUE)
# Perform hypergeometric test for Set B
p_value_B <- 1 - phyper(k_B - 1, K, N - K, n_B, lower.tail = TRUE)
# Perform hypergeometric test for Set C
p_value_C <- 1 - phyper(k_C - 1, K, N - K, n_C, lower.tail = TRUE)
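The same upper-tail calculation can be written once for all three sets. This is only a sketch, and it assumes that each set is a subset of the background genes, which is what the hypergeometric test requires. The `sets` data frame and its column names are made up for illustration, and the Benjamini-Hochberg adjustment is an addition that is not part of the code above.

```r
# Counts taken from the table in the question.
N <- 22591   # background genes
K <- 18190   # background genes with the motif

# Hypothetical container for the three sets (names are illustrative).
sets <- data.frame(
  category = c("SetA", "SetB", "SetC"),
  n = c(122, 198, 90),   # genes in each set
  k = c(102, 182, 82)    # genes in each set with the motif
)

# P(X >= k): upper tail of the hypergeometric distribution.
# lower.tail = FALSE returns P(X > q), so q = k - 1 gives P(X >= k).
sets$p_value <- phyper(sets$k - 1, K, N - K, sets$n, lower.tail = FALSE)

# Assumption: because three tests are run, adjust for multiple testing.
sets$p_adj <- p.adjust(sets$p_value, method = "BH")

print(sets)
```

Using `lower.tail = FALSE` is mathematically the same as `1 - phyper(..., lower.tail = TRUE)`, but it avoids the subtraction, which can lose precision when the p-value is very small.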
This seems like a good time to use the MEME Suite. You can use something like iupac2meme to convert your motif into a PWM, then run a tool like CentriMo with your gene list and your background list.