Hi all,
I did phylogenetic comparative analysis of genome size and environmental factors using more than 120 species. The function I used is from the package phylolm:
mod <- phylolm(log_GS ~ factor_scaled, data = df, phy = target_tree, model = "BM", boot = 10000)
The result is non-significant. In order to better understand the required effect size to detect with my current sample size, I've been exploring power analysis using simulation. I found a code snippet for power analysis, which is quite straightforward. In the following code, I assume that Y is genome size, and X is environmental factor in my case.
my_power_function <- function(effect, sample_size) {
sig_results <- c()
for (i in 1:500) {
X <- runif(sample_size, 0, 1)
Y <- effect * X + rnorm(sample_size, 0, 3)
tib <- data.frame(X = X, Y = Y)
model <- feols(Y ~ X, data = tib, se = 'hetero')
sig_results[i] <- tidy(model)$p.value[2] <= 0.05
}
sig_results %>% mean() %>% return()
}
power_levels <- c()
effects_to_try <- c(.4, .8, 1.2, 1.6, 2)
for (i in 1:5) {
power_levels[i] <- my_power_function(effects_to_try[i], 120)
}
power_results <- data.frame(effect = effects_to_try,
power = power_levels)
However, I am facing some challenges in generating the simulated dataset. The environmental factor in my study are not uniformly or normally distributed, and I am also uncertain about how to incorporate the residual variance with the phylogenetic signal.
If anyone has any ideas or clues on how I can modify the code or any other approaches to achieve my goal, I would greatly appreciate your input. Thanks in advance!