I used the limma in R to find significant abundance changes for a dataset containing ~1000 entries, three replicates for both treated groups and control. However, the volcano plot shows a weird U-shape:
This means larger FC got higher significance, which is for sure not reasonable. But when I tested another pair of data with a similar size (~1000 entries, triplicates for both treatment and control), the results were fine. Why limma gives this U-shape result?
Here is the example file containing data that generates good and weird results:
And here is the R script I run:
df <- data.frame(RawFile)
df <- dplyr::filter(df, !grepl('#', A1))
df <- dplyr::filter(df, !grepl('#', A4))
df2 <- df %>%
remove_rownames %>%
column_to_rownames(var="Reference")
df2 <- mutate_all(df2, function(x) as.numeric(as.character(x)))
mat <- data.matrix(df2)
#Input design matrix
design <- model.matrix(~ 0+factor(c(1,1,1,2,2,2)))
colnames(design) <- c("GroupI","GroupJ")
cont_matrix <- makeContrasts(GroupIvsGroupJ = GroupI-GroupJ, levels=design)
#Start limma analysis
fit <- lmFit(mat, design)
fit_contrast <- contrasts.fit(fit,cont_matrix)
fit_contrast <- eBayes(fit_contrast)
#Export
top_genes <- topTable(fit_contrast, number = 10000, adjust = "BH")
write.csv(top_genes, file="Output.csv")
Is this proteomics, metabolomics, or something else? I'm guessing not anything -seq. In any case, why do you suspect that the results are not accurate? Does the data need to be normalized? What do the p-value and adj-p-value histograms look like?
Your design matrix and limma steps look ok to me. My gut is telling me what this reflects some underlying phenomena and/or the data violates some underlying assumptions of limma.