Hi,
I'm doing an online Omics data analysis course. I don't understand how specific parts of the code chunks below work, and I'd really appreciate someone helping me to understand this code, as plainly as possible. The aim is to compute a null distribution of gene set enrichment (GSEA) scores. To do this the samples are permuted, the statistical analysis is repeated, the genes are re-ordered according to the new statistics, and a new GSEA score is computed.
packages needed:
library(tidyverse)
library(DESeq2)
library(GO.db)
library(org.Hs.eg.db)
Choose the gene set of interest (this part is fine, no explanation needed):
GO_0005925 <- AnnotationDbi::select(org.Hs.eg.db,
keys = "GO:0005925",
columns = c("ENTREZID", "SYMBOL"),
keytype = "GO") %>%
as_tibble %>%
filter(!duplicated(ENTREZID)
I think I know what the next code chunk is doing. x
is the DESeq object. x$Condition
is the colData
column which specifies the experimental conditions - either 'mock' transfection (3 samples) or the same genetic perturbation (3 samples). There are 6 samples in total.
##' A function that assigns the permuted Conditions,
##' runs DESeq and a GSEA analysis
##'
##' @param x a DESeq object.
##' @param perm `character()` of length `ncol(x)` indicating the
##' permutation to be tested.
##'
##' @return `numeric()` containing the GSEA path.
gsea_perm <- function(x, perm) {
## Set the Condition based on the permutation
x$Condition[perm] <- "mock"
x$Condition[-perm] <- "KD"
## Run DESeq2, extract and annotate results
suppressMessages(
tbl <- DESeq(x) %>%
results(name = "Condition_KD_vs_mock") %>%
as_tibble(rownames = "ENSEMBL") %>%
left_join(ensembl_to_geneName) %>%
mutate(inGO = ENTREZID %in% GO_0005925$ENTREZID) %>%
filter(!is.na(ENTREZID),
!is.na(padj),
!duplicated(ENTREZID)) %>%
dplyr::select(ENSEMBL, ENTREZID, padj, inGO) %>%
arrange(padj)
)
## Compute GSEA results
tbl <- tbl %>%
mutate(score = if_else(inGO, set_ratio(tbl), -1)) %>%
mutate(score = cumsum(score))
return(tbl$score)
}
Because there are 6 samples, they are permuted like so. Only the first ten (out of twenty) permutations are subsequently used:
(perms <- combn(6, 3))
This next code chunk is the one I'm struggling to understand. dds
is a DESeq object. What is p
and why function(p)
in the code chunk? The function is only run on the first ten permutations. Scores are then calculated:
paths <- apply(perms[, 1:10], 2, function(p) gsea_perm(dds, p))
scores <- sapply(paths, max)
A plot is then made comparing the actual score to the permutation score. How is the first element known to be the actual score? I think that is because the first permutation in combn(6, 3)
above is '1, 2, 3' which happens to be the correct experimental design (samples 1, 2 and 3 are the 'mock' transfection), but I could be wrong.
plot(density(scores))
rug(scores)
abline(v = scores[1])
Apologies for the lengthy question but I would really be grateful for some help with the specific queries I raised. I know I can use ClusterProfiler for the same ends but I am keen not to gloss over this R code without understanding it thoroughly.