Question

Tutorial:Clustering of DAVID gene enrichment results from gene expression studies

64

Entering edit mode

7.1 years ago

Kevin Blighe 89k

This tutorial was 're-awoken' on June 27th 2020 due to popular demand. This is not for those who just started to learn R; although, if you are a beginner, you can certainly try it ('give it a shot').

This essentially allows one to create a nice figure from the results of DAVID enrichment. In the same figure, it plots information about the enrichment itself and also the differential expression analysis that was performed prior to enrichment.

Update June 28, 2020: I have learned from Ming Tang that clusterProfiler already has a function to enrich via DAVID, which should make this tutorial easier: DAVID functional analysis with clusterProfiler.

The data that I use for this tutorial was originally obtained at and exported from: https://david.ncifcrf.gov/

1, download files from GitHub

URL: https://github.com/kevinblighe/BiostarsMisc

Post299161_DAVID.david
Post299161_DEresults.csv

2, organise the results table

topTable <- read.table('Post299161_DEresults.csv',
  header = TRUE, stringsAsFactors = FALSE, sep = ',')
rownames(topTable) <- topTable$Gene
topTable <- topTable[,-1]
head(topTable)

               baseMean log2FoldChange     lfcSE      stat       pvalue
WASH7P        583.27005     -1.8435993 0.2472482 -7.456473 8.886921e-14
MIR6859-1      14.80311     -1.6200682 0.9084491 -1.783334 7.453196e-02
RP11-34P13.15  38.84875      0.2174369 0.4546701  0.478230 6.324865e-01
RP11-34P13.16  26.67088      0.5864731 0.4809985  1.219283 2.227369e-01
RP11-34P13.13  11.03403      1.9163340 0.6305029  3.039374 2.370703e-03
FO538757.1    217.66152     -1.2191319 0.3031257 -4.021869 5.773809e-05
                      padj
WASH7P        4.028322e-13
MIR6859-1     1.011515e-01
RP11-34P13.15 6.845619e-01
RP11-34P13.16 2.731684e-01
RP11-34P13.13 4.058267e-03
FO538757.1    1.214672e-04

3, select genes of interest

# select by fold-change and adjustedp-value cut-off
sigGenes <- rownames(subset(topTable,
  abs(log2FoldChange)>=2 & padj<=0.05))
head(sigGenes)

# ..or just take top 75 by adjusted p-value
sigGenes <- rownames(topTable[order(topTable$padj, decreasing = FALSE),])[1:75]

NB - not all of the selected genes may end up in the final figure. It depeds on which genes you used during DAVID enrichment, and which terms were then statistically significantly enriched.

4, read in DAVID functional enrichment results

DAVID <- read.table('Post299161_DAVID.david',
  sep = '\t', header = TRUE, stringsAsFactors = FALSE)
colnames(DAVID)
head(DAVID)

# subset the DAVID results for the top enrichments
DAVIDsubset <- subset(DAVID, Benjamini <= 0.001)
DAVIDsubset <- DAVIDsubset[,c(1,2,3,6)]

For this tutorial, let's further filter these for just the enriched KEGG pathways

DAVIDsubset <- DAVIDsubset[grep('^hsa', DAVIDsubset$Term),]

5, create a new data-frame that has '`1`' for when a gene is part of a term, and '`0`' when not

annGSEA <- data.frame(row.names = sigGenes)
for (j in 1:length(sigGenes)) {
  # create a matching pattern to ensure genes match exactly
    #  '^GENE,'  --> Match at beginning of matching string
    #  ', GENE$'  --> Match at end of matching string
    #  'GENE,'  --> Match between first and last gene in matching string
  gene <- sigGenes[j]
  pattern <- paste('^', gene, ', |, ', gene, '$| ', gene, ',', sep = '')
  for (k in 1:nrow(DAVIDsubset)) {
    if (any(grepl(pattern, DAVIDsubset$Genes[k]))) {
      annGSEA[j,k] <- 1
    } else {
      annGSEA[j,k] <- 0
    }
  }
}
colnames(annGSEA) <- DAVIDsubset[,2]

# remove terms with no overlapping genes
annGSEA <- annGSEA[,apply(annGSEA, 2, mean)!=0]

# remove genes with no overlapping terms
annGSEA <- annGSEA[apply(annGSEA, 1, mean)!=0,]

annGSEA[1:5,1:5]

6, match the order of rownames in `toptable` with that of `annGSEA`

topTableAligned <- topTable[which(rownames(topTable) %in% rownames(annGSEA)),]
topTableAligned <- topTableAligned[match(rownames(annGSEA), rownames(topTableAligned)),]
all(rownames(topTableAligned) == rownames(annGSEA))

7, create heatmap annotations

require(ComplexHeatmap)
require(circlize)

First, let's create the annotation for the genes, which will comprise:

colour bar representing -log10(adjusted p-value) for each gene from differential expression analysis
as above but for log2 fold-change

# colour bar for -log10(adjusted p-value) for sigGenes
dfMinusLog10FDRGenes <- data.frame(-log10(
  topTableAligned[which(rownames(topTableAligned) %in% rownames(annGSEA)), 'padj']))
dfMinusLog10FDRGenes[dfMinusLog10FDRGenes == 'Inf'] <- 0

# colour bar for fold changes for sigGenes
dfFoldChangeGenes <- data.frame(
  topTableAligned[which(rownames(topTableAligned) %in% rownames(annGSEA)), 'log2FoldChange'])

# merge both
dfGeneAnno <- data.frame(dfMinusLog10FDRGenes, dfFoldChangeGenes)
colnames(dfGeneAnno) <- c('Gene score', 'Log2FC')
dfGeneAnno[,2] <- ifelse(dfGeneAnno$Log2FC > 0, 'Up-regulated',
  ifelse(dfGeneAnno$Log2FC < 0, 'Down-regulated', 'Unchanged'))
colours <- list(
  'Log2FC' = c('Up-regulated' = 'royalblue', 'Down-regulated' = 'yellow'))
haGenes <- rowAnnotation(
  df = dfGeneAnno,
  col = colours,
  width = unit(1,'cm'),
  annotation_name_side = 'top')

Now a separate colour bar for the DAVID enrichment Benjamini p-value. This will also contain the enriched term names via annot_text()

# colour bar for -log10(Benjamini enrichment Q value) for DAVID results
dfMinusLog10BenjaminiTerms <- data.frame(-log10(
  DAVID[which(DAVID$Term %in% colnames(annGSEA)), 'Benjamini']))
colnames(dfMinusLog10BenjaminiTerms) <- 'Enrichment\nterm score'
haTerms <- HeatmapAnnotation(
  df = dfMinusLog10BenjaminiTerms,
  Term = anno_text(
    colnames(annGSEA),
    rot = 45,
    just = 'right',
    gp = gpar(fontsize = 12)),
  annotation_height = unit.c(unit(1, 'cm'), unit(8, 'cm')),
  annotation_name_side = 'left')

8, now generate the heatmap

hmapGSEA <- Heatmap(annGSEA,
  name = 'DAVID GO enrichment',
  split = dfGeneAnno[,2],

  col = c('0' = 'white', '1' = 'forestgreen'),

  rect_gp = gpar(col = 'grey85'),

  cluster_rows = TRUE,
  show_row_dend = TRUE,
  row_title = 'Top Genes',
  row_title_side = 'left',
  row_title_gp = gpar(fontsize = 11, fontface = 'bold'),
  row_title_rot = 90,
  show_row_names = TRUE,
  row_names_gp = gpar(fontsize = 11, fontface = 'bold'),
  row_names_side = 'left',
  row_dend_width = unit(35, 'mm'),

  cluster_columns = TRUE,
  show_column_dend = TRUE,
  column_title = 'Enriched terms',
  column_title_side = 'top',
  column_title_gp = gpar(fontsize = 12, fontface = 'bold'),
  column_title_rot = 0,
  show_column_names = FALSE,

  show_heatmap_legend = FALSE,

  clustering_distance_columns = 'euclidean',
  clustering_method_columns = 'ward.D2',
  clustering_distance_rows = 'euclidean',
  clustering_method_rows = 'ward.D2',

  bottom_annotation = haTerms)

draw(hmapGSEA + haGenes,
  heatmap_legend_side = 'right',
  annotation_legend_side = 'right')

Kevin

enrichment david • 23k views

ADD COMMENT • link 11 months ago by Kevin Blighe 89k

1

Entering edit mode

Hi@Kevin now since i have to do pairwise comparison im back it im doing every thing as you have mentioned after running the for loop i get this
when i see annGSEA i get this "disulfide bond Disulfide bond glycosylation site:N-linked (GlcNAc...) Glycoprotein signal peptide GO:0005615~extracellular space Polymorphism Signal hsa05310:Asthma topological domain:Extracellular GO:0005886~plasma membrane Receptor GO:0005887~integral component of plasma membrane Membrane as my colnames and my first column being my gene names so all my ontology columns are zero im not sure what im doing wrong..

ADD REPLY • link 6.8 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

How to export in tsv file, when I click on download option it opens a new window with tex.

ADD REPLY • link 7.1 years ago by bikash2510 ▴ 30

0

Entering edit mode

DAVID There is no option for saving as.

ADD REPLY • link 7.1 years ago by bikash2510 ▴ 30

0

Entering edit mode

Hi Kevin,

I am glad I found your thread, as it is basically what I am trying to do/present my data, but being new to R so I am finding it tricky.

I have Normalised gene expression data (mRNA) and I have managed to use DAVID to cluster the Genes, and have the Enrichment file.

I want to display the mean normalised values rather than significance. i.e A heatmap to display upregulated genes and another heatmap to display downregulated genes in the pathways.

I don't know how to statistically analyse the data using DESeq2, I have calculated the P-Value on Excel, so I only have the Mean of my normalised data set and the P-Value (no base mean, no log 2 fold change, no adjusted P-value), is it possible to generate a heatmap with just the data sets I have? I just want to have simple heatmaps, as I want to see which pathway/s are mostly dysregulated in response to my treatment, and then start to create other graphical representations of interesting pathways/genes.

I have tried to follow your example above, but R gives me an error for rldMatrix (saying it could not find the function). I have managed to write the codes for the rest, and it seems to be fine. Just finding the initial bit (data organisation) tricky.

Thank you for your help. Sehar :)

ADD REPLY • link 6.9 years ago by ssajid • 0

0

Entering edit mode

@Kevin you have a topMatrix object here , but i did clustering and taking some of the clustered gene set and giving that as input to david so what is my top matrix as im doing wgcna so there is no differential expression analysis involved so the only thing i have gene list from modules ..any suggestion

ADD REPLY • link 6.8 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Very useful and a great way to look that data will multi feature. Will give it a try. Great resource to be bookmarked.

ADD REPLY • link 6.8 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

log2FCcutoff <- 3
    BHcutoff <- 0.05
    sigGeneList <- subset(topTable, abs(log2FoldChange)>=log2FCcutoff & padj<=BHcutoff)[,1]




head(sigGeneList)


DAVIDfile <- "MyEnrichment.tsv"
DAVID <- read.table(DAVIDfile, sep="\t", header=TRUE)
colnames(DAVID)


enrichBcutoff <- 0.05
DAVID <- subset(DAVID, Benjamini<enrichBcutoff)
DAVID <- DAVID[,c(1,2,3,6)]



annGSEA <- data.frame(row.names=sigGeneList)
for (j in 1:length(sigGeneList))
{
  #Create a matching pattern to ensure genes match exactly
  #   ^GENE,  --> Match at beginning of matching string
  #   , GENE$ --> Match at end of matching string
  #    GENE,  --> Match between first and last gene in matching string
  gene <- sigGeneList[j]
  #pattern <- paste("^", gene, ", |, ", gene, "$| ", gene, ",", sep="")
  pattern <- paste("^", gene, ", |, ", gene, "$| ", gene, ",", sep="")

  for (k in 1:nrow(DAVID))
  {
    if (any(grep(pattern, DAVID$Genes[k])))
    {
      annGSEA[j,k] <- 1
    }
    else
    {
      annGSEA[j,k] <- 0
    }
  }
}
colnames(annGSEA) <- DAVID[,2]

#Remove terms with no overlapping genes
annGSEA <- annGSEA[,apply(annGSEA, 2, mean)!=0]

#Remove genes with no overlapping terms
annGSEA <- annGSEA[apply(annGSEA, 1, mean)!=0,]

this the code exact code im running ,my "MyEnrichment.tsv" contains the DAVID output which i get after giving gene list for enrichment with cutoff range .so im not sure what im doing wrong

ADD REPLY • link 6.8 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

This is the error now im getting

Error in grepl(pattern, DAVID$Genes[k]) :

which is now gone now im getting a new error

dfMinusLog10FDRGenes[dfMinusLog10FDRGenes=="Inf"] <- 0

Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn,  : 
  length of 'dimnames' [2] not equal to array extent

The

dfMinusLog10FDRGenes

is empty ...

ADD REPLY • link 6.8 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Hello @kevin, Thank you for the code ! It is really helpful My question is every time I plot new dataset with this code, colors for the DEG sig score are keep changing. How can I fix the color of DEG significance color bar (as well as GO term score bar), Thanks in advance,

ADD REPLY • link 6.4 years ago by skylinesky ▴ 10

0

Entering edit mode

Hello Kevin Blighe, I have one question. I did Functional Annotation Clustering and i got 100 clusters. But on the annotation results Ensemble ID is coming . Is there any way to replace those ensemble ID to gene name from functional annotation clustering output

ADD REPLY • link 5.6 years ago by the_cowa ▴ 40

0

Entering edit mode

Many thanks Kevin for this great tutorial. I went through all the step (just for up-regulated genes) and the image that I got at the end is like this. Why at the bottom of the right side there are some muddled text. How can I remove them? I tried but I couldn't. I really appreciate your time and help.

enter image description here

ADD REPLY • link 5.5 years ago by Raheleh ▴ 260

0

Entering edit mode

This is usually just a sizing issue, a lot of the times I can solve these problems by saving as a legal-sized pdf or increasing the size of my image in the save. Then you can always downsize it into your manuscript or paper.

ADD REPLY • link 3.2 years ago by Kristy • 0

0

Entering edit mode

Hi I am quite new with coding and cant seem to be able to solve/understand this error

Error in grepl(pattern, DAVID$Genes[k]) : invalid regular expression '^c(220.7243738, 65.54562172,

I've copied and pasted the code you have provided so I've no idea what went wrong.

annGSEA <- data.frame(row.names=sigGeneList)
for (j in 1:length(sigGeneList))
{
  gene <- sigGeneList[j]
  pattern <- paste("^", gene, ", |, ", gene, "$| ", gene, ",", sep="")
  for (k in 1:nrow(DAVID))
  {
    if (any(grepl(pattern, DAVID$Genes[k])))
    {
      annGSEA[j,k] <- 1
    }
    else
    {
      annGSEA[j,k] <- 0
    }
  }
}

ADD REPLY • link 5.3 years ago by Dorothy • 0

0

Entering edit mode

It would be great if you could share the updated version of the tutorial for all since I'm sure there will be many beginners like me. I'm assuming the output you are referring to is before enrichment cutoff.

> str(DAVID)
'data.frame':   495 obs. of  13 variables:
 $ Category       : Factor w/ 13 levels "BBID","BIOCARTA",..: 12 4 12 13 4 12 12 8 6 13 ...
 $ Term           : Factor w/ 495 levels "109.Chemokine_families",..: 30 94 33 469 95 51 462 332 78 32 ...
 $ Count          : int  35 42 120 116 42 139 84 36 28 104 ...
 $ %              : num  10.8 13 37.2 35.9 13 ...
 $ PValue         : num  4.30e-27 5.77e-22 2.22e-21 2.67e-20 2.93e-20 ...
 $ Genes          : Factor w/ 392 levels "ADAMTS14, RSPO3, UNC5C, THBS1",..: 14 16 270 268 71 314 235 15 373 269 ...
 $ List Total     : int  304 279 304 297 279 304 304 159 276 297 ...
 $ Pop Hits       : int  190 379 3434 3346 421 4551 1965 243 176 2917 ...
 $ Pop Total      : int  20581 16792 20581 20063 16792 20581 20581 6879 16881 20063 ...
 $ Fold Enrichment: num  12.47 6.67 2.37 2.34 6 ...
 $ Bonferroni     : num  1.20e-24 9.51e-19 6.21e-19 2.57e-17 4.84e-17 ...
 $ Benjamini      : num  1.20e-24 9.51e-19 3.11e-19 2.57e-17 2.42e-17 ...
 $ FDR            : num  5.69e-24 9.70e-19 2.93e-18 4.20e-17 4.93e-17 ...

> DAVID[1:5,1:5]
          Category                             Term Count        %   PValue
1      UP_KEYWORDS                         Cytokine    35 10.83591 4.30e-27
2 GOTERM_BP_DIRECT GO:0006954~inflammatory response    42 13.00310 5.77e-22
3      UP_KEYWORDS                   Disulfide bond   120 37.15170 2.22e-21
4   UP_SEQ_FEATURE                   signal peptide   116 35.91331 2.67e-20
5 GOTERM_BP_DIRECT       GO:0006955~immune response    42 13.00310 2.93e-20

ADD REPLY • link 5.3 years ago by Dorothy • 0

0

Entering edit mode

Hi Kevin, thanks for this great tutorial. However, can you please elaborate on how did you calculated the gene score? Thanks

ADD REPLY • link 4.8 years ago by Suren ▴ 110

0

Entering edit mode

Hello Suren, the gene score is the negative log10 of the adjusted p-value for each gene, calculated in this line:

dfMinusLog10FDRGenes <- data.frame(-log10(
  topTableAligned[which(rownames(topTableAligned) %in% rownames(annGSEA)), 'padj']))

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

1

Entering edit mode

Thanks Kevin. I understand now.

ADD REPLY • link 4.8 years ago by Suren ▴ 110

0

Entering edit mode

Hi Kevin, Can you mention your papers where such figures were used? How to show the foldchange value (or expression intensity) instead of green rectangular boxed. Thanks

ADD REPLY • link 4.7 years ago by Suren ▴ 110

1

Entering edit mode

This was originally used in a PhD thesis, so, not published as such. If you just want a heatmap based on fold-changes, I would just generate a fold-change matrix and then used ComplexHeatmap in the standard way.

ADD REPLY • link 4.7 years ago by Kevin Blighe 89k

1

Entering edit mode

thanks Kevin .....:)

ADD REPLY • link 4.7 years ago by Suren ▴ 110

0

Entering edit mode

Hello @Kevin how can i download DAVID enrichment file in .david format ?

ADD REPLY • link 4.6 years ago by user.bioinfo • 0

0

Entering edit mode

can we do the same with multiple samples instead of genes? Like samples as column and pathway as row?

ADD REPLY • link 4.0 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

This is a really great piece of code, Kevin, thank you! I had to modify the first few lines from the files on github to get things going, so sharing here in case the holdup of unique row names is catching anyone else new to R. Reminder to everyone that to pull a csv from github you just need to open the raw file, save as excel and excel does a pretty good job with the rest, don't let troubleshooting some small problems with the csv file hold you back from experimenting with this type of heatmap.

##Organize the results table

topTable <- read.table('Post299161_DEresults.csv', header = TRUE, stringsAsFactors = FALSE, sep = ',')
row.names(topTable) <- paste(make.unique(topTable[,1]))
topTable <- topTable[,-1]
head(topTable)

ADD REPLY • link updated 11 months ago by GenoMax 150k • written 3.2 years ago by Kristy • 0

0

Entering edit mode

The bar for normalised enrichment score is overlaying the enriched terms. what could be the reason for this error?

1, download files from GitHub

2, organise the results table

3, select genes of interest

4, read in DAVID functional enrichment results

5, create a new data-frame that has '1' for when a gene is part of a term, and '0' when not

6, match the order of rownames in toptable with that of annGSEA

7, create heatmap annotations

8, now generate the heatmap

5, create a new data-frame that has '`1`' for when a gene is part of a term, and '`0`' when not

6, match the order of rownames in `toptable` with that of `annGSEA`