This tutorial was 're-awoken' on June 27th 2020 due to popular demand. This is not for those who just started to learn R; although, if you are a beginner, you can certainly try it ('give it a shot').
This essentially allows one to create a nice figure from the results of DAVID enrichment. In the same figure, it plots information about the enrichment itself and also the differential expression analysis that was performed prior to enrichment.
Update June 28, 2020: I have learned from Ming Tang that clusterProfiler already has a function to enrich via DAVID, which should make this tutorial easier: DAVID functional analysis with clusterProfiler.
The data that I use for this tutorial was originally obtained at and exported from: https://david.ncifcrf.gov/
1, download files from GitHub
URL: https://github.com/kevinblighe/BiostarsMisc
- Post299161_DAVID.david
- Post299161_DEresults.csv
2, organise the results table
topTable <- read.table('Post299161_DEresults.csv',
header = TRUE, stringsAsFactors = FALSE, sep = ',')
rownames(topTable) <- topTable$Gene
topTable <- topTable[,-1]
head(topTable)
baseMean log2FoldChange lfcSE stat pvalue
WASH7P 583.27005 -1.8435993 0.2472482 -7.456473 8.886921e-14
MIR6859-1 14.80311 -1.6200682 0.9084491 -1.783334 7.453196e-02
RP11-34P13.15 38.84875 0.2174369 0.4546701 0.478230 6.324865e-01
RP11-34P13.16 26.67088 0.5864731 0.4809985 1.219283 2.227369e-01
RP11-34P13.13 11.03403 1.9163340 0.6305029 3.039374 2.370703e-03
FO538757.1 217.66152 -1.2191319 0.3031257 -4.021869 5.773809e-05
padj
WASH7P 4.028322e-13
MIR6859-1 1.011515e-01
RP11-34P13.15 6.845619e-01
RP11-34P13.16 2.731684e-01
RP11-34P13.13 4.058267e-03
FO538757.1 1.214672e-04
3, select genes of interest
# select by fold-change and adjustedp-value cut-off
sigGenes <- rownames(subset(topTable,
abs(log2FoldChange)>=2 & padj<=0.05))
head(sigGenes)
# ..or just take top 75 by adjusted p-value
sigGenes <- rownames(topTable[order(topTable$padj, decreasing = FALSE),])[1:75]
NB - not all of the selected genes may end up in the final figure. It depeds on which genes you used during DAVID enrichment, and which terms were then statistically significantly enriched.
4, read in DAVID functional enrichment results
DAVID <- read.table('Post299161_DAVID.david',
sep = '\t', header = TRUE, stringsAsFactors = FALSE)
colnames(DAVID)
head(DAVID)
# subset the DAVID results for the top enrichments
DAVIDsubset <- subset(DAVID, Benjamini <= 0.001)
DAVIDsubset <- DAVIDsubset[,c(1,2,3,6)]
For this tutorial, let's further filter these for just the enriched KEGG pathways
DAVIDsubset <- DAVIDsubset[grep('^hsa', DAVIDsubset$Term),]
5, create a new data-frame that has '1
' for when a gene is part of a term, and '0
' when not
annGSEA <- data.frame(row.names = sigGenes)
for (j in 1:length(sigGenes)) {
# create a matching pattern to ensure genes match exactly
# '^GENE,' --> Match at beginning of matching string
# ', GENE$' --> Match at end of matching string
# 'GENE,' --> Match between first and last gene in matching string
gene <- sigGenes[j]
pattern <- paste('^', gene, ', |, ', gene, '$| ', gene, ',', sep = '')
for (k in 1:nrow(DAVIDsubset)) {
if (any(grepl(pattern, DAVIDsubset$Genes[k]))) {
annGSEA[j,k] <- 1
} else {
annGSEA[j,k] <- 0
}
}
}
colnames(annGSEA) <- DAVIDsubset[,2]
# remove terms with no overlapping genes
annGSEA <- annGSEA[,apply(annGSEA, 2, mean)!=0]
# remove genes with no overlapping terms
annGSEA <- annGSEA[apply(annGSEA, 1, mean)!=0,]
annGSEA[1:5,1:5]
6, match the order of rownames in toptable
with that of annGSEA
topTableAligned <- topTable[which(rownames(topTable) %in% rownames(annGSEA)),]
topTableAligned <- topTableAligned[match(rownames(annGSEA), rownames(topTableAligned)),]
all(rownames(topTableAligned) == rownames(annGSEA))
7, create heatmap annotations
require(ComplexHeatmap)
require(circlize)
First, let's create the annotation for the genes, which will comprise:
- colour bar representing -log10(adjusted p-value) for each gene from differential expression analysis
- as above but for log2 fold-change
# colour bar for -log10(adjusted p-value) for sigGenes
dfMinusLog10FDRGenes <- data.frame(-log10(
topTableAligned[which(rownames(topTableAligned) %in% rownames(annGSEA)), 'padj']))
dfMinusLog10FDRGenes[dfMinusLog10FDRGenes == 'Inf'] <- 0
# colour bar for fold changes for sigGenes
dfFoldChangeGenes <- data.frame(
topTableAligned[which(rownames(topTableAligned) %in% rownames(annGSEA)), 'log2FoldChange'])
# merge both
dfGeneAnno <- data.frame(dfMinusLog10FDRGenes, dfFoldChangeGenes)
colnames(dfGeneAnno) <- c('Gene score', 'Log2FC')
dfGeneAnno[,2] <- ifelse(dfGeneAnno$Log2FC > 0, 'Up-regulated',
ifelse(dfGeneAnno$Log2FC < 0, 'Down-regulated', 'Unchanged'))
colours <- list(
'Log2FC' = c('Up-regulated' = 'royalblue', 'Down-regulated' = 'yellow'))
haGenes <- rowAnnotation(
df = dfGeneAnno,
col = colours,
width = unit(1,'cm'),
annotation_name_side = 'top')
Now a separate colour bar for the DAVID enrichment Benjamini p-value. This will also contain the enriched term names via annot_text()
# colour bar for -log10(Benjamini enrichment Q value) for DAVID results
dfMinusLog10BenjaminiTerms <- data.frame(-log10(
DAVID[which(DAVID$Term %in% colnames(annGSEA)), 'Benjamini']))
colnames(dfMinusLog10BenjaminiTerms) <- 'Enrichment\nterm score'
haTerms <- HeatmapAnnotation(
df = dfMinusLog10BenjaminiTerms,
Term = anno_text(
colnames(annGSEA),
rot = 45,
just = 'right',
gp = gpar(fontsize = 12)),
annotation_height = unit.c(unit(1, 'cm'), unit(8, 'cm')),
annotation_name_side = 'left')
8, now generate the heatmap
hmapGSEA <- Heatmap(annGSEA,
name = 'DAVID GO enrichment',
split = dfGeneAnno[,2],
col = c('0' = 'white', '1' = 'forestgreen'),
rect_gp = gpar(col = 'grey85'),
cluster_rows = TRUE,
show_row_dend = TRUE,
row_title = 'Top Genes',
row_title_side = 'left',
row_title_gp = gpar(fontsize = 11, fontface = 'bold'),
row_title_rot = 90,
show_row_names = TRUE,
row_names_gp = gpar(fontsize = 11, fontface = 'bold'),
row_names_side = 'left',
row_dend_width = unit(35, 'mm'),
cluster_columns = TRUE,
show_column_dend = TRUE,
column_title = 'Enriched terms',
column_title_side = 'top',
column_title_gp = gpar(fontsize = 12, fontface = 'bold'),
column_title_rot = 0,
show_column_names = FALSE,
show_heatmap_legend = FALSE,
clustering_distance_columns = 'euclidean',
clustering_method_columns = 'ward.D2',
clustering_distance_rows = 'euclidean',
clustering_method_rows = 'ward.D2',
bottom_annotation = haTerms)
draw(hmapGSEA + haGenes,
heatmap_legend_side = 'right',
annotation_legend_side = 'right')
Kevin
Hi@Kevin now since i have to do pairwise comparison im back it im doing every thing as you have mentioned after running the for loop i get this
when i see annGSEA i get this
"disulfide bond Disulfide bond glycosylation site:N-linked (GlcNAc...) Glycoprotein signal peptide GO:0005615~extracellular space Polymorphism Signal hsa05310:Asthma topological domain:Extracellular GO:0005886~plasma membrane Receptor GO:0005887~integral component of plasma membrane Membrane
as my colnames and my first column being my gene names so all my ontology columns are zero im not sure what im doing wrong..How to export in tsv file, when I click on download option it opens a new window with tex.
There is no option for saving as.
Hi Kevin,
I am glad I found your thread, as it is basically what I am trying to do/present my data, but being new to R so I am finding it tricky.
I have Normalised gene expression data (mRNA) and I have managed to use DAVID to cluster the Genes, and have the Enrichment file.
I want to display the mean normalised values rather than significance. i.e A heatmap to display upregulated genes and another heatmap to display downregulated genes in the pathways.
I don't know how to statistically analyse the data using DESeq2, I have calculated the P-Value on Excel, so I only have the Mean of my normalised data set and the P-Value (no base mean, no log 2 fold change, no adjusted P-value), is it possible to generate a heatmap with just the data sets I have? I just want to have simple heatmaps, as I want to see which pathway/s are mostly dysregulated in response to my treatment, and then start to create other graphical representations of interesting pathways/genes.
I have tried to follow your example above, but R gives me an error for rldMatrix (saying it could not find the function). I have managed to write the codes for the rest, and it seems to be fine. Just finding the initial bit (data organisation) tricky.
Thank you for your help. Sehar :)
@Kevin you have a topMatrix object here , but i did clustering and taking some of the clustered gene set and giving that as input to david so what is my top matrix as im doing wgcna so there is no differential expression analysis involved so the only thing i have gene list from modules ..any suggestion
Very useful and a great way to look that data will multi feature. Will give it a try. Great resource to be bookmarked.
this the code exact code im running ,my "MyEnrichment.tsv" contains the DAVID output which i get after giving gene list for enrichment with cutoff range .so im not sure what im doing wrong
This is the error now im getting
which is now gone now im getting a new error
The
is empty ...
Hello @kevin, Thank you for the code ! It is really helpful My question is every time I plot new dataset with this code, colors for the DEG sig score are keep changing. How can I fix the color of DEG significance color bar (as well as GO term score bar), Thanks in advance,
Hello Kevin Blighe, I have one question. I did Functional Annotation Clustering and i got 100 clusters. But on the annotation results Ensemble ID is coming . Is there any way to replace those ensemble ID to gene name from functional annotation clustering output
Many thanks Kevin for this great tutorial. I went through all the step (just for up-regulated genes) and the image that I got at the end is like this. Why at the bottom of the right side there are some muddled text. How can I remove them? I tried but I couldn't. I really appreciate your time and help.
This is usually just a sizing issue, a lot of the times I can solve these problems by saving as a legal-sized pdf or increasing the size of my image in the save. Then you can always downsize it into your manuscript or paper.
Hi I am quite new with coding and cant seem to be able to solve/understand this error
I've copied and pasted the code you have provided so I've no idea what went wrong.
It would be great if you could share the updated version of the tutorial for all since I'm sure there will be many beginners like me. I'm assuming the output you are referring to is before enrichment cutoff.
Hi Kevin, thanks for this great tutorial. However, can you please elaborate on how did you calculated the gene score? Thanks
Hello Suren, the gene score is the negative log10 of the adjusted p-value for each gene, calculated in this line:
Thanks Kevin. I understand now.
Hi Kevin, Can you mention your papers where such figures were used? How to show the foldchange value (or expression intensity) instead of green rectangular boxed. Thanks
This was originally used in a PhD thesis, so, not published as such. If you just want a heatmap based on fold-changes, I would just generate a fold-change matrix and then used ComplexHeatmap in the standard way.
thanks Kevin .....:)
Hello @Kevin how can i download DAVID enrichment file in .david format ?
can we do the same with multiple samples instead of genes? Like samples as column and pathway as row?
This is a really great piece of code, Kevin, thank you! I had to modify the first few lines from the files on github to get things going, so sharing here in case the holdup of unique row names is catching anyone else new to R. Reminder to everyone that to pull a csv from github you just need to open the raw file, save as excel and excel does a pretty good job with the rest, don't let troubleshooting some small problems with the csv file hold you back from experimenting with this type of heatmap.
The bar for normalised enrichment score is overlaying the enriched terms. what could be the reason for this error?
You will have to elaborate on what is the error