In this tutorial, I'll try to provide a brief overview of the capabilities of our enrichment analysis R package pathfindR. The tool is in CRAN and its introductory vignette can be found here. We also have a detailed wiki.
pathfindR is designed to improve enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous gene sets relevant to the condition under study.
I'll be using the example data provided in the package: a gene expression data comparing rheumatoid arthritis patients vs. controls.
Active-Subnetwork-Oriented Pathway Enrichment Analysis
As illustrated below, this workflow takes in a data frame of 3 columns containing: Gene Symbols
, Change Values
(optional) and p values
. It then maps these genes onto a protein-protein interaction network and identifies and filters active subnetworks. Using the genes in each active subnetwork, it performs enrichment analyses. The results are returned as a data frame, optionally presented in an HTML report.
We'll first load the library and the example data:
library(pathfindR)
head(example_pathfindR_input)
Gene.symbol logFC adj.P.Val
1 FAM110A -0.69394 3.4087e-06
2 RNASE2 1.35350 1.0085e-05
3 S100A8 1.54483 3.4664e-05
4 S100A9 1.02809 2.2626e-04
5 TEX261 -0.32360 2.2626e-04
6 ARHGAP17 -0.69193 2.7081e-04
Then we simply run the wrapper function run_pathfindR
for active-subnetwork oriented pathway enrichment analysis:
example_pathfindR_output <- run_pathfindR(example_pathfindR_input)
By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are "KEGG", "Reactome", "BioCarta", "GO-All", "GO-BP", "GO-CC", "GO-MF" and "mmu_KEGG". The gene sets can be specified using the gene_sets
argument. Detailed information on all the arguments of run_pathfindR
is provided here.
The output of run_pathfindR()
is a data frame containing 8 (or 9) columns:
- ID: ID of the enriched term
- Term_Description: Description of the enriched term
- Fold_Enrichment: Fold enrichment value for the enriched term (Calculated using ONLY the input genes)
- occurrence: the number of iterations that the given term was found to enriched over all iterations
- lowest_p: the lowest adjusted-p value of the given term over all iterations
- highest_p: the highest adjusted-p value of the given term over all iterations
- non_Signif_Snw_Genes (OPTIONAL): the non-significant active subnetwork genes, comma-separated
- Up_regulated: the up-regulated genes (as determined by ‘change value' > 0, if the 'change column' was provided) in the input involved in the given term’s gene set, comma-separated. If change column not provided, all affected are listed here.
- Down_regulated: the down-regulated genes (as determined by ‘change value' < 0, if the 'change column' was provided) in the input involved in the given term’s gene set, comma-separated
By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for example_pathfindR_output
is presented below:
Optionally (if output_dir
is provided), the function also creates an HTML report named results.html
under the directory specified by the argument output_dir
. The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame.
Further, visualize_terms()
can be used to generate visualization of enriched terms. If hsa KEGG pathways were chosen, the visualization of each KEGG pathway with the gene nodes colored according to change values is created. The colored human KEGG pathway diagrams are created using the input genes. If another gene set is being used, the graph visualization of interactions of pathway genes (in the PIN) are plotted instead.
Enrichment Term Clustering
Enrichment analysis usually yields a large number of related terms. In order to establish representative terms among similar groups of terms, we provide functionality to cluster the enriched terms, as outlined below:
The wrapper function cluster_enriched_terms()
is used for clustering the enriched terms. This function first calculates the pairwise kappa statistics between the terms in the data frame obtained from run_pathfindR()
. It then clusters the terms and partitions them into relevant clusters. As can be seen in the diagram, there are two options for clustering: (1) hierarchical clustering and (2) fuzzy clustering.
Detailed information on the clustering functionality can be found here.
Hierarchical Clustering
By default, cluster_enriched_terms()
performs hierarchical clustering (using 1 - kappa statistic
as the distance metric), determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative terms are chosen as the ones with the lowest p value in each cluster.
example_pathfindR_output_clustered <- cluster_enriched_terms(example_pathfindR_output)
head(example_pathfindR_output_clustered, 2)
ID Term_Description Fold_Enrichment occurrence lowest_p highest_p Up_regulated
4 hsa04722 Neurotrophin signaling pathway 24.849 4 5.0626e-05 0.00066481
7 hsa04218 Cellular senescence 20.060 2 1.6810e-03 0.00168099
Down_regulated Cluster Status
4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3 1 Representative
7 MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1 1 Member
An enrichment bubble chart grouped by clusters can be plotted via:
enrichment_chart(example_pathfindR_output_clustered, plot_by_cluster = TRUE)
Heuristic Fuzzy Multiple-linkage Partitioning
Alternatively, the fuzzy clustering method (as described in Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.) can be used to obtain fuzzy cluster assignments and representative terms per each cluster:
example_pathfindR_output_clustered <- cluster_enriched_terms(example_pathfindR_output, method = "fuzzy")
Aggregated Term Scores per Sample
The function score_terms()
can be used to calculate the agglomerated z score of each enriched term per sample. This allows the user to individually examine the scores and infer how a term is overall altered (activated or repressed) in a given sample or a group of samples. The function uses:
- The output data frame obtain of
run_pathfindR()
, - The expression matrix used during analysis (differential expression/methylation),
- (Optional for ordering samples) a vector of "case" IDs.
Example usage is presented below:
## Vector of "Case" IDs cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708",
"GSM389711", "GSM389714", "GSM389716", "GSM389717",
"GSM389719", "GSM389721", "GSM389722", "GSM389724",
"GSM389726", "GSM389727", "GSM389730", "GSM389731",
"GSM389733", "GSM389735")
## Calculate scores for representative terms
## and plot heat map using term descriptions
score_matrix <- score_terms(enrichment_table = example_pathfindR_output_clustered[example_pathfindR_output_clustered$Status == "Representative", ],
exp_mat = example_experiment_matrix,
cases = cases,
use_description = TRUE, # default FALSE
label_samples = FALSE) # default = TRUE
pathfindR Analysis with Custom Gene Sets
As of the latest version, pathfindR offers utility functions for obtaining organism-specific PIN data and organism-specific gene sets data via
get_pin_file()
andget_gene_sets_list()
, respectively.
It is possible to use run_pathfindR()
with custom gene sets (including gene sets for non-Homo-sapiens species). Here, we provide an example application of active-subnetwork-oriented enrichment analysis of the target genes of two transcription factors.
We first load and prepare the gene sets:
## CREB target genes
CREB_target_genes <- normalizePath(system.file("extdata/CREB.txt", package = "pathfindR"))
CREB_target_genes <- readLines(CREB_target_genes)[-c(1, 2)] # skip the first two lines
## MYC target genes
MYC_target_genes <- normalizePath(system.file("extdata/MYC.txt", package = "pathfindR"))
MYC_target_genes <- readLines(MYC_target_genes)[-c(1, 2)] # skip the first two lines
## Prep for use
custom_genes <- list(TF1 = CREB_target_genes, TF2 = MYC_target_genes)
custom_descriptions <- c(TF1 = "CREB target genes", TF2 = "MYC target genes")
We next prepare the example input data frame. Because of the way we choose genes, we expect significant enrichment for MYC targets (40 MYC target genes + 10 CREB target genes). Because this is only an example, we also assign each genes random p-values between 0.001 and 0.05.
set.seed(123)
## Select 40 random genes from MYC gene sets and 10 from CREB gene sets
selected_genes <- sample(MYC_target_genes, 40)
selected_genes <- c(selected_genes,
sample(CREB_target_genes, 10))
## Assign random p value between 0.001 and 0.05 for each selected gene
rand_p_vals <- sample(seq(0.001, 0.05, length.out = 5),
size = length(selected_genes),
replace = TRUE)
input_df <- data.frame(Gene_symbol = selected_genes,
p_val = rand_p_vals)
Finally, we perform active-subnetwork-oriented enrichment analysis via run_pathfindR() using the custom genes as the gene sets:
custom_result <- run_pathfindR(input_df,
gene_sets = "Custom",
custom_genes = custom_genes,
custom_descriptions = custom_descriptions,
max_gset_size = Inf, # DO NOT LIMIT GENE SET SIZE
iterations = 1, # for demo, setting number of iterations to 1
output_dir = "misc/v1.4/CREB_MYC")
ID Term_Description Fold_Enrichment occurrence lowest_p highest_p
1 TF2 MYC target genes 15.133 1 9.3559e-14 9.3559e-14
2 TF1 CREB target genes 17.346 1 5.0187e-05 5.0187e-05
Up_regulated
1 AGRP, ATP6V1C1, C19orf54, CD3EAP, CNPY3, EPB41, FOXD3, FXR1, GLA, HNRNPD, HOXA7, IL1RAPL1, KDM6A, LONP1, LTBR, MDN1, MICU1, NET1, NEUROD6, NMNAT2, NOL6, NUDC, PEPD, PKN1, PSMB3, RPS19, RPS28, RRS1, SLC9A5, SMC3, STC2, TESK2, TNPO2, TOPORS, TPM2, TSSK3, WBP2, ZBTB8OS, ZFYVE26, ZHX2
2 BRAF, DIO2, ELAVL1, EPB41, FAM65A, FOXD3, NEUROD6, NOC4L, NUPL2, PPP1R15A, SYNGR3, TIPRL
Down_regulated
1
2
Closing remarks
In this tutorial I tried to provide an overview of pathfindR. I try to update this tutorial as certain changes and additions to the package are made. To keep this tutorial as brief as possible, I had to omit certain frequently asked issues (such as pathfindR analysis using non-human data). For such questions and any issues, please visit our website or submit an issue to our issues page on GitHub.
Also check out the vignettes included in the package via browseVignettes("pathfindR")
or find them here:
Your post is great and I need this information and tutorial. I appreciate.
Question, some error code like: "java development kit is needed" pumped up when I was running the run_pathfindR. Wondering how can I fix this? I already installed java on my Mac before!
Hi, can you paste the error here? Do you have JDK installed? Actually JRE should have been sufficient but please give it a try if not installed.
Hello there! I cannot go past this passage: pws_table <- RA_clustered[RA_clustered$Status == "Representative", ] RA_exp_mat <- data.frame(RA_input$logFC, row.names=RA_input$Gene.symbol) score_matrix <- calculate_pw_scores(RA_clustered, RA_exp_mat)
I set up my RA_exp_mat as in the example, but I think that my expression (which is in log2foldchange) is not cut for the job.. Isn't it? My score_Matrix is full of NAs
Hey,
The expression matrix is the matrix you used to perform the differential expression test. Something like:
For the example data this can be loaded by:
You provided the LFC values only therefore the scores could not be calculated.
I'm sorry but the "data(RA_exp_mat, package = "pathfindR")" is an example data, isn't it?
I'm guessing I should use my own expression data, isn't it? Does pathfinderR creates this expression data? Shall I calculate it myself? Could you happen to tell me how?
Yes you should use your own expression data. This is the data you used to perform differential expression analysis and obtain the logFC and p values. pathfindR does not (cannot) create the expression data as it only uses the logFC values provided for enrichment. The expression matrix (input of limma, DESeq2 or EdgeR etc.) should be provided if you want to calculate pathway scores.
aaaah! you are talking about the HtseqCounts! Got it, thank you ☺️
For some datasets it doesn't find any pathway. I think this is the related error 🤔 WARNING: Unexpected number format in experiment file line 1, discarded
P.S. pathFindeR does work in my hands so I guess it is just not able to find any related pathway to this dataset.
In some cases (especially if the number of input genes is low), pathfindR may not find any active subnetworks and therefore no enriched pathways. The message stating "Could not find any interactions for XX (XX%) genes in the PIN." is more relevant for genes not mapping onto the PIN.
At first place, thank you so much for this tutorial. I find it extremely useful. Would the analysis work with p-values instead of adjusted p-values?
thank you. glad you like the tutorial. Performing the analysis with p-values would be wrong, I would suggest instead to change the p value threshold (i.e. to change the argument
p_val_threshold
) to a higher value (e.g. 0.1).Hello,
Thanks for the nice tutorial. I have been using pathfindR successfully and was able to generate nice enrichments from my datasets. I have a question regarding the gene lists that are not found in the interaction – i.e. “Could not find any interactions for 36 (14.63%) genes in the PIN”.
I would like to know what these genes are. I have had a look on all the output .html files including: - “conversion.html”, - “results.html” and - "enriched_pathway.html" but all thee files are empty (they only contain the heading).
Do you know what the issue could be and how extract the 36 genes?
Thanks a lot in advance!
Hey Fil,
Glad you like the tutorial. I can't immediately think of any reason why the HTML outputs are empty but do open up an issue on our GitHub page if this persists.
For determining the 36 genes without interactions in the PIN, you may use the following:
Hope this helps, -E
that's great, Thanks a lot!!!
Hello @egeulgen Thank you for providing an interesting and easy-to-follow tutorial. I tried the given script for custom gene set analysis and I am getting the following error, could you please guide me if I am doing something wrong? I am copying-and-pasting the exact code that you shared:
Hello, Glad you found the tutorial useful. The function did not raise an error but rather gives a warning that no enriched gene sets were identified. I cannot tell, given only the above warnings, what might be the issue but I think the custom gene set you use for the analysis might be the problem. If you think that given your input and the custom gene set, you should find enriched term(s), please create an issue on our github page
Hi, thank you for your prompt response. Please note that I have actually tried your given (also above-mentioned) example of custome geneset/pathway creation and finding its enrichment. And I am getting the warning message that I posted in my question, however, I think I should get a table of enrichment for "MYC and CREB target genes".
hello again. Thank you for pointing that out! After the initial tutorial post, we made some changes on how the active subnetworks are filtered (that is more stringent). That's why the example above didn't work. I've updated the above example and made sure it works, should be fine now :)
Hello, really liking the tool so far. What base level is the log in logFC? Do you suggest log10 or log2? Thanks, A
Hey. Great to hear you liked the package. The logFC values are optional and are only used during KEGG pathway diagram visualization via pathview, which scales the values between -1 and 1. So the base should not matter. The active subnetwork search only utilizes the gene symbols and associated p values.
So the logFCs are only used to provide a direction of the expression change e.g. separate upregulated and down-regulated genes? I'm asking since I used spearman rhos instead of the fold changes which works very well by the way ;)
That is correct. The change values are only used for visualization purposes (indicating up/down change).
Hello @egeulgen,
many thanks for your great tutorials. I really like PathfindR! I have just one problem, how can I change the colors of the bubbles of the enrichment chart? From white to red is not really applicable for me. Changing colors for score_terms() is super easy. But how is it working for enrichment_chart()?
Sorry this question is maybe very basic, but I am also very new in the field of RNA seq and using R. Many thanks in advance. Best, Kamil.
Hey Kamil,
The result of
enrichment_chart()
is also a ggplot object so you can change the gradient colors ascale_color_gradient()
:Wow that was super easy! Many thanks.
Dear, thanks a lot for this tutorial. But, If I want to get the p adjusted for the pathways in the output result as you know the output contain lowest p value and highest p value not the p adjusted. How I can do that ? I tried to use p.adjust function in R, is it right way or not ? if not, is there any other way?
Hi,
I have two similar DEG lists, one worked very well with pathfindR, but the 2nd one failed for no apparent reason. Do you have any idea why I am getting this error?
This looks like an issue with the code. Would you mind opening an issue here: https://github.com/egeulgen/pathfindR/issues
Hi, sorry for the late reply. It seems ok so far, now working. I guess pathfindR doesn't like a short list of DEGs (< 150 genes). If anything happens, I'll let you know by opening a case in github. Thank you.
Hi, Thanks for the useful package and a wonderful tutorial. I have a small doubt. I assume that we only need to provide a list of differentially expressed genes along with Fold change and p values. What if I provide the complete output of DESeq as Input? Will it only consider the differentially expressed genes as there is a p_val_threshold?
Hey, thank you for your kind words. You're correct, if you provide the complete (not filtered) differential expression results, pathfindR filters the list per the
p_val_threshold
and will only consider those genes.On my system (Ubuntu 18.04) Pathfinder is unable to install due to an error raised by
magick
Do you have advices how to solve this error?
Please read the error message - you're missing magick on your system. Google "Ubuntu install magick".