I am trying to plot a host-parasite-pathogen network using a sankey graph and I am having problems scaling the nodes to their occurence (counts). The main issue is that the host count is nested in the parasite counts which is nested in the pathogen detection counts, meaning the host group contains 62 individuals of 3 species, the parasite is in 103 NGS libraries (4 species), showing many detections classified in 19 groups of pathogens present in 95 parasite libraries.
Here is an example:
library(tidyverse)
library(ggsankey) # remotes::install_github("davidsjoberg/ggsankey")
df <- data.frame(
host_ID = c("CP01","CP01","CP02","CP03","CP03","CP03","CP04","CP04","CP05","CP06","CP07","CP08","CP09"),
host_sp = c("P. tricuspis", "P. tricuspis","P. tricuspis","S. gigantea","S. gigantea","S. gigantea","undetermined","undetermined","P. tricuspis", "P. tricuspis","P. tricuspis","S. gigantea","P. tricuspis"),
parasite_libraries = c("N141_01","N141_02","N141_03","N141_04","N141_05","N141_06","N141_07","N141_08","N141_09","N141_10","N141_011", "N141_012", "N141_013"),
parasite_sp = c("Am. compressum", "Am. compressum","Am. compressum","Ix. rasus","Am. compressum","Ix. rasus","Am. variegatum","Am. variegatum",
"Ix. rasus","Am. compressum","Rh. simpsoni","Am. compressum","Am. compressum"),
Rickettsia = c(0,1,0,1,1,1,1,0,0,1,1,0,0),
Borrelia = c(0,0,1,0,0,1,0,0,0,1,1,0,0),
Nairovirus = c(0,0,0,1,0,0,0,0,0,0,0,0,0),
Flavivirus = c(1,1,0,1,1,1,1,1,0,1,1,1,0),
Anellovirus = c(1,1,1,0,0,0,0,0,0,0,1,0,0),
Dicistrovirus = c(1,0,0,0,0,0,1,0,0,1,0,0,0),
Parvovirus = c(0,0,0,0,0,0,0,0,0,1,1,0,0),
Circovirus = c(1,0,0,0,1,0,1,1,0,0,1,0,0)
)
df_long <- df %>%
mutate(across(5:ncol(.), ~ ifelse(. != 0, cur_column(), 0))) %>%
pivot_longer(cols = c(5:ncol(.)), names_to = "Taxa", values_to = "detected_taxa")
host_counts <- df %>%
group_by(host_sp) %>%
summarise(host_count = n_distinct(host_ID), .groups = 'drop')
parasite_pools <- df %>%
group_by(parasite_sp) %>%
summarise(library_count = n_distinct(parasite_libraries), .groups = 'drop')
Detected <- df_long %>%
filter(!is.na(detected_taxa)) %>%
group_by(detected_taxa) %>%
summarise(detection_count = n(), .groups = "drop") %>%
rename(Taxa = detected_taxa)
df_long_2 <- df_long %>%
left_join(host_counts, by = "host_sp") %>%
left_join(parasite_pools, by = "parasite_sp") %>%
left_join(Detected, by = c("Taxa")) %>%
pivot_longer(host_count:detection_count, values_to = "counts")
sankey_data <- df_long_2 %>%
make_long(host_sp, parasite_sp, Taxa, value = counts)
ggplot(sankey_data, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node), # Use 'node' for coloring
value = value)) +
geom_sankey(flow.alpha = 0.5, node.color = "black", node.width = 1) +
geom_sankey_label(aes(label = node), size = 5) +
theme_sankey(base_size = 16) +
labs(x = NULL) +
theme(legend.position = "none")
As you can see the association between host and parasite is more accurate than the one between parasite and pathogen groups, e.g. every parasite block connects to all pathogen blocks, which is false.
The problem is introducted in the pathogen detection counts when pivoting longer. I would like to render the plot with the counts for each category, i.e number of hosts by unique host_ID, number of parasites by unique parasite_library and number of detection of each pathogen group.
Any help would be gratly appreciated
Appreciate the code and example data to make a reproducible example but you have to indicate the packages.
tidyverse
for sure, but where ismake_long
from?Sorry for that. I have added them in the question. ggsankey is the package I am using for this