Question

Sankey plot for unequal node sizes

0

Entering edit mode

29 days ago

Luca Arbore ▴ 10

enter image description here I am trying to plot a host-parasite-pathogen network using a sankey graph and I am having problems scaling the nodes to their occurence (counts). The main issue is that the host count is nested in the parasite counts which is nested in the pathogen detection counts, meaning the host group contains 62 individuals of 3 species, the parasite is in 103 NGS libraries (4 species), showing many detections classified in 19 groups of pathogens present in 95 parasite libraries.

Here is an example:

library(tidyverse)
library(ggsankey) # remotes::install_github("davidsjoberg/ggsankey")

df <- data.frame(
  host_ID = c("CP01","CP01","CP02","CP03","CP03","CP03","CP04","CP04","CP05","CP06","CP07","CP08","CP09"),
  host_sp = c("P. tricuspis", "P. tricuspis","P. tricuspis","S. gigantea","S. gigantea","S. gigantea","undetermined","undetermined","P. tricuspis", "P. tricuspis","P. tricuspis","S. gigantea","P. tricuspis"),
  parasite_libraries = c("N141_01","N141_02","N141_03","N141_04","N141_05","N141_06","N141_07","N141_08","N141_09","N141_10","N141_011", "N141_012", "N141_013"),
  parasite_sp = c("Am. compressum", "Am. compressum","Am. compressum","Ix. rasus","Am. compressum","Ix. rasus","Am. variegatum","Am. variegatum",
                  "Ix. rasus","Am. compressum","Rh. simpsoni","Am. compressum","Am. compressum"),
  Rickettsia = c(0,1,0,1,1,1,1,0,0,1,1,0,0),
  Borrelia = c(0,0,1,0,0,1,0,0,0,1,1,0,0),
  Nairovirus = c(0,0,0,1,0,0,0,0,0,0,0,0,0),
  Flavivirus = c(1,1,0,1,1,1,1,1,0,1,1,1,0),
  Anellovirus = c(1,1,1,0,0,0,0,0,0,0,1,0,0),
  Dicistrovirus = c(1,0,0,0,0,0,1,0,0,1,0,0,0),
  Parvovirus = c(0,0,0,0,0,0,0,0,0,1,1,0,0),
  Circovirus = c(1,0,0,0,1,0,1,1,0,0,1,0,0)
)

df_long <- df %>%
  mutate(across(5:ncol(.), ~ ifelse(. != 0, cur_column(), 0))) %>% 
  pivot_longer(cols = c(5:ncol(.)), names_to = "Taxa", values_to = "detected_taxa")

host_counts <- df %>%
  group_by(host_sp) %>%
  summarise(host_count = n_distinct(host_ID), .groups = 'drop')

parasite_pools <- df %>%
  group_by(parasite_sp) %>%
  summarise(library_count = n_distinct(parasite_libraries), .groups = 'drop')

Detected <- df_long %>%
  filter(!is.na(detected_taxa)) %>% 
  group_by(detected_taxa) %>%
  summarise(detection_count = n(), .groups = "drop") %>%
  rename(Taxa = detected_taxa) 

df_long_2 <- df_long %>%
  left_join(host_counts, by = "host_sp") %>%
  left_join(parasite_pools, by = "parasite_sp") %>%
  left_join(Detected, by = c("Taxa")) %>%
  pivot_longer(host_count:detection_count, values_to = "counts") 

sankey_data <- df_long_2 %>%
  make_long(host_sp, parasite_sp, Taxa, value = counts) 

ggplot(sankey_data, aes(x = x, 
                        next_x = next_x, 
                        node = node, 
                        next_node = next_node, 
                        fill = factor(node),  # Use 'node' for coloring
                        value = value)) +
  geom_sankey(flow.alpha = 0.5, node.color = "black", node.width = 1) +
  geom_sankey_label(aes(label = node), size = 5) +
  theme_sankey(base_size = 16) +
  labs(x = NULL) +
  theme(legend.position = "none")

As you can see the association between host and parasite is more accurate than the one between parasite and pathogen groups, e.g. every parasite block connects to all pathogen blocks, which is false.

The problem is introducted in the pathogen detection counts when pivoting longer. I would like to render the plot with the counts for each category, i.e number of hosts by unique host_ID, number of parasites by unique parasite_library and number of detection of each pathogen group.

Any help would be gratly appreciated

R sankey network ggplot • 308 views

ADD COMMENT • link 26 days ago by Luca Arbore ▴ 10

0

Entering edit mode

Appreciate the code and example data to make a reproducible example but you have to indicate the packages. tidyverse for sure, but where is make_long from?

ADD REPLY • link 29 days ago by ATpoint 87k

0

Entering edit mode

Sorry for that. I have added them in the question. ggsankey is the package I am using for this

ADD REPLY • link 29 days ago by Luca Arbore ▴ 10