Sankey plot for unequal node sizes
0
0
Entering edit mode
29 days ago
Luca Arbore ▴ 10

enter image description hereI am trying to plot a host-parasite-pathogen network using a sankey graph and I am having problems scaling the nodes to their occurence (counts). The main issue is that the host count is nested in the parasite counts which is nested in the pathogen detection counts, meaning the host group contains 62 individuals of 3 species, the parasite is in 103 NGS libraries (4 species), showing many detections classified in 19 groups of pathogens present in 95 parasite libraries.

Here is an example:

library(tidyverse)
library(ggsankey) # remotes::install_github("davidsjoberg/ggsankey")

df <- data.frame(
  host_ID = c("CP01","CP01","CP02","CP03","CP03","CP03","CP04","CP04","CP05","CP06","CP07","CP08","CP09"),
  host_sp = c("P. tricuspis", "P. tricuspis","P. tricuspis","S. gigantea","S. gigantea","S. gigantea","undetermined","undetermined","P. tricuspis", "P. tricuspis","P. tricuspis","S. gigantea","P. tricuspis"),
  parasite_libraries = c("N141_01","N141_02","N141_03","N141_04","N141_05","N141_06","N141_07","N141_08","N141_09","N141_10","N141_011", "N141_012", "N141_013"),
  parasite_sp = c("Am. compressum", "Am. compressum","Am. compressum","Ix. rasus","Am. compressum","Ix. rasus","Am. variegatum","Am. variegatum",
                  "Ix. rasus","Am. compressum","Rh. simpsoni","Am. compressum","Am. compressum"),
  Rickettsia = c(0,1,0,1,1,1,1,0,0,1,1,0,0),
  Borrelia = c(0,0,1,0,0,1,0,0,0,1,1,0,0),
  Nairovirus = c(0,0,0,1,0,0,0,0,0,0,0,0,0),
  Flavivirus = c(1,1,0,1,1,1,1,1,0,1,1,1,0),
  Anellovirus = c(1,1,1,0,0,0,0,0,0,0,1,0,0),
  Dicistrovirus = c(1,0,0,0,0,0,1,0,0,1,0,0,0),
  Parvovirus = c(0,0,0,0,0,0,0,0,0,1,1,0,0),
  Circovirus = c(1,0,0,0,1,0,1,1,0,0,1,0,0)
)

df_long <- df %>%
  mutate(across(5:ncol(.), ~ ifelse(. != 0, cur_column(), 0))) %>% 
  pivot_longer(cols = c(5:ncol(.)), names_to = "Taxa", values_to = "detected_taxa")

host_counts <- df %>%
  group_by(host_sp) %>%
  summarise(host_count = n_distinct(host_ID), .groups = 'drop')

parasite_pools <- df %>%
  group_by(parasite_sp) %>%
  summarise(library_count = n_distinct(parasite_libraries), .groups = 'drop')

Detected <- df_long %>%
  filter(!is.na(detected_taxa)) %>% 
  group_by(detected_taxa) %>%
  summarise(detection_count = n(), .groups = "drop") %>%
  rename(Taxa = detected_taxa) 

df_long_2 <- df_long %>%
  left_join(host_counts, by = "host_sp") %>%
  left_join(parasite_pools, by = "parasite_sp") %>%
  left_join(Detected, by = c("Taxa")) %>%
  pivot_longer(host_count:detection_count, values_to = "counts") 

sankey_data <- df_long_2 %>%
  make_long(host_sp, parasite_sp, Taxa, value = counts) 

ggplot(sankey_data, aes(x = x, 
                        next_x = next_x, 
                        node = node, 
                        next_node = next_node, 
                        fill = factor(node),  # Use 'node' for coloring
                        value = value)) +
  geom_sankey(flow.alpha = 0.5, node.color = "black", node.width = 1) +
  geom_sankey_label(aes(label = node), size = 5) +
  theme_sankey(base_size = 16) +
  labs(x = NULL) +
  theme(legend.position = "none")

As you can see the association between host and parasite is more accurate than the one between parasite and pathogen groups, e.g. every parasite block connects to all pathogen blocks, which is false.

The problem is introducted in the pathogen detection counts when pivoting longer. I would like to render the plot with the counts for each category, i.e number of hosts by unique host_ID, number of parasites by unique parasite_library and number of detection of each pathogen group.

Any help would be gratly appreciated

R sankey network ggplot • 308 views
ADD COMMENT
0
Entering edit mode

Appreciate the code and example data to make a reproducible example but you have to indicate the packages. tidyverse for sure, but where is make_long from?

ADD REPLY
0
Entering edit mode

Sorry for that. I have added them in the question. ggsankey is the package I am using for this

ADD REPLY

Login before adding your answer.

Traffic: 1391 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6