Sankey + dot plot
1
0
Entering edit mode
9 weeks ago
Sofia • 0

Hi,

I'm having trouble with Sankey and dot plots. Could anyone help me with the R script to draw them?

Thank you

sankey and dot plot

R sankey gene-ontology GO • 859 views
ADD COMMENT
1
Entering edit mode

What is the problem, where is the code you are trying? Provide code, example data, expected output.

ADD REPLY
1
Entering edit mode

I agree with zx8754 you haven't indicated what the problem is. To me, the sankey looks reasonable given how much data you are trying to plot, as does the dot plot, though I don't know exactly what you are trying to show so it's hard to say if it's appropriate.

ADD REPLY
0
Entering edit mode

Hi,

This is the header of my data: pathway GeneRatio FDR genesID count

In the genesID, I have a list of genes separated by two spaces.

This is my script and output

install.packages("ggplot2")
install.packages("dplyr")
install.packages("ggalluvial")  # For Sankey-like plots
install.packages("tidyr")        # For unnest function

library(dplyr)  # For data manipulation
library(tidyr)  # For separate_rows function

# Load your semicolon-separated CSV
df <- read.csv("data.csv", sep = ";")

df <- df %>%
  mutate(FDR = -log10(FDR))

# Check if the GeneID column exists
if (!"GeneID" %in% colnames(df)) {
  stop("GeneID column not found in the data frame.")
}

# Split GeneID using strsplit and expand the data frame manually
df_expanded <- df[rep(1:nrow(df), sapply(strsplit(as.character(df$GeneID), "/"), length)), ]
df_expanded$GeneID <- unlist(strsplit(as.character(df$GeneID), "/"))

# Ensure other columns are repeated correctly
other_cols <- df %>%
  select(-GeneID) %>%
  slice(rep(1:n(), times = sapply(strsplit(as.character(df$GeneID), "/"), length)))

# Combine other columns with the expanded GeneID
df_expanded <- cbind(other_cols, GeneID = df_expanded$GeneID)



library(ggplot2)
library(ggalluvial)

# Create a combined dot-Sankey plot
dot_sankey_plot <- ggplot(df_expanded, aes(axis1 = Pathway, axis2 = GeneID, y = Count)) +
  geom_alluvium(aes(fill = GeneRatio)) +  # Add the alluvium for Sankey-like flow
  geom_point(aes(x = GeneID, y = FDR, size = Count, color = GeneRatio), shape = 21) +  # Points for significance
  scale_color_gradient(low = "blue", high = "red") +  # Gradient for GeneRatio
  labs(title = "Dot-Sankey Plot of Pathways and Genes",
       x = "Pathway",
       y = "Count / FDR",
       fill = "Gene Ratio") +
  theme_minimal() +
  theme(legend.position = "right")

# Display the plot
print(dot_sankey_plot)

enter image description here

ADD REPLY
0
Entering edit mode

You still haven't elaborated on what your problem is. Are you trying to recreate the top image and the bottom is what you end up with?

ADD REPLY
0
Entering edit mode

I think both since this is geom_alluvium this might be for the sankey

ADD REPLY
0
Entering edit mode

Thank you for your attention. Yes, the top is what I need and the bottom is my output after the script, using a table with pathway, GeneRatio, FDR, genesID, count as data.

ADD REPLY
0
Entering edit mode

Please add your dataframe that would be helpful for others to recreate the issue and use you code may be small subset of your data would be helpful

head(dput(df))
ADD REPLY
0
Entering edit mode

Hi, thank you for your attention. Here is my dataframe as you asked

structure(list(Pathway = c("Alpha-amino acid biosynthetic proc. ", 
"Carboxylic acid biosynthetic proc. ", "Organic acid biosynthetic proc. ", 
"Small molecule biosynthetic proc. ", "Carboxylic acid metabolic proc. ", 
"Oxoacid metabolic proc. ", "Organic acid metabolic proc. ", 
"Small molecule metabolic proc. ", "Cellular catabolic proc. ", 
"Catabolic proc. "), GeneRatio = c(0.133333333333333, 0.0668693009118541, 
0.0664652567975831, 0.0558375634517767, 0.0427263479145473, 0.0426164519326065, 
0.0418287937743191, 0.0377066115702479, 0.0324729392173189, 0.0299465240641711
), FDR = c(3.8517556300416, 4.94174664623985, 4.94174664623985, 
5.91926984684473, 5.20765687373531, 5.20765687373531, 5.19008290917224, 
7.78847331776209, 5.91926984684473, 5.20765687373531), GeneID = c("ASNS/CTH/ASS1/PSAT1/CBS/ATP2B4/OAT/PHGDH/MTHFD1/PLOD3", 
"ASNS/CTH/ASS1/ACLY/PSAT1/OXSM/CBS/NA/CASP1/ATP2B4/MTHFD1/KYNU/OLAH/AKR1C3/OAT/PHGDH/PLOD3/HSD17B12/FABP5/ERLIN2/PRMT3/CBR1", 
"ASNS/CTH/ASS1/ACLY/PSAT1/OXSM/CBS/NA/CASP1/ATP2B4/MTHFD1/KYNU/OLAH/AKR1C3/OAT/PHGDH/PLOD3/HSD17B12/FABP5/ERLIN2/PRMT3/CBR1", 
"ASNS/COQ9/COQ5/CTH/COQ6/ASS1/ACLY/PSAT1/OXSM/CBS/FDXR/PC/NA/CASP1/ATP2B4/MTHFD1/KYNU/ERLIN2/OLAH/AKR1C3/OAT/PHGDH/PLOD3/ALDOC/HSD17B12/ADK/FABP5/PTPN2/PRMT3/SIRT5/AACS/ACSS3/CBR1", 
"ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/CBR1", 
"ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/ABHD14B/CBR1", 
"ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/ABHD14B/CBR1", 
"ACAA1/OAT/ASNS/ARG2/COQ9/MTHFD1/QPRT/FAH/NAMPT/ALDOC/COQ5/KYNU/CTH/COQ6/NUDT10/PPCS/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/FLAD1/FDXR/DCXR/PC/IMPDH2/FUCA1/CA13/AKR1C3/NUDT11/GFPT1/NA/DLD/SNX17/CASP1/AIFM2/ATP2B4/DGUOK/ALDH1L2/UCK2/ERLIN2/OLAH/CBR1/FOXK1/ACAD9/GPX1/BAD/LYPLA2/MAT2B/SCARB1/AACS/PHGDH/PLOD3/HDLBP/CAT/SLC27A1/ HSD17B12/ADK/FABP5/PTPN2/NUDT19/BRAT1/PRMT3/SIRT5/DIP2A/ERH/GSS/ACSS3/ABHD14B/NEIL2", 
"ACAA1/OAT/NEDD4/ARG2/DIS3/ACHE/HMOX1/PYGL/TIMP1/QPRT/FAH/UBE2S/SKP1/KYNU/CLU/CAT/NUDT10/PSME3/HSD17B4/STAM/PCYOX1L/CLGN/FBXL18/AGL/ERAP1/FBXO22/TMUB2/GUSB/UBE2L3/LONP1/NUDT11/PSMB10/FIS1/GPX1/NA/HGS/LYPLA2/ATP2B4/GPC1/FAP/DLD/PON2/WFS1/RBM24/OPTN/RBM38/HSPBP1/GSTM3/ALDH1L2/TGFB1I1/ERLIN2/DFFA/CBS/FOXK1/BCL2/TBK1/AKR1C3/SCARB1/SRPX/UCHL3/IDH1/RCN3/EPHX1/NUDT19/STX12/UFM1/ITGB4/CPTP/METTL3/FUCA1/SNX5/BAD/HAX1/PRKCA/NEIL2/MTMR14", 
"LYPLA2/ACAA1/OAT/NEDD4/ARG2/DIS3/ACHE/HMOX1/PYGL/TIMP1/QPRT/FAH/UBE2S/ALDOC/SKP1/KYNU/CLU/CAT/NUDT10/PSME3/HSD17B4/STAM/PCYOX1L/CLGN/FBXL18/AGL/ERAP1/FBXO22/TMUB2/GUSB/FUCA1/UBE2L3/LONP1/NUDT11/PSMB10/FIS1/GPX1/NA/SNX17/HGS/ATP2B4/GPC1/FAP/DLD/PON2/WFS1/RBM24/UCHL3/OPTN/RBM38/HSPBP1/GSTM3/ALDH1L2/TGFB1I1/ERLIN2/DFFA/CBS/FOXK1/BCL2/TBK1/AKR1C3/HSD17B11/BAD/FYN/SCARB1/SRPX/CDK4/IDH1/RCN3/EPHX1/NUDT19/STX12/UFM1/ITGB4/CPTP/METTL3/SNX5/NAGLU/HAX1/PRKCA/NEIL2/MTMR14"
), Count = c(10L, 22L, 22L, 33L, 42L, 43L, 43L, 73L, 78L, 84L
)), class = "data.frame", row.names = c(NA, -10L))
                               Pathway  GeneRatio      FDR
1 Alpha-amino acid biosynthetic proc.  0.13333333 3.851756
2  Carboxylic acid biosynthetic proc.  0.06686930 4.941747
3     Organic acid biosynthetic proc.  0.06646526 4.941747
4   Small molecule biosynthetic proc.  0.05583756 5.919270
5     Carboxylic acid metabolic proc.  0.04272635 5.207657
6             Oxoacid metabolic proc.  0.04261645 5.207657
                                                                                                                                                                                                                                                GeneID
1                                                                                                                                                                                                ASNS/CTH/ASS1/PSAT1/CBS/ATP2B4/OAT/PHGDH/MTHFD1/PLOD3
2                                                                                                                           ASNS/CTH/ASS1/ACLY/PSAT1/OXSM/CBS/NA/CASP1/ATP2B4/MTHFD1/KYNU/OLAH/AKR1C3/OAT/PHGDH/PLOD3/HSD17B12/FABP5/ERLIN2/PRMT3/CBR1
3                                                                                                                           ASNS/CTH/ASS1/ACLY/PSAT1/OXSM/CBS/NA/CASP1/ATP2B4/MTHFD1/KYNU/OLAH/AKR1C3/OAT/PHGDH/PLOD3/HSD17B12/FABP5/ERLIN2/PRMT3/CBR1
4                                                                   ASNS/COQ9/COQ5/CTH/COQ6/ASS1/ACLY/PSAT1/OXSM/CBS/FDXR/PC/NA/CASP1/ATP2B4/MTHFD1/KYNU/ERLIN2/OLAH/AKR1C3/OAT/PHGDH/PLOD3/ALDOC/HSD17B12/ADK/FABP5/PTPN2/PRMT3/SIRT5/AACS/ACSS3/CBR1
5         ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/CBR1
6 ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/ABHD14B/CBR1
  Count
1    10
2    22
3    22
4    33
5    42
6    43
ADD REPLY
3
Entering edit mode
9 weeks ago
dthorbur ★ 2.5k

Here is a rough outline you can start to play with using the dataset you provided. It creates separate Sankey and dot plots, but you can create a vector to reorder the pathway so that both plots are ordered correctly.

library(ggalluvial)
library(ggplot2)
library(tidyr)

## Loading toy dataset
dat <- structure(list(
  Pathway = c("Alpha-amino acid biosynthetic proc.", "Carboxylic acid biosynthetic proc.", 
    "Organic acid biosynthetic proc.", "Small molecule biosynthetic proc.", 
    "Carboxylic acid metabolic proc.", "Oxoacid metabolic proc."), 
  GeneRatio = c(0.133333, 0.066869, 0.066465, 0.055838, 0.042726, 0.042616), 
  FDR = c(3.851756, 4.941747, 4.941747, 5.919270, 5.207657, 5.207657), 
  GeneID = c("ASNS/CTH/ASS1/PSAT1/CBS/ATP2B4/OAT/PHGDH/MTHFD1/PLOD3", 
    "ASNS/CTH/ASS1/ACLY/PSAT1/OXSM/CBS/NA/CASP1/ATP2B4/MTHFD1/KYNU/OLAH/AKR1C3/OAT/PHGDH/PLOD3/HSD17B12/FABP5/ERLIN2/PRMT3/CBR1", 
    "ASNS/CTH/ASS1/ACLY/PSAT1/OXSM/CBS/NA/CASP1/ATP2B4/MTHFD1/KYNU/OLAH/AKR1C3/OAT/PHGDH/PLOD3/HSD17B12/FABP5/ERLIN2/PRMT3/CBR1", 
    "ASNS/COQ9/COQ5/CTH/COQ6/ASS1/ACLY/PSAT1/OXSM/CBS/FDXR/PC/NA/CASP1/ATP2B4/MTHFD1/KYNU/ERLIN2/OLAH/AKR1C3/OAT/PHGDH/PLOD3/ALDOC/HSD17B12/ADK/FABP5/PTPN2/PRMT3/SIRT5/AACS/ACSS3/CBR1", 
    "ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/CBR1", 
    "ACAA1/OAT/ASNS/ARG2/QPRT/FAH/ALDOC/KYNU/CTH/ASS1/ACLY/HSD17B4/PSAT1/IDH1/EPHX1/OXSM/CBS/PC/AKR1C3/NA/DLD/CASP1/ATP2B4/MTHFD1/ALDH1L2/OLAH/FOXK1/ACAD9/GPX1/LYPLA2/AACS/PHGDH/PLOD3/SLC27A1/HSD17B12/FABP5/GFPT1/NUDT19/ERLIN2/PRMT3/GSS/ABHD14B/CBR1"),
  Count = c(10L, 22L, 22L, 33L, 42L, 43L)), class = "data.frame", row.names = c(NA, -6L))

## Splitting the GeneID by /
sankey_dat <- dat %>% 
  separate_rows(GeneID, sep = "/")

## Draw the Sankey network
ggplot(sankey_dat, aes(axis1 = GeneID, axis2 = Pathway, y = Count)) +
  geom_alluvium(aes(fill = Pathway), colour = "black") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Pathway", "GeneID"), labels = NULL, expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0), labels = NULL) +
  labs(y = "Gene Count", x = "", fill = "Pathway") +
  theme_classic(base_size = 15) +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())

## Draw the dot plot
ggplot(dat, aes(x = GeneRatio, y = reorder(Pathway, GeneRatio), size = Count, fill = FDR)) +
  geom_point(shape=21, colour = "black") +
  scale_fill_gradient(low = "yellow", high = "blue") +
  labs(x = "Gene Ratio", y = "Pathway", color = "FDR", size = "Gene Count") +
  theme_classic()
ADD COMMENT
0
Entering edit mode

Thank you so much dthorbur, this is so near what I need. Can I ask you why you use "L" in the count variable? Count = c(10L, 22L, 22L, 33L, 42L, 43L)), class = "data.frame", row.names = c(NA, -6L))

Have a beautiful day

ADD REPLY
2
Entering edit mode

That's how R identifies that the data was an integer. Here is a discussion on the matter on StackOverflow.

ADD REPLY

Login before adding your answer.

Traffic: 2343 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6