Creating binary matrix of absence/presence of genes in multiple column lists
5.8 years ago
lmu ▴ 20

Hello,

I'm very new to R. I have a large data set recording the presence/absence of enriched genes, stored as lists of gene names for 10 different cell lines. I am trying to display the overlap in significantly enriched genes across the cell lines using the UpSetR package.

Currently my data looks like this, except that each list is between 200 and 900 genes long:

[Image: current data format]

Instead, I want the data in a binary matrix format that indicates the presence/absence of each unique gene from the overall set in each individual cell line's list:

[Image: desired data format]

I have been able to compile a reference column of all the unique genes across the three lists. However, I am now stuck on how to use that reference list to convert the individual gene lists into a binary format, and was wondering if someone could point me in the right direction.

Many thanks in advance for any help!

upsetr R • 5.2k views

Hi there, have you found a solution for this task? I am in the same situation.

5.8 years ago
> df = data.frame(
+     line1 = sample(LETTERS, 5),
+     line2 = sample(LETTERS, 5),
+     line3 = sample(LETTERS, 5)
+ )
> df
  line1 line2 line3
1     P     T     P
2     B     H     A
3     A     F     C
4     K     P     N
5     G     L     Y
> library(dplyr)
> library(tidyr)
> library(tibble)
> as.data.frame(t(df)) %>%
+     rownames_to_column(var = "Gene") %>%
+     gather(value,variable,-Gene) %>%
+     spread(variable,value)  %>%
+     mutate_at(vars(-Gene), ~ ifelse(is.na(.), 0, 1))
   Gene A B C F G H K L N P T Y
1 line1 1 1 0 0 1 0 1 0 0 1 0 0
2 line2 0 0 0 1 0 1 0 1 0 1 1 0
3 line3 1 0 1 0 0 0 0 0 1 1 0 1
Warning message:
attributes are not identical across measure variables;
they will be dropped 
20 months ago

cpad already answered it, but here's an answer using more contemporary tidyverse syntax.

Example data:

df <- structure(list(LINE1 = c("A", "B", "C", NA, NA), LINE2 = c("B", 
"D", "E", "F", NA), LINE3 = c("C", "D", "G", "H", "I")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -5L))

# A tibble: 5 × 3
  LINE1 LINE2 LINE3
  <chr> <chr> <chr>
1 A     B     C    
2 B     D     D    
3 C     E     G    
4 NA    F     H    
5 NA    NA    I

Tidyverse solution:

library("tidyr")
library("dplyr")

df |>
  pivot_longer(everything()) |>
  drop_na() |>
  mutate(binary=1L) |>
  pivot_wider(names_from=value, values_from=binary, values_fill=0L)

# A tibble: 3 × 10
  name      A     B     C     D     E     G     F     H     I
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 LINE1     1     1     1     0     0     0     0     0     0
2 LINE2     0     1     0     1     1     0     1     0     0
3 LINE3     0     0     1     1     0     1     0     1     1
20 months ago
zx8754 12k

Using base R's stack() and table() (example data is from rpolicastro's answer):

as.data.frame.matrix(table(stack(df)))
#   LINE1 LINE2 LINE3
# A     1     0     0
# B     1     1     0
# C     1     0     1
# D     0     1     1
# E     0     1     0
# F     0     1     0
# G     0     0     1
# H     0     0     1
# I     0     0     1
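
Since the question mentions already having a reference column of unique genes, here is a minimal base R sketch of using that reference directly with %in%. It assumes the lists sit in a data frame padded with NA, as in the example data above (recreated here as a plain data.frame); the names gene_lists, all_genes and binary are placeholders for illustration, not anything from the original post.

```r
# Example data in the same shape as above: one column per cell line, NA-padded.
df <- data.frame(
  LINE1 = c("A", "B", "C", NA, NA),
  LINE2 = c("B", "D", "E", "F", NA),
  LINE3 = c("C", "D", "G", "H", "I"),
  stringsAsFactors = FALSE
)

# Turn each column into a vector of its unique, non-missing gene names.
gene_lists <- lapply(df, function(col) unique(col[!is.na(col)]))

# Reference vector of every unique gene across all lists.
all_genes <- sort(unique(unlist(gene_lists)))

# For each cell line, test membership of every reference gene:
# 1 if the gene appears in that list, 0 otherwise.
binary <- vapply(
  gene_lists,
  function(genes) as.integer(all_genes %in% genes),
  integer(length(all_genes))
)
rownames(binary) <- all_genes
binary <- as.data.frame(binary)

binary
```

Because %in% only tests membership, a gene repeated within one list still gives 1, whereas the table() approach would report a count of 2 for it. The result has genes as rows and one 0/1 column per cell line, which, as far as I know, is the layout upset() in UpSetR works with. UpSetR also provides a fromList() helper that accepts a named list such as gene_lists, which may make the manual conversion unnecessary.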