Question

Filter unique values over columns from a dataframe in R

3.6 years ago

conor.whelan ▴ 10

Hello,

I have a data frame in CSV format of genes expressed in different tissue types that looks like this:

Brain   Liver   Kidney
A4GALT  A4GNT   AACS
AAAS    AAAS    AABAD
AACS    AACS    AAGAB
AADAC   AADAC   AAK1
AADAT   AAGAB   AAMDC

I would like to sort through these to identify the genes that are unique to each tissue type to produce a data frame like so:

Brain   Liver   Kidney
A4GALT  A4GNT   AABAD
AADAT           AAK1
                AAMDC

I have tried doing this in various ways in Excel, but the data frame is just too large.

Is there a function in R that can do this?

gene R • 1.3k views


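library(dplyr)
library(tidyr)

df %>% 
    # long format: one row per (tissue k, gene v) pair
    pivot_longer(everything(), names_to = "k", values_to = "v") %>% 
    group_by(v) %>% 
    # keep genes that occur exactly once across all tissues
    filter(n() == 1) %>% 
    ungroup() %>% 
    # row id so pivot_wider() gives each gene its own row
    mutate(rowname = row_number()) %>% 
    pivot_wider(names_from = k, values_from = v) %>% 
    select(-rowname)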

# A tibble: 6 x 3
  Brain  Liver Kidney
  <chr>  <chr> <chr> 
1 A4GALT NA    NA    
2 NA     A4GNT NA    
3 NA     NA    AABAD 
4 NA     NA    AAK1  
5 AADAT  NA    NA    
6 NA     NA    AAMDC

3.6 years ago by cpad0112 21k
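
The tibble in the reply above is staggered: each unique gene sits on its own row, with NA everywhere else. If the compact layout from the question is wanted, one possible follow-up step in base R is sketched below. The `staggered` data frame is typed in by hand from the printed output purely for illustration, and the names `staggered` and `compact` are not from the original reply.

```r
# Staggered result, rebuilt by hand from the printed tibble (illustration only).
staggered <- data.frame(
  Brain  = c("A4GALT", NA, NA, NA, "AADAT", NA),
  Liver  = c(NA, "A4GNT", NA, NA, NA, NA),
  Kidney = c(NA, NA, "AABAD", "AAK1", NA, "AAMDC"),
  stringsAsFactors = FALSE
)

compact <- staggered

# Within each column, move non-NA values to the top.
# order() on a logical is stable, so the genes keep their relative order.
compact[] <- lapply(compact, function(col) col[order(is.na(col))])

# Drop the rows that are now NA in every column.
compact <- compact[rowSums(!is.na(compact)) > 0, , drop = FALSE]

print(compact)
```

This leaves one row per position in the longest column, with NA padding the shorter ones.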

score 0 · Answer 1 · 2021-07-19

df <- 
data.frame(
  Brain=c("A4GALT", "AAAS", "AACS", "AADAC", "AADAT"),
  Liver=c("A4GNT", "AAAS", "AACS", "AADAC", "AAGAB"),
  Kidney=c("AACS", "AABAD", "AAGAB", "AAK1", "AAMDC")
)

#/ collapse to a list and count occurrence of each element:
tab <- table(unlist(unclass(df)))

#/ extract those occurring once:
once <- names(tab[tab==1])

#/ make a list for each organ with the unique elements:
unique_per_organ <- sapply(colnames(df), function(x){
  tmp <- df[,x]
  tmp[tmp %in% once]
}, simplify = FALSE)

> unique_per_organ
$Brain
[1] "A4GALT" "AADAT" 

$Liver
[1] "A4GNT"

$Kidney
[1] "AABAD" "AAK1"  "AAMDC"

If you want it back as a data.frame, with the shorter columns padded with "":

df_unique <- do.call(cbind, lapply(names(unique_per_organ), function(x){
  m <- unique_per_organ[[x]]
  d <- data.frame(c(m, rep('""', nrow(df)-length(m))))
  colnames(d) <- x
  d
}))

df_unique

   Brain Liver Kidney
1 A4GALT A4GNT  AABAD
2  AADAT    ""   AAK1
3     ""    ""  AAMDC
4     ""    ""     ""
5     ""    ""     ""

score 0 · Answer 2 · 2021-07-19

Kinda works...

df <- as.matrix(read.delim2('your_file.tsv'))
all.names <- unlist(data.frame(df))
duplicates <- all.names[which(duplicated(all.names))]
df[df %in% duplicates] <- ""
df <- apply(df,2,sort,decreasing=TRUE)
df
     Brain    Liver   Kidney 
[1,] "AADAT"  "A4GNT" "AAMDC"
[2,] "A4GALT" ""      "AAK1" 
[3,] ""       ""      "AABAD"
[4,] ""       ""      ""     
[5,] ""       ""      ""

score 0 · Answer 3 · 2021-07-19

Loop through columns, apply setdiff of other columns:

# example data
x <- read.table(text = "Brain   Liver   Kidney
A4GALT  A4GNT   AACS
AAAS    AAAS    AABAD
AACS    AACS    AAGAB
AADAC   AADAC   AAK1
AADAT   AAGAB   AAMDC", header = TRUE)

setNames(
  lapply(seq_along(x), function(i) setdiff(x[[ i ]], unlist(x[ -i ]))),
  colnames(x))
# $Brain
# [1] "A4GALT" "AADAT" 
# 
# $Liver
# [1] "A4GNT"
# 
# $Kidney
# [1] "AABAD" "AAK1"  "AAMDC"
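
Since the question started from a CSV file, the list returned above may need to go back out as a table with blank cells. A minimal sketch, assuming the list has been stored in a variable (called `unique_genes` here, a name not used in the answer) and that a temporary file is an acceptable destination; the file name `unique_genes.csv` is hypothetical.

```r
# Result list from the setdiff approach, rebuilt by hand for illustration.
unique_genes <- list(
  Brain  = c("A4GALT", "AADAT"),
  Liver  = "A4GNT",
  Kidney = c("AABAD", "AAK1", "AAMDC")
)

# Indexing a vector past its end returns NA, so subsetting every element
# with 1..n pads the shorter ones to the length of the longest.
n <- max(lengths(unique_genes))
padded <- as.data.frame(
  lapply(unique_genes, `[`, seq_len(n)),
  stringsAsFactors = FALSE
)

# Write the NAs as empty cells so the file looks like the table in the question.
out_file <- file.path(tempdir(), "unique_genes.csv")  # hypothetical file name
write.csv(padded, out_file, na = "", row.names = FALSE)

print(padded)
```

Keeping NA inside R and only converting to blanks at export time means functions such as `is.na()` still work on the data frame.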