Filter unique values over columns from a dataframe in R
3.4 years ago
conor.whelan ▴ 10

Hello,

I have a data frame in CSV format of genes expressed in different tissue types that looks like this:

Brain   Liver   Kidney
A4GALT  A4GNT   AACS
AAAS    AAAS    AABAD
AACS    AACS    AAGAB
AADAC   AADAC   AAK1
AADAT   AAGAB   AAMDC

I would like to sort through these to identify the genes that are unique to each tissue type to produce a data frame like so:

Brain   Liver   Kidney
A4GALT  A4GNT   AABAD
AADAT           AAK1
                AAMDC

I have tried doing this in various ways in Excel, but the data frame is just too large.

Is there a function in R that can do this?

gene R • 1.2k views
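library(dplyr)
library(tidyr)

df %>% 
    # long format: one row per (tissue, gene) pair
    pivot_longer(everything(), names_to = "k", values_to = "v") %>% 
    group_by(v) %>% 
    filter(n() == 1) %>%            # keep genes that occur exactly once overall
    ungroup() %>% 
    mutate(row = row_number()) %>%  # row id so pivot_wider keeps one gene per row
    pivot_wider(names_from = k, values_from = v) %>% 
    select(-row)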

# A tibble: 6 x 3
  Brain  Liver Kidney
  <chr>  <chr> <chr> 
1 A4GALT NA    NA    
2 NA     A4GNT NA    
3 NA     NA    AABAD 
4 NA     NA    AAK1  
5 AADAT  NA    NA    
6 NA     NA    AAMDC 
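
The result above has one row per unique gene, with `NA` everywhere else, rather than the compact layout asked for in the question. One possible way to close that gap, sketched in base R: drop the `NA`s from each column, then pad the columns back to a common length. The object names `wide` and `compact` are illustrative, and `wide` is typed in by hand to mirror the tibble shown above.

```r
# Hand-typed stand-in for the 6 x 3 result shown above (assumption)
wide <- data.frame(
  Brain  = c("A4GALT", NA, NA, NA, "AADAT", NA),
  Liver  = c(NA, "A4GNT", NA, NA, NA, NA),
  Kidney = c(NA, NA, "AABAD", "AAK1", NA, "AAMDC"),
  stringsAsFactors = FALSE
)

# Drop the NAs in each column, keeping the original order of the genes
cols <- lapply(wide, function(x) x[!is.na(x)])

# Indexing past the end of a vector yields NA, which pads every
# column up to the length of the longest one
n <- max(lengths(cols))
compact <- as.data.frame(
  lapply(cols, function(x) x[seq_len(n)]),
  stringsAsFactors = FALSE
)
compact
```

With the example data this should give three rows, with `NA` in the cells that the question shows as blank.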
3.4 years ago
ATpoint 86k
df <- 
data.frame(
  Brain=c("A4GALT", "AAAS", "AACS", "AADAC", "AADAT"),
  Liver=c("A4GNT", "AAAS", "AACS", "AADAC", "AAGAB"),
  Kidney=c("AACS", "AABAD", "AAGAB", "AAK1", "AAMDC")
)

#/ collapse to a list and count occurrence of each element:
tab <- table(unlist(unclass(df)))

#/ extract those occurring once:
once <- names(tab[tab==1])

#/ make a list for each organ with the unique elements:
unique_per_organ <- sapply(colnames(df), function(x){
  tmp <- df[,x]
  tmp[tmp %in% once]
}, simplify = FALSE)

> unique_per_organ
$Brain
[1] "A4GALT" "AADAT" 

$Liver
[1] "A4GNT"

$Kidney
[1] "AABAD" "AAK1"  "AAMDC"

If you want it back as a data.frame, padding the shorter columns with a literal "" placeholder:

df_unique <- do.call(cbind, lapply(names(unique_per_organ), function(x){
  m <- unique_per_organ[[x]]
  d <- data.frame(c(m, rep('""', nrow(df)-length(m))))
  colnames(d) <- x
  d
}))

df_unique

   Brain Liver Kidney
1 A4GALT A4GNT  AABAD
2  AADAT    ""   AAK1
3     ""    ""  AAMDC
4     ""    ""     ""
5     ""    ""     ""
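
Since the original data came from a CSV file, the table above may need to go back out to one. A rough sketch of that step, assuming the all-placeholder rows should be dropped and the visible `""` placeholder replaced by a truly empty cell. `df_unique` is retyped by hand here so the snippet runs on its own, and `out_file` is a temporary placeholder path, not a name from the thread.

```r
# Hand-typed copy of df_unique as printed above; the padding cells
# hold the two-character string "" (assumption: same content as above)
df_unique <- data.frame(
  Brain  = c("A4GALT", "AADAT", '""', '""', '""'),
  Liver  = c("A4GNT", '""', '""', '""', '""'),
  Kidney = c("AABAD", "AAK1", "AAMDC", '""', '""'),
  stringsAsFactors = FALSE
)

# Keep only rows that contain at least one real gene name
keep <- rowSums(df_unique != '""') > 0
trimmed <- df_unique[keep, , drop = FALSE]

# Swap the visible placeholder for an empty string
trimmed[trimmed == '""'] <- ""

# Write to a temporary file (placeholder path) without quoting
out_file <- tempfile(fileext = ".csv")
write.csv(trimmed, out_file, row.names = FALSE, quote = FALSE)
readLines(out_file)
```

Gene symbols do not normally contain commas, so `quote = FALSE` should be safe here; keep the default quoting if that is in doubt.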

Ah, thank you. I think you've solved it. The bigger problem seems to be that I wasn't creating the data frame correctly! Thank you!!

3.4 years ago

Kinda works...

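# read the table as a character matrix
df <- as.matrix(read.delim2('your_file.tsv'))
all.names <- unlist(data.frame(df))
# genes that occur more than once anywhere in the table
duplicates <- all.names[which(duplicated(all.names))]
# blank out every occurrence of a duplicated gene
df[df %in% duplicates] <- ""
# sort each column in decreasing order so the blanks sink to the bottom
df <- apply(df,2,sort,decreasing=TRUE)
df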
     Brain    Liver   Kidney 
[1,] "AADAT"  "A4GNT" "AAMDC"
[2,] "A4GALT" ""      "AAK1" 
[3,] ""       ""      "AABAD"
[4,] ""       ""      ""     
[5,] ""       ""      ""     
3.4 years ago
zx8754 12k

Loop through the columns and, for each one, take the setdiff against all the other columns:

# example data
x <- read.table(text = "Brain   Liver   Kidney
A4GALT  A4GNT   AACS
AAAS    AAAS    AABAD
AACS    AACS    AAGAB
AADAC   AADAC   AAK1
AADAT   AAGAB   AAMDC", header = TRUE)

setNames(
  lapply(seq_along(x), function(i) setdiff(x[[ i ]], unlist(x[ -i ]))),
  colnames(x))
# $Brain
# [1] "A4GALT" "AADAT" 
# 
# $Liver
# [1] "A4GNT"
# 
# $Kidney
# [1] "AABAD" "AAK1"  "AAMDC"
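
One difference from the counting-based answers is worth knowing if a column can list the same gene twice: `setdiff` only compares against the other columns, so a gene repeated within a single tissue is still reported as unique to it, whereas a `table(...) == 1` filter drops it. A minimal sketch with made-up data; the gene names `G1` to `G4` and the objects `y`, `by_setdiff` and `by_count` are hypothetical.

```r
# Hypothetical data: G1 is listed twice, but only under Brain
y <- data.frame(
  Brain = c("G1", "G1", "G2"),
  Liver = c("G2", "G3", "G4"),
  stringsAsFactors = FALSE
)

# setdiff approach: compare each column with all the others
by_setdiff <- setNames(
  lapply(seq_along(y), function(i) setdiff(y[[i]], unlist(y[-i]))),
  colnames(y))

# counting approach: keep genes that occur exactly once in the whole table
tab <- table(unlist(y))
once <- names(tab[tab == 1])
by_count <- lapply(y, function(col) col[col %in% once])

by_setdiff$Brain  # "G1"
by_count$Brain    # character(0)
```

On the example data from the question no gene repeats within a column, so both approaches agree there.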