Question

Subsetting/filtering a tibble by a combination of data stored in a second tibble

0

Entering edit mode

3.0 years ago

Mathias_H ▴ 20

Hello everyone, I am looking for the easiest way to subset a tibble by data from another tibble. Let me explain:

> Tibble_1
# A tibble: 10 x 5
   Mushroom M_Nr  color  size surface
   <chr>    <chr> <chr> <dbl> <chr>  
 1 K570     12    brown     7 ABC    
 2 K570     18    brown     9 CDF    
 3 K570     33    brown    10 FDA    
 4 K830     1     brown    14 BCA    
 5 K830     23    brown    16 BBF    
 6 K830     44    brown    15 FCA    
 7 K830     45    brown    17 ABC    
 8 K830     48    brown    14 CDF    
 9 K480     7     brown    14 FDA    
10 K480     34    brown     9 CDF 

> Tibble_2
# A tibble: 10 x 2
    size surface
   <dbl> <chr>  
 1    24 ABC    
 2     9 CDF    
 3     8 CDF    
 4    12 FDA    
 5    13 FDA    
 6    15 FDA    
 7    17 ABC    
 8    12 FCA    
 9    14 FCA    
10     9 CDF

I want to filter Tibble_1 to keep only the rows that have a combination of size and surface, that cannot be found in Tibble_2. For example, I want to keep row 1 of Tibble_1, because there is no combination of size = 7 and surface = ABC in Tibble_2. On the other hand, I do not want to keep row 2 of Tibble_1 because there is a row in Tibble_2 with the same combination of size and surface (in this case also row 2). I cannot filter for size and surface consecutively, because then I would lose some rows that I actually want to keep. (E.g. if I filter for size first, I would lose row 9 in Tibble_1, even though there is no size = 14/surface = FDA in Tibble_2)

I am mostly working with tidyverse packages, so I would be happy to find a solution within the functions of these packages. Can someone help me?

Thanks a lot!

tidyverse tibble R dplyr • 1.3k views

ADD COMMENT • link 3.0 years ago by Mathias_H ▴ 20

0

Entering edit mode

Just make a new column and filter rows based on it.

ADD REPLY • link 3.0 years ago by gglim ▴ 210

score 1 · Answer 1 · 2021-11-12

This seems like a job for dplyr::anti_join. I've not checked the output though, you might want to make sure it's what you want.

library(dplyr)
anti_join(t1, t2)

Joining, by = c("size", "surface")
# A tibble: 7 × 5
  V2       V3 V4     size surface
  <chr> <int> <chr> <int> <chr>  
1 K570     12 brown     7 ABC    
2 K570     33 brown    10 FDA    
3 K830      1 brown    14 BCA    
4 K830     23 brown    16 BBF    
5 K830     44 brown    15 FCA    
6 K830     48 brown    14 CDF    
7 K480      7 brown    14 FDA

Or, a slightly less elegant version:

t1 %>% 
    mutate(comb = paste(size, surface, sep="_")) %>% 
    filter(!comb %in% paste(t2$size, t2$surface, sep="_"))