Allowing duplicates when matching values from two tables
2
0
Entering edit mode
5.3 years ago
yp19 ▴ 70

Hi all!

I have a reference table of some pathway information that looks like this (rownames and first column are the same)

                [,1]              [,2]                                                
PTHR11566:SF148 "PTHR11566:SF148" "INTERFERON-INDUCED GTP-BINDING PROTEIN MXA-RELATED"
PTHR19353:SF12  "PTHR19353:SF12"  "FATTY ACID DESATURASE 2"                           
PTHR31490:SF39  "PTHR31490:SF39"  ""                                                  
PTHR45744:SF9   "PTHR45744:SF9"   ""                                                  
PTHR16223:SF138 "PTHR16223:SF138" ""                                                  
PTHR45618:SF8   "PTHR45618:SF8"   "MITOCHONDRIAL UNCOUPLING PROTEIN 4"

and I'm trying to use this reference table to give each value in column 2 of this table below (called table2) its appropriate description (which can be found in column 2 of above table)

   [,1]              [,2]                                                
    1             PTHR19353:SF12                            
    2             PTHR19353:SF12                                                                                          
    3             PTHR45618:SF8

The catch is that some values in column 2 above are duplicated, so R only returns unique values but i need all results to be returned. I tried using a list, dataframe, and now a vector to store results of matching, but still having the same issue.

The desired output for my example above would be

PTHR19353:SF12                  FATTY ACID DESATURASE 2   
PTHR19353:SF12                  FATTY ACID DESATURASE 2                                                           
PTHR45618:SF8                   MITOCHONDRIAL UNCOUPLING PROTEIN 4

Here is the code I have going to try to solve this but it is not giving me all the results:

to.fill=c()
for(name in as.character(table2[,2])){
    one=reference.table[name,]
    one.desc=one[2]
    to.fill[name]=one.desc
}

Any idea on how to get around this? Thank you for your input!

R • 908 views
ADD COMMENT
3
Entering edit mode
5.3 years ago
Chirag Parsania ★ 2.0k

You can also use dplyr::left_join(). See the example below

====== Edited slightly to show duplicate values.

library(tidyverse)

## create source data table having 2 columns : pathway_id and description
source_table <- tibble::tribble(
                       ~pathway_id,                                                    ~description,
             "PTHR11566:SF148", "INTERFERON-INDUCED GTP-BINDING PROTEIN MXA-RELATED",
              "PTHR19353:SF12",                            "FATTY ACID DESATURASE 2",
              "PTHR31490:SF39",                                                   "",
               "PTHR45744:SF9",                                                   "",
             "PTHR16223:SF138",                                                   "",
               "PTHR45618:SF8",                 "MITOCHONDRIAL UNCOUPLING PROTEIN 4"
              )


## create  query table through some random pathway ids 
query_table <- tibble::tibble(pathway_id =  sample(source_table$pathway_id, 10 , replace = T))
query_table
#> # A tibble: 10 x 1
#>    pathway_id    
#>    <chr>         
#>  1 PTHR45744:SF9 
#>  2 PTHR45744:SF9 
#>  3 PTHR45744:SF9 
#>  4 PTHR19353:SF12
#>  5 PTHR45618:SF8 
#>  6 PTHR45744:SF9 
#>  7 PTHR31490:SF39
#>  8 PTHR45744:SF9 
#>  9 PTHR19353:SF12
#> 10 PTHR19353:SF12

## map desc from source table to query table 
query_table %>% dplyr::left_join(source_table ,)
#> Joining, by = "pathway_id"
#> # A tibble: 10 x 2
#>    pathway_id     description                       
#>    <chr>          <chr>                             
#>  1 PTHR45744:SF9  ""                                
#>  2 PTHR45744:SF9  ""                                
#>  3 PTHR45744:SF9  ""                                
#>  4 PTHR19353:SF12 FATTY ACID DESATURASE 2           
#>  5 PTHR45618:SF8  MITOCHONDRIAL UNCOUPLING PROTEIN 4
#>  6 PTHR45744:SF9  ""                                
#>  7 PTHR31490:SF39 ""                                
#>  8 PTHR45744:SF9  ""                                
#>  9 PTHR19353:SF12 FATTY ACID DESATURASE 2           
#> 10 PTHR19353:SF12 FATTY ACID DESATURASE 2

Created on 2019-08-04 by the reprex package (v0.3.0)

ADD COMMENT
2
Entering edit mode
5.3 years ago
Ram 44k

Use merge instead of this loop. It's a far more efficient way to "join" 2 data.frames.

tbldf1 <- data.frame(table1)
colnames(tbldf1) <- c('col1', 'col2')
tbldf2 <- data.frame(table2)
colnames(tbldf2) <- c('col1', 'col2')
merge(x = tbldf1, y = tbldf2, by.x = 'col1', by.y = 'col2')
ADD COMMENT

Login before adding your answer.

Traffic: 2667 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6