Question

How to combine a data frame with another data frame containing comma-separated values?

3.5 years ago

sasa ▴ 10

Does anyone have a good idea to combine a data frame with one containing comma-separated values?

Data frame A has 2 variables (mouse gene name and Ensembl ID):

head(dfA)
#     gene         EnsembleID    
# 1  Hoxa9 ENSMUSG00000038227               
# 2 Zfp128 ENSMUSG00000060397 
# 3 Zfp853 ENSMUSG00000093910 
# 4  Nr1h2 ENSMUSG00000060601               
# 5  Nr1h3 ENSMUSG00000002108               
# 6  Nr1h4 ENSMUSG00000047638

Data frame B has 3 variables (orthogroup, mouse gene Ensembl ID, and zebrafish gene Ensembl ID). Note that in the picture below, the mouse and zebrafish ID columns are named Mus_musculus.GRCm39.pep.all and Danio_rerio.GRCz11.pep.all, respectively.

[Image: data frame B]

I want to add a new column, "zebrafish Ensembl ID", to A, looking up the values in B. The final data I need looks like this:
[Image: data frame A with the added zebrafish Ensembl ID column]

The problem is that, in data frame B, both Ensembl ID columns contain comma-separated values. This file comes from the output of OrthoFinder.

If there were only a single value in each row, I understand that we could easily combine the data frames using the merge function in R. Data frame A has 1700 rows (= 1700 genes), so I would like to use code instead of searching and copy-pasting manually. Any suggestions or ideas would be really appreciated!

frame python R orthofinder comma data • 1.8k views


You can use tidyr::separate_rows to split the DF so it has one ID per entry. See: https://tidyr.tidyverse.org/reference/separate_rows.html
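As a minimal sketch of what separate_rows does, assuming a toy data frame whose column names (group, id) and values are made up for illustration:

```r
library(tidyr)

# Toy data (hypothetical): the first row holds two comma-separated IDs
df <- data.frame(group = c("OG1", "OG2"),
                 id = c("id1,id2", "id3"))

# Split the `id` column so that each ID gets its own row;
# the other columns are repeated for every new row
long <- tidyr::separate_rows(df, id, sep = ",")
long
```

Once each row holds a single ID, an ordinary merge or join on that column should work. In recent tidyr releases separate_rows is marked as superseded by separate_longer_delim, though it still works.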

3.5 years ago by Ram 44k

Thank you very much. I hadn't thought of it that way, and I need to learn more. I was able to solve this issue using the answer below.

3.5 years ago by sasa ▴ 10

score 5 · Accepted Answer · 2021-08-25

For data frame manipulation in general, you should look into the dplyr and tidyr packages. They offer endless possibilities once you learn to use them (lots of practice will help). A good and concise cheatsheet is available here.

Regarding this problem in particular, something like this should work:

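library(dplyr)
library(tidyr)

# Toy stand-in for data frame A: gene names and mouse IDs
dfA <- data.frame(gene = c("a", "b", "c", "d", "e"),
                  mus_id = c("mus001", "mus002", "mus003", "mus004", "mus005"))

# Toy stand-in for data frame B: both columns may hold comma-separated IDs
dfB <- data.frame(mus_id = c("mus001", "mus002", "mus003,mus002", "mus004"),
                  zeb_id = c("dar001", "dar002,dar004", "dar003,dar002", "dar004"))

# Split both columns to one ID per row, then collapse the unique
# zebrafish IDs for each mouse ID back into one comma-separated string
dfC <-  dfB %>%
  tidyr::separate_rows(zeb_id, sep = ",") %>%
  tidyr::separate_rows(mus_id, sep = ",") %>%
  dplyr::group_by(mus_id) %>%
  dplyr::summarise(zeb_id = paste0(sort(unique(na.omit(zeb_id))), collapse = ','))

# Keep every row of dfA; mouse IDs absent from dfB (here mus005) get NA
dfD <- dfA %>%
  dplyr::left_join(dfC, by = "mus_id")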
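
To adapt this to the column names mentioned in the question, something along these lines might work. This is only a sketch: the Orthogroup column name, the orthogroup labels, and the zebrafish IDs are made-up placeholders, and the ", " (comma plus space) separator is an assumption about the file, which is why the split uses the regular expression ",\\s*".

```r
library(dplyr)
library(tidyr)

# Two genes taken from the question's dfA
dfA <- data.frame(gene = c("Hoxa9", "Zfp128"),
                  EnsembleID = c("ENSMUSG00000038227", "ENSMUSG00000060397"))

# Hypothetical orthogroup table: "mouse_other" and the zeb_fake_* IDs
# are placeholders, not real identifiers or real orthologs
dfB <- data.frame(
  Orthogroup = c("OG_a", "OG_b"),
  Mus_musculus.GRCm39.pep.all = c("ENSMUSG00000038227, mouse_other",
                                  "ENSMUSG00000060397"),
  Danio_rerio.GRCz11.pep.all = c("zeb_fake_1, zeb_fake_2", "zeb_fake_3"))

# One ID per row in both columns, then one row per mouse ID with
# its unique zebrafish IDs pasted back together
lookup <- dfB %>%
  separate_rows(Danio_rerio.GRCz11.pep.all, sep = ",\\s*") %>%
  separate_rows(Mus_musculus.GRCm39.pep.all, sep = ",\\s*") %>%
  group_by(Mus_musculus.GRCm39.pep.all) %>%
  summarise(zebrafish_id = paste(sort(unique(Danio_rerio.GRCz11.pep.all)),
                                 collapse = ","))

# Join on differently named key columns; unmatched genes would get NA
result <- left_join(dfA, lookup,
                    by = c("EnsembleID" = "Mus_musculus.GRCm39.pep.all"))
result
```

The named vector passed to by is what lets the join match EnsembleID in dfA against the mouse column of the lookup table without renaming either one.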