Does anyone have a good idea to combine a data frame with one containing comma-separated values?
Data frame A, which has 2 variables (mouse gene name and Ensemble ID):
head(dfA)
# gene EnsembleID
# 1 Hoxa9 ENSMUSG00000038227
# 2 Zfp128 ENSMUSG00000060397
# 3 Zfp853 ENSMUSG00000093910
# 4 Nr1h2 ENSMUSG00000060601
# 5 Nr1h3 ENSMUSG00000002108
# 6 Nr1h4 ENSMUSG00000047638
Data frame B has 3 variables (orthogroup, mouse gene Ensemble ID, and zebrafish gene Ensemble ID). Noted that in the picture below the column names are Mus_musculus.GRCm39.pep.all and Danio_rerio.GRCz11.pep.all, respectively.
I want to add a new column for "zebrafish Ensemble ID" into A as referenced information in B. So, the final data I need looks like this:
The point is, in data frame B, both Ensemble IDs' columns contain comma-separated values. This file is originally from the outputs of orthofinder.
If there is only a single value in each row, I understand that we can easily combine those data frames using merge
function in R. The length of data frame A is 1700 (= 1700 genes), so I would like to use some codes instead of search©&paste manually. Any suggestions or ideas would be really appreciated!
You can use
tidyr::separate_rows
to split the DF so it has one ID per entry. See: https://tidyr.tidyverse.org/reference/separate_rows.htmlThank you very much. I didn't think of it that way, and need to learn more. I was able to deal with this issue using the below answer.