Question

How can I join multiple columns from different datasets together in R?

1

Entering edit mode

3.5 years ago

valentinavan ▴ 50

I have many different datasets from different samples. Each of these has two columns: A - species list and B - counts. Not all of these species are present in each of my samples therefore I would like to merge these columns together and when one of these species is not present it will be NA/zero.

In summary my new merged dataset needs to have: column A with all the species list and the following columns with the species counts for each of my samples.

I have done this with "full.join" but when I have many datasets to merge, this is not convenient anymore. I guess I would need to build a loop but I am not really good at it yet and I'd need some help.

Any help would be more than appreciated, I am stuck.

Thanks in advance

r • 2.3k views

ADD COMMENT • link 3.5 years ago by valentinavan ▴ 50

1

Entering edit mode

df code taken from Arup Ghosh 's post:

with dplyr:

> x %>% 
+     left_join(y) %>% 
+     left_join(z) %>% 
+     replace(is.na(.), 0)
Joining, by = "gene_symbol"
Joining, by = "gene_symbol"
  gene_symbol sample1 sample2 sample3
1           a       1       0       0
2           b       2       4       0
3           c       3       5       0
4           f       4       0       7
5           g       5       0       8

with sqldf:

> library(sqldf)

> sqldf('select x.*, y.sample2, z.sample3 from x 
        left join y on x.gene_symbol=y.gene_symbol 
        left join z on x.gene_symbol=z.gene_symbol')

ADD REPLY • link 3.5 years ago by cpad0112 21k

0

Entering edit mode

Merge unequal dataframes and replace missing rows with 0 ?

ADD REPLY • link 3.5 years ago by darklings ▴ 580

score 2 · Answer 1 · 2021-05-20

Create a list of dataframes and use Reduce. PS. Make sure the name of the first column is the same in all the data frames and the second column name is unique(sample name).

x <- data.frame(gene_symbol = c("a","b","c","f","g"), sample1 = 1:5, stringsAsFactors=FALSE)
y <- data.frame(gene_symbol = c("b","c","d"), sample2 = 4:6, stringsAsFactors=FALSE)
z <- data.frame(gene_symbol = c("f","g","k","l"), sample3 = 7:10, stringsAsFactors=FALSE)

df_list <- list(x,y,z)

merged.df<-Reduce(function(...) merge(..., all=T), df_list)

merged.df[is.na(merged.df)]<-0

merged.df

  gene_symbol sample1 sample2 sample3
1           a       1       0       0
2           b       2       4       0
3           c       3       5       0
4           d       0       6       0
5           f       4       0       7
6           g       5       0       8
7           k       0       0       9
8           l       0       0      10

Source: https://stackoverflow.com/a/34393416/3548799

score 1 · Answer 2 · 2021-05-20

Arup's solution is good but I find Reduce a bit cryptic. The solution below is pretty much a translation of "merge these datasets one after the other":

# Some dummy data:
x <- data.frame(gene_symbol = c("a","b","c","f","g"), sample1 = 1:5, stringsAsFactors=FALSE)
y <- data.frame(gene_symbol = c("b","c","d"), sample2 = 4:6, stringsAsFactors=FALSE)
z <- data.frame(gene_symbol = c("f","g","k","l"), sample3 = 7:10, stringsAsFactors=FALSE)


df_list <- list(x, y ,z)

mrg <- df_list[[1]]
for(i in 2:length(df_list)) {
    mrg <- merge(mrg, df_list[[i]], all= TRUE, by= 'gene_symbol')
}

mrg[is.na(mrg)] <- 0