Question

matching and merging data frames of different lengths

0

Entering edit mode

11.2 years ago

Diana ▴ 930

Hi,

I have 2 files and I want to match the first column in first file with the first column in second file then if they match, I want to pick the corresponding value in 2nd column of 2nd file and add it in a new column in the first file. The problem is, when I match my 2 files I end up with data frames of different lengths because my first file contains gene Ids multiple times but in the second file, I have the corresponding gene name which would of course just occur once. As an example,

File1:

Gene Id     value_sample1     value_sample2
x            0.0001            0.00000034

File2:

Gene Id     Gene Name
x           y

Thanks a lot!!

R • 51k views

ADD COMMENT • link updated 3.9 years ago by Ram 45k • written 11.2 years ago by Diana ▴ 930

0

Entering edit mode

have a look into the merge function

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.2 years ago by russhh 5.8k

0

Entering edit mode

and what do you want to have in the end, if a gene is present more than once in your first file? do you still want to get all its rows or for example only the first one?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.2 years ago by Martombo ★ 3.2k

0

Entering edit mode

Yes I want to get all the rows of the gene in the first file. But add the gene name in a new column. When I match the first file with second I end up with data frames of different lengths and hence I can't cbind them. Any advice?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.2 years ago by Diana ▴ 930

0

Entering edit mode

are you using R I guess? In R you can use the command merge as suggested by @russ_hyde, as long as your data is stored in two data.frames. you can use it like: merge(data.frame1,data.frame2) if your data.frames have the gene id in the first vector (as in your example)

A very quick command with bash would be: join <(sort file1) <(sort file2) >fileOut

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 11.2 years ago by Martombo ★ 3.2k

1

Entering edit mode

11.2 years ago

dariober 15k

If your tables are pretty big, consider using the merge function in the data.table package. It's very fast even with tables of millions of records. See quick comparison with base merge:

install.packages('data.table') ## Install package
library(data.table)

## Dummy data:
dat1<- data.table(Gene_Id= 1:10000000, value.x= rnorm(n= 10000000))
dat2<- data.table(Gene_Id= 1:10000000, value.y= rnorm(n= 10000000))

setkeyv(dat1, c('Gene_Id'))
setkeyv(dat2, c('Gene_Id'))

system.time(datm<- merge(dat1, dat2))


## Use merge in base pckg
datA<- as.data.frame(dat1)
datB<- as.data.frame(dat2)

system.time(datM<- base:::merge(datA, datB))

#   user  system elapsed
#  0.750   0.004   0.755   <<- data.table

# 73.689   1.203  74.901  <<- base

Set the option all=TRUE to include all rows from both tables.

data.table has also the fread() function which is much faster than `read.table()

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 11.2 years ago by dariober 15k

Ram · Accepted Answer · 2014-05-29

People have suggested using the merge function, and I agree. I created an example for you. You mentioned something about multiple gene IDs in the first file that were duplicates so I tried to include that in the example. If you want a different result related to which data is returned I'd check out the merge() function.

> # These commands were written in R
> #create file 1
> dat1 <- matrix(data = rnorm(12), nrow = 4, ncol = 3)
> 
> colnames(dat1) <- c("Gene_Id",     "value_sample1",     "value_sample2")
> dat1[,1] <- c("x", "y", "z", "x")
> 
> dat1
     Gene_Id value_sample1        value_sample2      
[1,] "x"     "-0.475801113106774" "0.373054164662158"
[2,] "y"     "1.28872139254973"   "0.965427944478498"
[3,] "z"     "-0.226327653738543" "0.375294465742173"
[4,] "x"     "-1.2569558077676"   "-1.58830615905416"
> 
> #create file 2
> dat2 <- matrix(data = NA, nrow = 2, ncol = 2)
> colnames(dat2) <- c("Gene_Id",     "gene_name")
> dat2[,1] <- c("x", "y")
> dat2[,2] <- c("A", "B")
> 
> dat2
     Gene_Id gene_name
[1,] "x"     "A"      
[2,] "y"     "B"      
> 
> #merge file 1 and file 2 using the "Gene_Id" column
> merge(dat1, dat2, by.x = "Gene_Id", by.y="Gene_Id")
  Gene_Id      value_sample1     value_sample2 gene_name
1       x -0.475801113106774 0.373054164662158         A
2       x   -1.2569558077676 -1.58830615905416         A
3       y   1.28872139254973 0.965427944478498         B
>