Question

How to find common data in replicates?

2

Entering edit mode

6.7 years ago

WUSCHEL ▴ 850

I have 3 replicates data outputs in separate three .csv files.E

Each files the first column is Gene identification numbers (AGI: my experimental targets). The three files consisted of different experimental outcomes. i.e. first file 250 targets, second ~500, and third ~ 2000.

Each file consisted with other information for the identified targets, e.g.

I want to identify common targets (AGIs) in these three .csv files and make a new working table combining other details of the common AGIs.

How can I identify the common targets (AGIs) using R programming?

r gene • 2.2k views

ADD COMMENT • link updated 6.7 years ago by c.chakraborty ▴ 180 • written 6.7 years ago by WUSCHEL ▴ 850

2

Entering edit mode

Vlookup in excel and join in Linux (twice)

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

Thank you as always cpad0112. Unfortunately, I do not know Linux. I appreciate your help as always.

ADD REPLY • link 6.7 years ago by WUSCHEL ▴ 850

1

Entering edit mode

inner join. inner join.

ADD REPLY • link 6.7 years ago by shenwei356 8.7k

3

Entering edit mode

6.7 years ago

Benn 8.4k

You can use merge with all.x=F and all.y=F arguments.

Something like:

firstMerge <- merge(df1, df2, all.x=F, all.y=F)
secondMerge <- merge(firstMerge, df3, all.x=F, all.y=F)

ADD COMMENT • link 6.7 years ago by Benn 8.4k

0

Entering edit mode

merge(df1, merge(df2, df3, all=FALSE), all=FALSE)

ADD REPLY • link 6.7 years ago by Ram 44k

1

Entering edit mode

Or Reduce, this is best when we have too many dataframes to merge:

Reduce(function(...) merge(..., by = "AGI"), list(df1, df2, df3))

ADD REPLY • link 6.7 years ago by zx8754 12k

2

Entering edit mode

6.7 years ago

cpad0112 21k

Using sqldf package, SQL like inner joins:

df1 <- data.frame(AGI = sample(1:6), value = rnorm(6,2,1))
df2 <- data.frame(AGI = sample(2:9), value = rnorm(8,0,1.5))
df3 <- data.frame(AGI = c(4,8,5), value = rnorm(3,0,2))

library(sqldf)

sqldf("select df1.AGI,
              df1.value as df1,
              df2.value as df2,
              df3.value as df3
       from df1 inner join df2
                   on df1.AGI=df2.AGI
                inner join df3
                   on df2.AGI=df3.AGI")

#   AGI       df1        df2        df3
# 1   4 0.1842957 -0.9538043 0.07660008
# 2   5 1.2598496  1.1615558 1.03723099

Or using shell:

$ join -1 1 -2 1 <(sort -k1 df1.txt) <(sort -k1 df2.txt) | join -1 1 -2 1 - <(sort -k1 df3.txt) | sort -h

"AGI" "value" "value" "value"
4 0.184295742608555 -0.953804335859278 0.0766000774913751
5 1.25984960266285 1.1615557563389 1.03723099264702

ADD COMMENT • link updated 6.7 years ago by zx8754 12k • written 6.7 years ago by cpad0112 21k

0

Entering edit mode

6.7 years ago

c.chakraborty ▴ 180

We can use Excel, VLOOKUP, for each identifier for each of the arrays, and then VLOOKUP for the values in each array-

ID          Val1     Val2   Val3
Abc    =VLOOKUP(cellA1, from array.csv rep1 col A:B, return value of cell B, False)
Def
Ghi

and so on and you can get the values for each like that. Also for average you can do =AVERAGE(A2, B2, C2) for the cells with values and that will give you mean. You can also get Median. standard deviation and so on Excel.

If you're using windows, separate cells and arrays with comma and for Mac use semi-colon. You can do it all on Excel.

ADD COMMENT • link updated 6.7 years ago by zx8754 12k • written 6.7 years ago by c.chakraborty ▴ 180

0

Entering edit mode

You can do it all on Excel

Don't. Use R. Unless you know exactly the kinds of assumptions Excel makes, and the idiosyncrasies of copy-paste in or out of Excel, and the idiosyncrasies of Excel formulas across types of lookups, do not use Excel.

ADD REPLY • link 6.7 years ago by Ram 44k

score 3 · Accepted Answer · 2018-06-15

3

Entering edit mode

6.7 years ago

zx8754 12k

Using tidyverse, try this example

# dummy data
rep1 <- data.frame(AGI = c(1:5,6), value = 1)
rep2 <- data.frame(AGI = c(2:5,6), value = 2)
rep3 <- data.frame(AGI = c(1,5,6), value = 3)

library(tidyverse)

bind_rows(list(rep1, rep2, rep3), .id = "Rep") %>% 
  spread(key = "Rep", value = "value", sep = "") %>%
  filter(complete.cases(.)) %>% 
  mutate(AverageValue = rowMeans(.[, -1]))

#   AGI Rep1 Rep2 Rep3 AverageValue
# 1   5    1    2    3            2
# 2   6    1    2    3            2

Tip: We could create the list of dataframes automatically, something like:

myReps <- lapply(list.files(), read.csv)

Then carry one with code as above:

bind_rows(myReps, .id = "Rep") %>% 
...etc

ADD COMMENT • link 6.7 years ago by zx8754 12k

1

Entering edit mode

You forgot stringsAsFactors=FALSE :-)

ADD REPLY • link 6.7 years ago by Ram 44k

0

Entering edit mode

Yes, good point, in my dummy data gene names are numeric.

ADD REPLY • link 6.7 years ago by zx8754 12k

0

Entering edit mode

Thank you very much zx8754 :) Appreciate your help with this.

ADD REPLY • link 6.7 years ago by WUSCHEL ▴ 850