How to find common data in replicates?
4
2
Entering edit mode
6.5 years ago
WUSCHEL ▴ 810

I have 3 replicates data outputs in separate three .csv files.E

Each files the first column is Gene identification numbers (AGI: my experimental targets). The three files consisted of different experimental outcomes. i.e. first file 250 targets, second ~500, and third ~ 2000.

Each file consisted with other information for the identified targets, e.g.

Capture

I want to identify common targets (AGIs) in these three .csv files and make a new working table combining other details of the common AGIs.

How can I identify the common targets (AGIs) using R programming?

r gene • 2.1k views
ADD COMMENT
2
Entering edit mode

Vlookup in excel and join in Linux (twice)

ADD REPLY
0
Entering edit mode

Thank you as always cpad0112. Unfortunately, I do not know Linux. I appreciate your help as always.

ADD REPLY
1
Entering edit mode

inner join. inner join.

ADD REPLY
3
Entering edit mode
6.5 years ago
zx8754 12k

Using tidyverse, try this example

# dummy data
rep1 <- data.frame(AGI = c(1:5,6), value = 1)
rep2 <- data.frame(AGI = c(2:5,6), value = 2)
rep3 <- data.frame(AGI = c(1,5,6), value = 3)

library(tidyverse)

bind_rows(list(rep1, rep2, rep3), .id = "Rep") %>% 
  spread(key = "Rep", value = "value", sep = "") %>%
  filter(complete.cases(.)) %>% 
  mutate(AverageValue = rowMeans(.[, -1]))

#   AGI Rep1 Rep2 Rep3 AverageValue
# 1   5    1    2    3            2
# 2   6    1    2    3            2

Tip: We could create the list of dataframes automatically, something like:

myReps <- lapply(list.files(), read.csv)

Then carry one with code as above:

bind_rows(myReps, .id = "Rep") %>% 
...etc
ADD COMMENT
1
Entering edit mode

You forgot stringsAsFactors=FALSE :-)

ADD REPLY
0
Entering edit mode

Yes, good point, in my dummy data gene names are numeric.

ADD REPLY
0
Entering edit mode

Thank you very much zx8754 :) Appreciate your help with this.

ADD REPLY
3
Entering edit mode
6.5 years ago
Benn 8.3k

You can use merge with all.x=F and all.y=F arguments.

Something like:

firstMerge <- merge(df1, df2, all.x=F, all.y=F)
secondMerge <- merge(firstMerge, df3, all.x=F, all.y=F)
ADD COMMENT
0
Entering edit mode
merge(df1, merge(df2, df3, all=FALSE), all=FALSE)
ADD REPLY
1
Entering edit mode

Or Reduce, this is best when we have too many dataframes to merge:

Reduce(function(...) merge(..., by = "AGI"), list(df1, df2, df3))
ADD REPLY
2
Entering edit mode
6.5 years ago

Using sqldf package, SQL like inner joins:

df1 <- data.frame(AGI = sample(1:6), value = rnorm(6,2,1))
df2 <- data.frame(AGI = sample(2:9), value = rnorm(8,0,1.5))
df3 <- data.frame(AGI = c(4,8,5), value = rnorm(3,0,2))

library(sqldf)

sqldf("select df1.AGI,
              df1.value as df1,
              df2.value as df2,
              df3.value as df3
       from df1 inner join df2
                   on df1.AGI=df2.AGI
                inner join df3
                   on df2.AGI=df3.AGI")

#   AGI       df1        df2        df3
# 1   4 0.1842957 -0.9538043 0.07660008
# 2   5 1.2598496  1.1615558 1.03723099

Or using shell:

$ join -1 1 -2 1 <(sort -k1 df1.txt) <(sort -k1 df2.txt) | join -1 1 -2 1 - <(sort -k1 df3.txt) | sort -h

"AGI" "value" "value" "value"
4 0.184295742608555 -0.953804335859278 0.0766000774913751
5 1.25984960266285 1.1615557563389 1.03723099264702
ADD COMMENT
0
Entering edit mode
6.5 years ago
c.chakraborty ▴ 180

We can use Excel, VLOOKUP, for each identifier for each of the arrays, and then VLOOKUP for the values in each array-

ID          Val1     Val2   Val3
Abc    =VLOOKUP(cellA1, from array.csv rep1 col A:B, return value of cell B, False)
Def
Ghi

and so on and you can get the values for each like that. Also for average you can do =AVERAGE(A2, B2, C2) for the cells with values and that will give you mean. You can also get Median. standard deviation and so on Excel.

If you're using windows, separate cells and arrays with comma and for Mac use semi-colon. You can do it all on Excel.

ADD COMMENT
0
Entering edit mode

You can do it all on Excel

Don't. Use R. Unless you know exactly the kinds of assumptions Excel makes, and the idiosyncrasies of copy-paste in or out of Excel, and the idiosyncrasies of Excel formulas across types of lookups, do not use Excel.

ADD REPLY

Login before adding your answer.

Traffic: 1032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6