Matching samples between big datasets
2.1 years ago

Hi all,

I am new here and trying to figure out a specific task:

Let's say I have the following data_frame:

gene = c("a","b","c","d","e","f","g","h","i","j","k", "a","b","c","d","e","f","g","h","i","j","k", "a","b","c","d","e","f","g","h","i","j","k")
sample1 = c("a","a","a","a","a","a","a","a","a","a", "a","b","b","b","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c")
expression1 = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24","25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "36")

df1 <- data_frame(gene, sample1, expression1)

This is a fraction of a very big dataset I have containing 20k genes and 700 samples with expression for all genes.

I also have a second data_frame containing samples with corresponding gene expression values but different sample identifiers, let's say the following:

gene = c("a","b","c","d","e","f","g","h","i","j","k")
sample2 = c("g","g","g","g","g","g","g","g","g","g","g")
expression2 = c("14.7", "15", "17", "16", "18", "20", "21", "22", "23", "24", "25")

df2 <- data_frame(gene, sample2, expression2)

   gene  sample2 expression2
   <chr> <chr>   <chr>
 1 a     g       14.7
 2 b     g       15
 3 c     g       17
 4 d     g       16
 5 e     g       18
 6 f     g       20
 7 g     g       21
 8 h     g       22
 9 i     g       23
10 j     g       24
11 k     g       25

What I need to do is match the gene expression of each sample in df2 as closely as possible to a sample in df1, and have R report back the corresponding sample identifiers.

It would look something like this maybe:

   gene  sample2 expression2 sample1 expression1
   <chr> <chr>   <chr>       <chr>   <chr>
 1 a     g       14.7        b       14
 2 b     g       15          b       15
 3 c     g       17          b       16
 4 d     g       16          b       17
 5 e     g       18          b       18
 6 f     g       20          b       19
 7 g     g       21          b       20
 8 h     g       22          b       21
 9 i     g       23          b       22
10 j     g       24          b       23
11 k     g       25          b       24

I had the following idea:

library(data.table)
setDT(df1)[, expression := as.numeric(expression1)]
setDT(df2)[, expression := as.numeric(expression2)]
df1[df2, on = .(gene, expression), roll = "nearest"][, expression := NULL][]

but this matches each expression value individually rather than per sample. How should I approach this?

This is a very simplified example; the real dataset has a lot of variation. So it is important that a whole sample is matched to a whole sample, and ideally I would also get back a percentage score of matching expressions (if possible).

Depending on how dissimilar the matching samples are between the two datasets, the Euclidean distance between their expression profiles might let you resolve which samples match. After that, it's just a matter of joining the gene expression of each sample with that of its closest neighbor.
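
A minimal base R sketch of that idea, using the toy data from the question. The object names (m1, m2, dist_mat, matches, joined) are made up for illustration, expression is assumed to be stored as numeric rather than character, and both tables are assumed to be on a comparable scale; none of this has been tried on the full 20k x 700 dataset.

```r
# Toy data from the question, with expression stored as numeric
genes <- letters[1:11]
df1 <- data.frame(
  gene = rep(genes, times = 3),
  sample1 = rep(c("a", "b", "c"), each = 11),
  expression1 = c(1:11, 14:24, 25:34, 36),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  gene = genes,
  sample2 = "g",
  expression2 = c(14.7, 15, 17, 16, 18, 20, 21, 22, 23, 24, 25),
  stringsAsFactors = FALSE
)

# Reshape each long table into a gene x sample matrix
m1 <- tapply(df1$expression1, list(df1$gene, df1$sample1), mean)
m2 <- tapply(df2$expression2, list(df2$gene, df2$sample2), mean)

# Compare only genes present in both tables, in the same order
shared <- intersect(rownames(m1), rownames(m2))
m1 <- m1[shared, , drop = FALSE]
m2 <- m2[shared, , drop = FALSE]

# Euclidean distance between every df2 sample (rows) and df1 sample (columns)
dist_mat <- matrix(NA_real_, nrow = ncol(m2), ncol = ncol(m1),
                   dimnames = list(colnames(m2), colnames(m1)))
for (s2 in colnames(m2)) {
  for (s1 in colnames(m1)) {
    dist_mat[s2, s1] <- sqrt(sum((m1[, s1] - m2[, s2])^2))
  }
}

# Closest df1 sample for each df2 sample
matches <- data.frame(
  sample2 = rownames(dist_mat),
  sample1 = colnames(dist_mat)[apply(dist_mat, 1, which.min)],
  distance = apply(dist_mat, 1, min),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Join the per-gene expression of each df2 sample with its matched df1 sample
joined <- merge(merge(df2, matches, by = "sample2"),
                df1, by = c("gene", "sample1"))
print(matches)
```

On the toy data this pairs sample g with sample b. The distance is not a percentage; if a bounded score is wanted, a per-sample correlation, e.g. cor(m1[, s1], m2[, s2]), could be swapped in for the distance, though whether that suits the real data is an open question.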

An aside: data_frame is deprecated. Use either tibble() or base::data.frame().
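
For instance, the first table from the question could be built with base::data.frame() roughly as follows. Storing expression as numeric instead of character is an extra assumption made here, not part of the aside; tibble::tibble() accepts the same name = value column arguments.

```r
# Same toy data as in the question, built with base R only
genes <- letters[1:11]
df1 <- data.frame(
  gene = rep(genes, times = 3),
  sample1 = rep(c("a", "b", "c"), each = 11),
  # numeric rather than character, so no as.numeric() conversion is needed later
  expression1 = c(1:11, 14:24, 25:34, 36),
  stringsAsFactors = FALSE
)
str(df1)
```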
