Question

Matching samples between big datasets


2.1 years ago

janhenk2333 • 0

Hi all,

I am new here and trying to figure out a specific task:

Let's say I have the following data frame (df1):

gene = c("a","b","c","d","e","f","g","h","i","j","k", "a","b","c","d","e","f","g","h","i","j","k", "a","b","c","d","e","f","g","h","i","j","k")
sample1 = c("a","a","a","a","a","a","a","a","a","a", "a","b","b","b","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c")
expression1 = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24","25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "36")

df1 <- data_frame(gene, sample1, expression1)

This is a fraction of a very big dataset I have containing 20k genes and 700 samples with expression for all genes.

I also have a second data frame (df2) containing samples with corresponding gene expression values but different sample identifiers, let's say the following:

gene = c("a","b","c","d","e","f","g","h","i","j","k")
sample2 = c("g","g","g","g","g","g","g","g","g","g","g")
expression2 = c("14.7", "15", "17", "16", "18", "20", "21", "22", "23", "24", "25")

df2 <- data_frame(gene, sample2, expression2)

   gene  sample2 expression2
   <chr> <chr>   <chr>
 1 a     g       14.7
 2 b     g       15
 3 c     g       17
 4 d     g       16
 5 e     g       18
 6 f     g       20
 7 g     g       21
 8 h     g       22
 9 i     g       23
10 j     g       24
11 k     g       25

What I need to do is match the gene expression of each sample in df2 as closely as possible to a sample in df1 and have R report back the corresponding sample identifiers.

The result might look something like this:

   gene  sample2 expression2 sample1 expression1
   <chr> <chr>   <chr>       <chr>   <chr>
 1 a     g       14.7        b       14
 2 b     g       15          b       15
 3 c     g       17          b       16
 4 d     g       16          b       17
 5 e     g       18          b       18
 6 f     g       20          b       19
 7 g     g       21          b       20
 8 h     g       22          b       21
 9 i     g       23          b       22
10 j     g       24          b       23
11 k     g       25          b       24

I had the following idea:

library(data.table)
setDT(df1)[, expression := as.numeric(expression1)]
setDT(df2)[, expression := as.numeric(expression2)]
df1[df2, on = .(gene, expression), roll = "nearest"][, expression := NULL][]

but this matches each expression value individually rather than per sample. How should I approach this?

This is a very simple example; the real dataset has a lot of variation. So it is important that a whole sample is matched to a whole sample, and ideally I would also get back a percentage score of matching expressions (if possible).

R • 960 views

updated 2.1 years ago by rpolicastro 13k • written 2.1 years ago by janhenk2333 • 0


Depending on how dissimilar the matching samples are between the two datasets, the Euclidean distance between their expression profiles might let you resolve which samples match. After that, it's just a matter of joining the gene expression between each sample and its closest neighbor.
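
A minimal base R sketch of that idea, using the toy data from the question with expression stored as numeric. The helper to_matrix(), the tolerance tol, and the "percent within tolerance" score are illustrative assumptions rather than an established method, and the sketch assumes every sample has a value for every shared gene.

```r
# Toy data from the question, with expression as numeric instead of character.
genes <- letters[1:11]
df1 <- data.frame(
  gene = rep(genes, times = 3),
  sample1 = rep(c("a", "b", "c"), each = 11),
  expression1 = c(1:11, 14:24, 25:34, 36)
)
df2 <- data.frame(
  gene = genes,
  sample2 = "g",
  expression2 = c(14.7, 15, 17, 16, 18, 20, 21, 22, 23, 24, 25)
)

# Hypothetical helper: reshape a long table into a genes x samples matrix.
to_matrix <- function(df, sample_col, value_col) {
  tapply(df[[value_col]], list(df$gene, df[[sample_col]]), mean)
}
m1 <- to_matrix(df1, "sample1", "expression1")
m2 <- to_matrix(df2, "sample2", "expression2")

# Keep only genes present in both datasets, in the same order.
common <- intersect(rownames(m1), rownames(m2))
m1 <- m1[common, , drop = FALSE]
m2 <- m2[common, , drop = FALSE]

# Euclidean distance between every df2 sample (rows) and df1 sample (columns).
dist_mat <- matrix(
  NA_real_, nrow = ncol(m2), ncol = ncol(m1),
  dimnames = list(colnames(m2), colnames(m1))
)
for (s2 in colnames(m2)) {
  for (s1 in colnames(m1)) {
    dist_mat[s2, s1] <- sqrt(sum((m2[, s2] - m1[, s1])^2))
  }
}

# Closest df1 sample for each df2 sample.
idx <- apply(dist_mat, 1, which.min)
best <- colnames(dist_mat)[idx]
names(best) <- rownames(dist_mat)
matches <- data.frame(
  sample2 = rownames(dist_mat),
  sample1 = best,
  distance = dist_mat[cbind(seq_len(nrow(dist_mat)), idx)]
)

# Join the expression values of each sample with those of its closest match.
result <- merge(merge(df2, matches, by = "sample2"), df1,
                by = c("gene", "sample1"))

# Assumed score: percentage of genes whose values differ by at most tol.
tol <- 1  # arbitrary tolerance, chosen only for this toy example
within_tol <- abs(result$expression2 - result$expression1) <= tol
score <- tapply(within_tol, result$sample2, mean) * 100

print(matches)
print(score)
```

On real data it may be worth log-transforming or scaling the expression values before computing distances, and a correlation-based measure could stand in for the Euclidean distance if the two datasets are on different scales.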

2.1 years ago by rpolicastro 13k


An aside: data_frame is deprecated. Use either tibble() or base::data.frame().
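
As a rough illustration, the first data frame from the question could be built like this with base R. Storing the expression values as numeric rather than character is a change from the original code, made here on the assumption that they will be used in calculations.

```r
# Base R replacement for data_frame(); no extra packages are needed.
gene <- rep(letters[1:11], times = 3)
sample1 <- rep(c("a", "b", "c"), each = 11)
expression1 <- c(1:11, 14:24, 25:34, 36)  # numeric, not character

df1 <- data.frame(gene, sample1, expression1)

# With the tibble package installed, the equivalent call would be:
# df1 <- tibble::tibble(gene, sample1, expression1)

str(df1)
```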

2.1 years ago by Ram 44k