Hi all,
I am new here and trying to figure out a specific task:
Let's say I have the following data_frame:
gene = c("a","b","c","d","e","f","g","h","i","j","k", "a","b","c","d","e","f","g","h","i","j","k", "a","b","c","d","e","f","g","h","i","j","k")
sample1 = c("a","a","a","a","a","a","a","a","a","a", "a","b","b","b","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c","c","c","c")
expression1 = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24","25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "36")
df1 <- data_frame(gene, sample1, expression1)
This is a fraction of a very big dataset I have containing 20k genes and 700 samples with expression for all genes.
Also, I have a different data_frame containing samples with corresponding gene expression values but different sample identifiers, let's say the following:
gene = c("a","b","c","d","e","f","g","h","i","j","k")
sample2 = c("g","g","g","g","g","g","g","g","g","g","g")
expression2 = c("14.7", "15", "17", "16", "18", "20", "21", "22", "23", "24", "25")
df2 <- data_frame(gene, sample2, expression2)
gene sample2 expression2
<chr> <chr> <chr>
1 a g 14.7
2 b g 15
3 c g 17
4 d g 16
5 e g 18
6 f g 20
7 g g 21
8 h g 22
9 i g 23
10 j g 24
11 k g 25
What I need to do is match the gene expression from df2 as closely as possible to a sample within df1 and make R report back the corresponding sample identifiers.
It would look something like this maybe:
gene sample2 expression2 sample1 expression1
<chr> <chr> <chr> <chr> <chr>
1 a g 14.7 b 14
2 b g 15 b 15
3 c g 17 b 16
4 d g 16 b 17
5 e g 18 b 18
6 f g 20 b 19
7 g g 21 b 20
8 h g 22 b 21
9 i g 23 b 22
10 j g 24 b 23
11 k g 25 b 24
I had the following idea:
library(data.table)
setDT(df1)[, expression := as.numeric(expression1)]
setDT(df2)[, expression := as.numeric(expression2)]
df1[df2, on = .(gene, expression), roll = "nearest"][, expression := NULL][]
but this matches the expression individually and not per sample. How should I approach this?
This is a very simple example; the real dataset has a lot of variation. So it is important that it matches a whole sample to a whole sample, and maybe gives me back a percentage score of matching expressions (if possible).
Depending on how dissimilar the matching samples are between the two datasets, the Euclidean distance between whole expression profiles (computed across all genes) might let you resolve which samples match. After that it's just a matter of joining the gene expression between each sample and its closest neighbor.
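To make that concrete, here is a minimal base R sketch of the distance-then-join idea on the toy data from the question. It assumes expression is stored as numeric, that each gene/sample pair occurs once, and that both tables share the same genes. The helper to_matrix() and the tolerance tol used for the "percentage" score are made up for illustration, so treat this as a starting point rather than a finished solution.

```r
# Toy data from the question, with expression stored as numeric.
genes <- letters[1:11]
df1 <- data.frame(
  gene        = rep(genes, times = 3),
  sample1     = rep(c("a", "b", "c"), each = 11),
  expression1 = c(1:11, 14:24, 25:34, 36)
)
df2 <- data.frame(
  gene        = genes,
  sample2     = "g",
  expression2 = c(14.7, 15, 17, 16, 18, 20, 21, 22, 23, 24, 25)
)

# Hypothetical helper: reshape long data into a genes x samples matrix.
to_matrix <- function(df, sample_col, value_col) {
  tapply(df[[value_col]], list(df$gene, df[[sample_col]]), mean)
}
m1 <- to_matrix(df1, "sample1", "expression1")
m2 <- to_matrix(df2, "sample2", "expression2")

# Keep only genes present in both tables, in the same row order.
common <- intersect(rownames(m1), rownames(m2))
m1 <- m1[common, , drop = FALSE]
m2 <- m2[common, , drop = FALSE]

# Euclidean distance from every df2 sample (columns) to every df1 sample (rows).
dists <- matrix(NA_real_, nrow = ncol(m1), ncol = ncol(m2),
                dimnames = list(colnames(m1), colnames(m2)))
for (s2 in colnames(m2)) {
  dists[, s2] <- sqrt(colSums((m1 - m2[, s2])^2))
}

# Closest df1 sample for each df2 sample.
matches <- data.frame(
  sample2  = colnames(dists),
  sample1  = rownames(dists)[apply(dists, 2, which.min)],
  distance = apply(dists, 2, min)
)

# Join the expression values of each sample with those of its closest neighbor.
result <- merge(merge(df2, matches, by = "sample2"),
                df1, by = c("gene", "sample1"))

# One possible "percentage" score (an assumption, not a standard measure):
# share of genes whose expression differs by at most tol.
tol <- 1
result$within_tol <- abs(result$expression2 - result$expression1) <= tol
score <- aggregate(within_tol ~ sample2 + sample1, data = result,
                   FUN = function(x) 100 * mean(x))

print(matches)
print(score)
```

With 20k genes and 700 samples the matrices should still fit comfortably in memory. If the two datasets are on different scales, you may want to normalise them first or use a correlation-based measure instead of raw Euclidean distance.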
An aside: data_frame is deprecated. Use either tibble() or base::data.frame().