select rows with specific columns and values between two data tables ?
2
0
Entering edit mode
5.5 years ago
star ▴ 350

I have two sorted data tables (A and B), in table A, 3 first columns are genomic coordinates. I like to extract some specific rows from A table but based on some common columns/values of B table.

For example: First, I would like to match column 6 of A table and column 1 of B table, then if there is a 1 value in the third column of B table, select those rows (from A table) which have the same value (first match) in the fourth column of A and B tables and the coordinates before. But if there is a - 1 value in the third column of B table, select those rows (from A table) which have the same value (last match) in the fourth column of A and B tables and the coordinates after.

Input A:

17  50094737    50094898    1   ENST00000007708 ENSG00000005882 
17  50096132    50096286    2   ENST00000007708 ENSG00000005882 
17  50097423    50097564    3   ENST00000007708,ENST00000503176,ENST00000614357 ENSG00000005882
17  50105371    50105442    3   ENST00000007708,ENST00000503176,ENST00000614357 ENSG00000005882
17  50109957    50111058    1   ENST00000007708 ENSG00000005882 
17  50109957    50111368    2   ENST00000614357 ENSG00000005882 
17  50109957    50112152    1   ENST00000503176 ENSG00000005882 
8   22987417    22987563    2   ENST00000519685 ENSG00000008853 
8   22999579    23000105    1   ENST00000251822 ENSG00000008853 
8   23004425    23004626    2   ENST00000251822,ENST00000519685 ENSG00000008853
8   23005372    23005475    2   ENST00000251822,ENST00000519685 ENSG00000008853
8   23006728    23007746    2   ENST00000251822,ENST00000519685 ENSG00000008853
8   23015638    23015743    2   ENST00000251822,ENST00000519685 ENSG00000008853
8   23017252    23020199    1   ENST00000251822 ENSG00000008853 
X   119539386   119539789   1   ENST00000644802 ENSG00000018610 
X   119541328   119541420   2   ENST00000320339,ENST00000644802 ENSG00000018610
X   119544353   119544491   2   ENST00000320339,ENST00000644802 ENSG00000018610
X   119560268   119560385   2   ENST00000320339,ENST00000644802 ENSG00000018610
X   119565232   119565401   1   ENST00000644802 ENSG00000018610 
10  14518560    14521306    1   ENST00000181796 ENSG00000065809 
10  14521158    14521306    1   ENST00000479731 ENSG00000065809 
10  14521869    14522019    8    ENST00000181796,ENST00000378458,ENST00000378467,ENST00000378470,ENST00000468747,ENST00000478076,ENST00000479731,ENST00000622567    ENSG00000065809
10  14530332    14530515    8    ENST00000181796,ENST00000378458,ENST00000378467,ENST00000378470,ENST00000468747,ENST00000478076,ENST00000479731,ENST00000622567    ENSG00000065809
10  14553322    14553387    2   ENST00000378458,ENST00000378467 ENSG00000065809
10  14571763    14572189    2   ENST00000378458,ENST00000622567 ENSG00000065809
10  14572229    14572314    1   ENST00000479731 ENSG00000065809 
10  14667634    14667691    1   ENST00000181796 ENSG00000065809 
10  14774253    14774897    1   ENST00000181796 ENSG00000065809

Input B:

ENSG00000005882 PDK2     1  3
ENSG00000008853 RHOBTB2  1  2
ENSG00000018610 CXorf56 -1  2
ENSG00000065809 FAM107B -1  8

Desire Output:

17  50094737    50094898    1   ENST00000007708 ENSG00000005882 
17  50096132    50096286    2   ENST00000007708 ENSG00000005882 
17  50097423    50097564    3   ENST00000007708,ENST00000503176,ENST00000614357 ENSG00000005882
8   22987417    22987563    2   ENST00000519685 ENSG00000008853 
X   119560268   119560385   2   ENST00000320339,ENST00000644802 ENSG00000018610
X   119565232   119565401   1   ENST00000644802 ENSG00000018610 
10  14530332    14530515    8   ENST00000181796,ENST00000378458,ENST00000378467,ENST00000378470,ENST00000468747,ENST00000478076,ENST00000479731,ENST00000622567 ENSG00000065809
10  14553322    14553387    2   ENST00000378458,ENST00000378467 ENSG00000065809
10  14571763    14572189    2   ENST00000378458,ENST00000622567 ENSG00000065809
10  14572229    14572314    1   ENST00000479731 ENSG00000065809 
10  14667634    14667691    1   ENST00000181796 ENSG00000065809 
10  14774253    14774897    1   ENST00000181796 ENSG00000065809

Thanks a lot in advance for any possible solution!

R linux if-statement dplyr ifelse • 2.0k views
ADD COMMENT
1
Entering edit mode

Hi star, show us what you have tried, now it is like you want us to do your job...

ADD REPLY
0
Entering edit mode

I have tried with If else but, I do not know how to select first and the last match for 1 and -1 value or the coordinates after and before them.

It the cod that I have tried but it is not what I want:

if (A$V6 %in% B$V1 & B$V3 == '1') {
  A[which(A$V4 == B$V4),] & A[which(A$V4 < B$V4),]
} else if  (A$V6 %in% B$V1 & B$V3 == '- 1') {
  A[which(A$V4 == B$V4),] & A[which(A$V4 > B$V4),]
} else {
  print ("NA")
}
ADD REPLY
1
Entering edit mode

I think learning some basic tricks with dplyr would be useful for you.

NB. There seems to be commas in your first table, but not consistent. R might get troubles with that.

ADD REPLY
0
Entering edit mode
dfa = A 
dfb = B  
dfa[which(dfa$V6 == dfb$V1 | (dfb$V3 == 1 & dfb$V4 == dfa$V4) | (dfb$V3 == -1 & dfb$V4 == dfa$V4)),]

I get a slightly different output than your desired one, but maybe this helps you already

ADD REPLY
2
Entering edit mode
5.5 years ago
Chirag Parsania ★ 2.0k
library(tidyverse)

NOTE Due to character limit i removed chr 8 and 10 from input data.

get input_A in R

   ## get input_A in R
input_A <- tibble::tribble(
        ~col_1,    ~col_2,      ~col_3, ~col_4,                                                                                                                            ~col_5,            ~col_6,
        "17",  50094737,  50094898,      1,                                                                                                                 "ENST00000007708", "ENSG00000005882",
        "17",  50096132,  50096286,      2,                                                                                                                 "ENST00000007708", "ENSG00000005882",
        "17",  50097423,  50097564,      3,                                                                                 "ENST00000007708,ENST00000503176,ENST00000614357", "ENSG00000005882",
        "17",  50105371,  50105442,      3,                                                                                 "ENST00000007708,ENST00000503176,ENST00000614357", "ENSG00000005882",
        "17",  50109957,  50111058,      1,                                                                                                                 "ENST00000007708", "ENSG00000005882",
        "17",  50109957,  50111368,      2,                                                                                                                 "ENST00000614357", "ENSG00000005882",
        "17",  50109957,  50112152,      1,                                                                                                                 "ENST00000503176", "ENSG00000005882",
         "X", 119539386, 119539789,      1,                                                                                                                 "ENST00000644802", "ENSG00000018610",
         "X", 119541328, 119541420,      2,                                                                                                 "ENST00000320339,ENST00000644802", "ENSG00000018610",
         "X", 119544353, 119544491,      2,                                                                                                 "ENST00000320339,ENST00000644802", "ENSG00000018610",
         "X", 119560268, 119560385,      2,                                                                                                 "ENST00000320339,ENST00000644802", "ENSG00000018610",
         "X", 119565232, 119565401,      1,                                                                                                                 "ENST00000644802", "ENSG00000018610",
        )%>% 
        rename_all(~(paste(. , "A"  , sep = "_")))

get input_B in R

input_B  <- tibble::tribble(
                   ~col_1,    ~col_2, ~col_3, ~col_4,
        "ENSG00000005882",    "PDK2",      1,      3,
        "ENSG00000008853", "RHOBTB2",      1,      2,
        "ENSG00000018610", "CXorf56",     -1,      2,
        "ENSG00000065809", "FAM107B",     -1,      8
        ) %>% 
        rename_all(~(paste(. , "B"  , sep = "_")))

if there is a 1 value in the 3rd column of B table

out_1  <- input_A %>% left_join(input_B ,  by = c( "col_6_A" = "col_1_B")) %>% 
        filter(col_3_B == 1) %>% ## value 1 in  third column of B table
        filter(col_4_A == col_4_B) ##  rows (from A table) which have the same value in the fourth column of A and B tables

out_1

if there is a -1 value in the 3rd column of B table

out_2  <- input_A %>% left_join(input_B ,  by = c( "col_6_A" = "col_1_B")) %>% 
        filter(col_3_B == -1) %>% ##  value -1 in  third column of B table
        filter(col_4_A == col_4_B) ##   rows (from A table) which have the same value in the fourth column of A and B tables
out_2

What is don't get here is

first match

and

last match

stuff. However, I believe, out_1 and out_2 should be enough to get desire rows and columns.

ADD COMMENT

Login before adding your answer.

Traffic: 3871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6