How to use R's loop to get the part of the aggregate matrix other than the many subset matrices?
1
4
Entering edit mode
2.5 years ago
Chilly ★ 1.3k

I have two matrices of pure numbers, and their format is the same: the first column is the group name, the second is the starting number, and the third is the ending number. Each group is the name of a chromosome. All data were transformed from fasta data. I am trying to generate a blacklist(table3) with the whitelist(table2) by taking the whole genome(table1) as the base.

E.g (*spaces in the following tables represent column change): A row of table1:

scahrs1_999 1 12
#Means the total set of numbers for the 'scahrs1_999' group is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.

Some rows of table2:

scahrs1_999 2 3
scahrs1_999 6 8
scahrs1_999 11 12
#Means that the subsets of numbers in the 'scahrs1_999' group are '2, 3', '5, 6, 7', and '10, 11'.

The result I want to get is a set of numbers that do not contain any subsets in the total set, and behaves the same as a subset of table2's consecutive numbers. which is:

table3 (results):

scahrs1_999 1
scahrs1_999 4 5
scahrs1_999 9 10
scahrs1_999 13 14

Also exclude subsets with only 1 number. which is:

table3 (final result):

scahrs1_999 4 5
scahrs1_999 9 10
scahrs1_999 13 14

As shown below, I have thousands of groups similar to the above 'scahrs1_999'. Obviously, I can't calculate them one by one. I know that R can loop through each group with a 'loop program' and get the corresponding result. But my programming ability is not up to this complex job.

Could someone give me some advice?

> Table1

1  scahrs1_1001  1     81142
2  scahrs1_1002  1     62661
3  scahrs1_1003  1    104891
4  scahrs1_1004  1     99296
5  scahrs1_1005  1     30919
6  scahrs1_1006  1     97599
7  scahrs1_1008  1     97078
8  scahrs1_1009  1     96958
9  scahrs1_1010  1     45677

> Table2

1  scahrs1_1001      1    753
2  scahrs1_1001  14931  15932
3  scahrs1_1001  17007  18008
4  scahrs1_1001  21211  22212
5  scahrs1_1001  40908  41909
6  scahrs1_1001  63233  64234
7  scahrs1_1001  76009  77010
8  scahrs1_1002   1068   2069
9  scahrs1_1002  12992  13993
10 scahrs1_1002  40448  41449
11 scahrs1_1003   2227   3228
12 scahrs1_1003  18453  19454
13 scahrs1_1003  28679  29680
14 scahrs1_1003  41161  42162
15 scahrs1_1003  41735  42736
16 scahrs1_1003  43040  44041
17 scahrs1_1003  64416  65417
18 scahrs1_1003  71219  72220
19 scahrs1_1003  96090  97091
20 scahrs1_1003  97306  98307
21 scahrs1_1004   1554   2555
22 scahrs1_1004  29086  30087
23 scahrs1_1004  44100  45101
24 scahrs1_1004  47799  48800
25 scahrs1_1004  59550  60551
26 scahrs1_1004  69356  70357
27 scahrs1_1004  71809  72810
28 scahrs1_1004  84272  85273
29 scahrs1_1004  89034  90035
30 scahrs1_1004  98627  99628
31 scahrs1_1005   6695   7696
32 scahrs1_1005  30160  31161
33 scahrs1_1006    298   1299
34 scahrs1_1006  70134  71135
35 scahrs1_1006  93750  94751
36 scahrs1_1008   3859   4860
37 scahrs1_1008   5575   6576
38 scahrs1_1008   7072   8073
39 scahrs1_1008   9342  10343
40 scahrs1_1008  11814  12815
41 scahrs1_1008  15290  16291
42 scahrs1_1008  40167  41168
43 scahrs1_1008  42890  43891
44 scahrs1_1008  44806  45807
45 scahrs1_1008  74442  75443
46 scahrs1_1008  82112  83113
47 scahrs1_1008  93766  94767
48 scahrs1_1008  95233  96234
49 scahrs1_1009   8000   9001
50 scahrs1_1009  37369  38370
51 scahrs1_1009  53086  54087
52 scahrs1_1009  83722  84723
53 scahrs1_1009  90044  91045
54 scahrs1_1010  11341  12342
55 scahrs1_1010  33500  34501
56 scahrs1_1010  34931  35932
57 scahrs1_1010  37937  38938
aggregate r matrix loop subset • 1.8k views
ADD COMMENT
0
Entering edit mode

How is this related to bioinformatics?

ADD REPLY
4
Entering edit mode

Yes. In fact, each group is the name of a chromosome. All data were transformed from fasta data. I am trying to generate a blacklist(table3) with the whitelist(table2) by taking the whole genome(table1) as the base.

ADD REPLY
1
Entering edit mode

Please edit your question and add this information at the beginning.

ADD REPLY
4
Entering edit mode

Thx, I have updated it.

ADD REPLY
0
Entering edit mode

I am not entirely sure what you are trying to achieve, but I think the Genomic Ranges package, and it's interval operations (e.g. setdiff) respectively looping capabilities will help you a lot to implement it.

ADD REPLY
4
Entering edit mode
2.5 years ago

Usually you want to operate on genomic ranges in R with GRanges objects.

Your example data.

table1 <- structure(list(V1 = "scahrs1_999", V2 = 1L, V3 = 12L), class = "data.frame", row.names = c(NA, 
-1L))

table2 <- structure(list(V1 = c("scahrs1_999", "scahrs1_999", "scahrs1_999"
), V2 = c(2L, 6L, 11L), V3 = c(3L, 8L, 12L)), class = "data.frame", row.names = c(NA, 
-3L))

GRanges answer.

library("GenomicRanges")

gr1 <- makeGRangesFromDataFrame(table1, seqnames.field="V1", start.field="V2", end.field="V3")
gr2 <- makeGRangesFromDataFrame(table2, seqnames.field="V1", start.field="V2", end.field="V3")

result <- setdiff(gr1, gr2)
result <- result[width(result) > 1]

And the result of the above code.

> result
GRanges object with 2 ranges and 0 metadata columns:
         seqnames    ranges strand
            <Rle> <IRanges>  <Rle>
  [1] scahrs1_999       4-5      *
  [2] scahrs1_999      9-10      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

There is a slight difference from your results because it doesn't follow from your data why there should be a range from 13-14.

ADD COMMENT
2
Entering edit mode

Thx! Perfectly solved my problem!

ADD REPLY

Login before adding your answer.

Traffic: 1891 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6