I have two matrices of pure numbers, and their format is the same: the first column is the group name, the second is the starting number, and the third is the ending number. Each group is the name of a chromosome. All data were transformed from fasta data. I am trying to generate a blacklist(table3) with the whitelist(table2) by taking the whole genome(table1) as the base.
E.g (*spaces in the following tables represent column change): A row of table1:
scahrs1_999 1 12
#Means the total set of numbers for the 'scahrs1_999' group is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
Some rows of table2:
scahrs1_999 2 3
scahrs1_999 6 8
scahrs1_999 11 12
#Means that the subsets of numbers in the 'scahrs1_999' group are '2, 3', '5, 6, 7', and '10, 11'.
The result I want to get is a set of numbers that do not contain any subsets in the total set, and behaves the same as a subset of table2's consecutive numbers. which is:
table3 (results):
scahrs1_999 1
scahrs1_999 4 5
scahrs1_999 9 10
scahrs1_999 13 14
Also exclude subsets with only 1 number. which is:
table3 (final result):
scahrs1_999 4 5
scahrs1_999 9 10
scahrs1_999 13 14
As shown below, I have thousands of groups similar to the above 'scahrs1_999'. Obviously, I can't calculate them one by one. I know that R can loop through each group with a 'loop program' and get the corresponding result. But my programming ability is not up to this complex job.
Could someone give me some advice?
> Table1
1 scahrs1_1001 1 81142
2 scahrs1_1002 1 62661
3 scahrs1_1003 1 104891
4 scahrs1_1004 1 99296
5 scahrs1_1005 1 30919
6 scahrs1_1006 1 97599
7 scahrs1_1008 1 97078
8 scahrs1_1009 1 96958
9 scahrs1_1010 1 45677
> Table2
1 scahrs1_1001 1 753
2 scahrs1_1001 14931 15932
3 scahrs1_1001 17007 18008
4 scahrs1_1001 21211 22212
5 scahrs1_1001 40908 41909
6 scahrs1_1001 63233 64234
7 scahrs1_1001 76009 77010
8 scahrs1_1002 1068 2069
9 scahrs1_1002 12992 13993
10 scahrs1_1002 40448 41449
11 scahrs1_1003 2227 3228
12 scahrs1_1003 18453 19454
13 scahrs1_1003 28679 29680
14 scahrs1_1003 41161 42162
15 scahrs1_1003 41735 42736
16 scahrs1_1003 43040 44041
17 scahrs1_1003 64416 65417
18 scahrs1_1003 71219 72220
19 scahrs1_1003 96090 97091
20 scahrs1_1003 97306 98307
21 scahrs1_1004 1554 2555
22 scahrs1_1004 29086 30087
23 scahrs1_1004 44100 45101
24 scahrs1_1004 47799 48800
25 scahrs1_1004 59550 60551
26 scahrs1_1004 69356 70357
27 scahrs1_1004 71809 72810
28 scahrs1_1004 84272 85273
29 scahrs1_1004 89034 90035
30 scahrs1_1004 98627 99628
31 scahrs1_1005 6695 7696
32 scahrs1_1005 30160 31161
33 scahrs1_1006 298 1299
34 scahrs1_1006 70134 71135
35 scahrs1_1006 93750 94751
36 scahrs1_1008 3859 4860
37 scahrs1_1008 5575 6576
38 scahrs1_1008 7072 8073
39 scahrs1_1008 9342 10343
40 scahrs1_1008 11814 12815
41 scahrs1_1008 15290 16291
42 scahrs1_1008 40167 41168
43 scahrs1_1008 42890 43891
44 scahrs1_1008 44806 45807
45 scahrs1_1008 74442 75443
46 scahrs1_1008 82112 83113
47 scahrs1_1008 93766 94767
48 scahrs1_1008 95233 96234
49 scahrs1_1009 8000 9001
50 scahrs1_1009 37369 38370
51 scahrs1_1009 53086 54087
52 scahrs1_1009 83722 84723
53 scahrs1_1009 90044 91045
54 scahrs1_1010 11341 12342
55 scahrs1_1010 33500 34501
56 scahrs1_1010 34931 35932
57 scahrs1_1010 37937 38938
How is this related to bioinformatics?
Yes. In fact, each group is the name of a chromosome. All data were transformed from fasta data. I am trying to generate a blacklist(table3) with the whitelist(table2) by taking the whole genome(table1) as the base.
Please edit your question and add this information at the beginning.
Thx, I have updated it.
I am not entirely sure what you are trying to achieve, but I think the Genomic Ranges package, and it's interval operations (e.g. setdiff) respectively looping capabilities will help you a lot to implement it.