Below is a small part of my data table. Each row shown in the table represents a transcript, the first column is the chromosome name, the second column is the transcript start site, and the third column is the transcript end site.
What I want to do now is to compare the 'upper row' end site with the 'next row' start site of every two adjacent rows within the same chromosome. If the 'upper row' end site is greater than the 'next row' start site, i.e. the two transcripts overlap, merge the two transcripts into one and use the 'upper row' start site and the 'next line' end site. It should be noted that I am not comparing 'every pair' of adjacent rows, but only adjacent rows within the same chromosome; this means that adjacent rows with different first column names are not compared at all.
scahrs1_1000 84808 85809
scahrs1_1001 1 753
scahrs1_1001 14931 15932
scahrs1_1001 15232 18008
scahrs1_1001 21211 22212
scahrs1_1001 40908 41909
scahrs1_1001 63233 64234
scahrs1_1001 76009 77010
scahrs1_1002 1068 2069
scahrs1_1002 12992 13993
scahrs1_1002 40448 41449
scahrs1_1003 2227 3228
scahrs1_1003 18453 19454
scahrs1_1003 28679 29680
scahrs1_1003 41161 42162
scahrs1_1003 41735 42736
scahrs1_1003 41867 44041
scahrs1_1003 64416 65417
scahrs1_1003 71219 72220
scahrs1_1003 96090 97091
scahrs1_1003 96754 98307
scahrs1_1004 1554 2555
scahrs1_1004 29086 30087
scahrs1_1004 44100 45101
scahrs1_1004 47799 48800
scahrs1_1004 59550 60551
scahrs1_1004 69356 70357
scahrs1_1004 71809 72810
scahrs1_1004 84272 85273
scahrs1_1004 89034 90035
scahrs1_1004 98627 99628
scahrs1_1005 6695 7696
scahrs1_1005 30160 31161
scahrs1_1006 298 1299
scahrs1_1006 70134 71135
scahrs1_1006 93750 94751
scahrs1_1008 3859 4860
scahrs1_1008 5575 6576
scahrs1_1008 7072 8073
scahrs1_1008 9342 10343
scahrs1_1008 11814 12815
scahrs1_1008 15290 16291
scahrs1_1008 40167 41168
scahrs1_1008 42890 43891
scahrs1_1008 44806 45807
scahrs1_1008 74442 75443
scahrs1_1008 82112 83113
scahrs1_1008 93766 94767
scahrs1_1008 95233 96234
scahrs1_1009 8000 9001
scahrs1_1009 37369 38370
scahrs1_1009 53086 54087
scahrs1_1009 83722 84723
scahrs1_1009 83994 91045
scahrs1_1010 11341 12342
scahrs1_1010 33500 34501
scahrs1_1010 34931 35932
scahrs1_1010 37937 38938
The output of the loop I want to get is:
scahrs1_1000 84808 85809
scahrs1_1001 1 753
scahrs1_1001 14931 18008
scahrs1_1001 21211 22212
scahrs1_1001 40908 41909
scahrs1_1001 63233 64234
scahrs1_1001 76009 77010
scahrs1_1002 1068 2069
scahrs1_1002 12992 13993
scahrs1_1002 40448 41449
scahrs1_1003 2227 3228
scahrs1_1003 18453 19454
scahrs1_1003 28679 29680
scahrs1_1003 41161 44041
scahrs1_1003 64416 65417
scahrs1_1003 71219 72220
scahrs1_1003 96090 98307
scahrs1_1004 1554 2555
scahrs1_1004 29086 30087
scahrs1_1004 44100 45101
scahrs1_1004 47799 48800
scahrs1_1004 59550 60551
scahrs1_1004 69356 70357
scahrs1_1004 71809 72810
scahrs1_1004 84272 85273
scahrs1_1004 89034 90035
scahrs1_1004 98627 99628
scahrs1_1005 6695 7696
scahrs1_1005 30160 31161
Where ‘scahrs1_1001’ merges and reduces by 1 row, and ‘scahrs1_1003’ merges and reduces by 3 rows. I have many rows of data starting with 'scahrs1...'. I think it's might easy to achieve via a loop in R, could anyone give me some advice?
I'm not completely sure what you mean, but if you want to merge all overlapping ranges within the same chromosome you can use the
reduce
function fromGenomicRanges
.Your example data.
GRanges code.
Results.
I was too lazy to cross reference with your desired results so feel free to yell at me if they don't make sense.
Thank you!