Consider writing everything to a BED file, do the set operations to split the files, and then do calculations back in R or another scripting language.
In R, we read in the sample data:
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))
Again in R, we write out the sample data to a BED file. This file is unsorted, so we call it unsorted:
write.table(dt, file='elements.unsorted.bed', col.names=F, row.names=F, quote=F, sep="\t")
On the command line, sort the BED file, to make it a sorted BED file:
$ sort-bed elements.unsorted.bed > elements.bed
Split the elements.bed
file by the sign of the meth.diff
column:
$ awk '$4 < 0' elements.bed > elements.neg.bed
$ awk '$4 >= 0' elements.bed > elements.pos.bed
Here is what these two subsets look like:
$ more elements.neg.bed
chr1 855739 855740 -12.3375577709254 1.8512250859083 14.1887828568337 270 444
$ more elements.pos.bed
chr1 842326 842327 9.35200555410902 9.35200555410902 0 139 120
chr1 855423 855424 19.1839617944039 19.1839617944039 0 55 86
chr1 855426 855427 29.6734426495636 32.962962583692 3.28951993412841 55 87
chrX 153880833 153880834 50.5830043986142 51.2741224212646 0.69111802265039 195 291
chrX 153880841 153880842 52.7503561092491 53.0928367727283 0.34248066347919 194 293
chrX 154298086 154298087 46.5783738475184 47.4901932463221 0.91181939880374 131 181
chrX 154298089 154298090 41.8662800742733 44.8441659366298 2.97788586235646 131 181
Let's work with the subset of positive elements:
$ bedmap --range 500 --echo-map --multidelim '\n' elements.pos.bed | sort-bed - | uniq > posMap.bed
Each element of posMap.bed
contains unique elements that are within 500 bp of at least one other element in that set. We add uniq
to get unique elements, because an element always, by definition, overlaps itself and includes itself in the map result - and we only need one copy of it.
chr1 842326 842327 9.35200555410902 9.35200555410902 0 139 120
chr1 855423 855424 19.1839617944039 19.1839617944039 0 55 86
chr1 855426 855427 29.6734426495636 32.962962583692 3.28951993412841 55 87
chrX 153880833 153880834 50.5830043986142 51.2741224212646 0.69111802265039 195 291
chrX 153880841 153880842 52.7503561092491 53.0928367727283 0.34248066347919 194 293
chrX 154298086 154298087 46.5783738475184 47.4901932463221 0.91181939880374 131 181
chrX 154298089 154298090 41.8662800742733 44.8441659366298 2.97788586235646 131 181
Further, each element of this BED file also contains the score values for columns meth.diff
, mean_KO
, mean_WT
, coverage_KO
and coverage_WT
.
We'll preserve these score columns throughout the pipeline, so that you can use them later on when we are done with set operations.
We filter the elements in this set for any that are too close to elements in the elements.neg.bed
set - within 500 bases.
$ bedops --everything --range 500 posMap.bed | bedops --not-element-of 1 - elements.neg.bed | bedops --everything --range -500 - > filtered.posMap.bed
Each element of filtered.posMap.bed
is now guaranteed to fall at least more than 500 bases away from an element with a negative meth.diff
score.
$ more filtered.posMap.bed
chr1 842326 842327 9.35200555410902 9.35200555410902 0 139 120
chrX 153880833 153880834 50.5830043986142 51.2741224212646 0.69111802265039 195 291
chrX 153880841 153880842 52.7503561092491 53.0928367727283 0.34248066347919 194 293
chrX 154298086 154298087 46.5783738475184 47.4901932463221 0.91181939880374 131 181
chrX 154298089 154298090 41.8662800742733 44.8441659366298 2.97788586235646 131 181
The next goal is to now go back and do a second pass, which finds any positive-meth-diff scoring elements that are within 500 bases of each other:
$ bedmap --range 500 --echo-map --delim '|' --multidelim '|' filtered.posMap.bed | uniq > filtered.posMap.withElementsWithin500Bases.txt
The text file filtered.posMap.withElementsWithin500Bases.txt
contains a list containing a unique map of an element to any potential neighboring elements.
This is no longer a BED file, but we don't care at this point, because we don't need to do any more set operations. The next step involves bringing these elements back in R for calculation. You could also use Python or Perl or any other scripting language, at this stage.
The |
delimiter character indicates where there are two or more neighboring elements.
$ more filtered.posMap.withElementsWithin500Bases.bed
chr1 842326 842327 9.35200555410902 9.35200555410902 0 139 120
chrX 153880833 153880834 50.5830043986142 51.2741224212646 0.69111802265039 195 291|chrX 153880841 153880842 52.7503561092491 53.0928367727283 0.34248066347919 194 293
chrX 154298086 154298087 46.5783738475184 47.4901932463221 0.91181939880374 131 181|chrX 154298089 154298090 41.8662800742733 44.8441659366298 2.97788586235646 131 181
In this example, therefore, the first line has no neighboring elements. You could calculate the weighted mean from this element alone, or choose to filter it out if you wish.
The second and third lines each show a pair of elements that are within 500 bases of one another, with a guarantee that no negative-meth-diff-scoring elements are within 500 bases of either of them.
Since this modified approach gives you all the score columns, you can read these into R/Python/etc. and generate your weighted mean from the desired columns.
One approach is to use the |
delimiter to split up grouped elements into two (or more) separate data frames (say you are using R). You may need more than two frames, if it turns out more than two filtered elements are close to each other and still far away from negative-meth elements. The above result is just an example of a possible outcome.
Once you have all the elements separated, loop through each row to access the frame columns you need from each of the 2..n data frames and calculate the weighted mean or other statistic.
Thanks Alex! This step-by-step was exactly what I needed to get it to work with BEDOPS. I should be able to implement this now.