EDIT: I wrote an R script to do what I want, but I feel like there must be a better/faster way. The skeleton/logic of the script is at the bottom of the post, so if you have any ideas on how to make it better, or alternate methods, please let me know.
I have a BEDgraph-type data.table with current coordinates of length 1. I want to collapse coordinates that are located within 500 bp of one another on the same chromosome IF the sign of the "meth.diff" column is the same. Additionally, for rows that are collapsed, I would like to calculate the coverage-weighted average of the mean_KO/mean_WT columns, with the weighting given by the coverage_KO/coverage_WT columns.
Here is an example of my input data:
library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))
What I would like the output to look like (except that the collapsed row means would be calculated values, not the formula strings shown here):
library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327,
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))",
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))",
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0",
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))",
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr",
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
If my request is still unclear, here it is stated explicitly:
1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:
- The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
- The two adjacent rows have "start" (or "end") coordinate within 500 of one another (if row 1 "start" = 1000, and row 2 "start" <= 1499, collapse these into a single row; if row1 = 1000 and row2 = 1500, keep separate)
- The adjacent rows must have the same sign in the "meth.diff" column (i.e. even if the "chr" values match and the starts are within 500, if meth.diff = +5 in one row and -5 in the next, keep the entries separate)
2nd: I would like to calculate the coverage-weighted averages of the collapsed mean_KO/mean_WT columns, with the weighting given by the coverage_KO/coverage_WT columns:
E.g.: collapse 2 rows,
row 1: mean_WT = 5.0, coverage_WT = 20
row 2: mean_WT = 40.0, coverage_WT = 45
weighted avg mean_WT = ((5.0*20)/(20+45)) + ((40.0*45)/(20+45)) = 29.23
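Since I'm open to any language, here is the weighted-average arithmetic above written out as a small Python function (just the calculation, not the row collapsing):

```python
# Coverage-weighted average of collapsed rows: each row's mean is
# weighted by its coverage divided by the total coverage of the group.
def weighted_mean(means, coverages):
    total = sum(coverages)
    return sum(m * c / total for m, c in zip(means, coverages))

# The worked example above: (5.0*20 + 40.0*45) / (20 + 45)
print(round(weighted_mean([5.0, 40.0], [20, 45]), 2))  # → 29.23
```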
Help with either part 1 or 2 or any advice is appreciated.
I have been using R for most of my data manipulations, but I am open to any language that can provide a solution. Thanks in advance.
EDIT: R script logic skeleton:
- read file row by row (WHILE n <= length(file))
- IF not on last row of file:
- IF rows n and n+1 are within 500 bp & meth.diff has the same sign:
- --> merge sites into a region, calculate the coverage-weighted avg, save as a single-row data.table, and n = n+2 (skip the next row (n+1); it was merged)
- ELSE:
- --> save the site as a single-row data.table, do not calculate a new mean, and n = n+1 (go to the next line; it was not merged)
- IF on last row of file:
- IF rows n, n-1 (within 500 & meth.diff same sign = TRUE) AND rows n-1, n-2 (within 500 & meth.diff same sign = TRUE):
- --> save the row as a single-row data.table, and n = n+1 (this row was not merged because the two rows before it were merged and 2 was added to n; end this round of the program; this site will be merged with the region below in the next round)
- IF rows n, n-1 (within 500 & meth.diff same sign = TRUE) AND rows n-1, n-2 (within 500 & meth.diff same sign = FALSE):
- --> merge rows n and n-1 into a region, calculate the coverage-weighted avg, and save as a single-row data.table (sites n-1/n-2 weren't merged, and n/n-1 should be merged)
- IF rows n, n-1 (within 500 & meth.diff same sign = FALSE):
- --> save the row as a single-row data.table, and n = n+1 (this site should not be merged with the site below, so save as is)
- Concatenate the data.table rows created for merged/unmerged sites, and feed this new data.table back into the function until successive iterations produce an identical result.
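As an alternative to the iterative row-by-row merge above, the whole collapse can be done in one pass: assign a group id that increments whenever the chromosome changes, the gap to the previous row is 500 bp or more, or the sign of meth.diff flips, then aggregate each group with a coverage-weighted mean. Here is a sketch of that idea in Python (the row layout and 500 bp threshold mirror the data above; it assumes rows are already sorted by chr and start, and only the mean_KO/coverage_KO pair is shown for brevity):

```python
# Single-pass collapse: cumulative "break" counter defines the groups.
from itertools import groupby

rows = [  # (chr, start, end, meth.diff, mean_KO, coverage_KO)
    ("chr1", 842326, 842327,   9.35,  9.35, 139),
    ("chr1", 855423, 855424,  19.18, 19.18,  55),
    ("chr1", 855426, 855427,  29.67, 32.96,  55),
    ("chr1", 855739, 855740, -12.34,  1.85, 270),
]

def group_ids(rows):
    gid, ids, prev = 0, [], None
    for r in rows:
        if prev is not None and not (
            r[0] == prev[0]                  # same chromosome
            and r[1] - prev[1] < 500         # starts within 500 bp
            and (r[3] > 0) == (prev[3] > 0)  # same sign of meth.diff
        ):
            gid += 1                         # break: start a new group
        ids.append(gid)
        prev = r
    return ids

def collapse(rows):
    out = []
    for _, grp in groupby(zip(group_ids(rows), rows), key=lambda t: t[0]):
        g = [r for _, r in grp]
        cov = sum(r[5] for r in g)
        wmean = sum(r[4] * r[5] for r in g) / cov  # coverage-weighted mean_KO
        out.append((g[0][0], g[0][1], g[-1][2], round(wmean, 2)))
    return out

print(collapse(rows))
```

On the chr1 rows of the example data this yields the same regions as the desired output: rows 2 and 3 collapse to 855423-855427 with weighted mean_KO = 26.07, while the sign flip keeps row 4 separate. The same cumulative-break trick would work in data.table itself (grouping by a cumsum over the break conditions), avoiding the repeated passes of the WHILE loop.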
Thanks Alex! This step-by-step was exactly what I needed to get it to work with BEDOPS. I should be able to implement this now.