EDIT: I wrote an R script to do what I want, but I feel like there must be a better/faster way. The skeleton/logic of the script is at the bottom of the post, so if you have any ideas on how to make it better, or alternate methods, please let me know.
I have a BEDgraph-type data.table with current coordinates of length 1. I want to collapse coordinates that are located within 500 bp of one another on the same chromosome IF the sign of the "meth.diff" column is the same. Additionally, for rows that are collapsed, I would like to calculate the coverage-weighted average of the mean_KO/mean_WT columns, with the weighting given by the coverage_KO/coverage_WT columns.
Here is an example of my input data:
library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))
What I would like the output to look like (except that the collapsed row means would be calculated values, not the formula strings shown here):
library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327,
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))",
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))",
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0",
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))",
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr",
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
If my request is still unclear, here it is stated explicitly:
1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:
- The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
- The two adjacent rows have "start" (or "end") coordinate within 500 of one another (if row 1 "start" = 1000, and row 2 "start" <= 1499, collapse these into a single row; if row1 = 1000 and row2 = 1500, keep separate)
- The adjacent rows must have the same sign in the "meth.diff" column (i.e. even if the "chr" values match and the starts are within 500, if meth.diff = +5 in one row and -5 in the next, keep the entries separate)
2nd: I would like to calculate the coverage-weighted averages of the collapsed mean_KO/mean_WT columns, with the weighting given by the coverage_KO/coverage_WT columns:
E.g.: collapse 2 rows,
row 1: mean_WT = 5.0, coverage_WT = 20
row 2: mean_WT = 40.0, coverage_WT = 45
weighted avg mean_WT = ((5.0*20)/(20+45)) + ((40.0*45)/(20+45)) = 29.23
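Since I'm open to any language, here is the weighted-average arithmetic above written out as a small Python function (just the calculation, not the row collapsing):

```python
# Coverage-weighted average of collapsed rows: each row's mean is
# weighted by its coverage divided by the total coverage of the group.
def weighted_mean(means, coverages):
    total = sum(coverages)
    return sum(m * c / total for m, c in zip(means, coverages))

# The worked example above: (5.0*20 + 40.0*45) / (20 + 45)
print(round(weighted_mean([5.0, 40.0], [20, 45]), 2))  # → 29.23
```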
Help with either part 1 or 2 or any advice is appreciated.
I have been using R for most of my data manipulations, but I am open to any language that can provide a solution. Thanks in advance.
EDIT: R script logic skeleton:
- read file row by row (WHILE n <= length(file))
- IF not on last row of file:
- IF rows n and n+1 are within 500 bp & meth.diff has the same sign:
- --> merge sites into a region, calculate the coverage-weighted avg, save as a single-row data.table, and n = n+2 (skip the next row (n+1); it was merged)
- ELSE:
- --> save the site as a single-row data.table, do not calculate a new mean, and n = n+1 (go to the next line; it was not merged)
- IF on last row of file:
- IF rows n, n-1 (within 500 & meth.diff same sign = TRUE) AND rows n-1, n-2 (within 500 & meth.diff same sign = TRUE):
- --> save the row as a single-row data.table, and n = n+1 (this row was not merged because the two rows before it were merged and 2 was added to n; end this round of the program; this site will be merged with the region below in the next round)
- IF rows n, n-1 (within 500 & meth.diff same sign = TRUE) AND rows n-1, n-2 (within 500 & meth.diff same sign = FALSE):
- --> merge rows n and n-1 into a region, calculate the coverage-weighted avg, and save as a single-row data.table (sites n-1/n-2 weren't merged, and n/n-1 should be merged)
- IF rows n, n-1 (within 500 & meth.diff same sign = FALSE):
- --> save the row as a single-row data.table, and n = n+1 (this site should not be merged with the site below, so save as is)
- Concatenate the data.table rows created for merged/unmerged sites, and feed this new data.table back into the function until successive iterations produce an identical result.
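As an alternative to the iterative row-by-row merge above, the whole collapse can be done in one pass: assign a group id that increments whenever the chromosome changes, the gap to the previous row is 500 bp or more, or the sign of meth.diff flips, then aggregate each group with a coverage-weighted mean. Here is a sketch of that idea in Python (the row layout and 500 bp threshold mirror the data above; it assumes rows are already sorted by chr and start, and only the mean_KO/coverage_KO pair is shown for brevity):

```python
# Single-pass collapse: cumulative "break" counter defines the groups.
from itertools import groupby

rows = [  # (chr, start, end, meth.diff, mean_KO, coverage_KO)
    ("chr1", 842326, 842327,   9.35,  9.35, 139),
    ("chr1", 855423, 855424,  19.18, 19.18,  55),
    ("chr1", 855426, 855427,  29.67, 32.96,  55),
    ("chr1", 855739, 855740, -12.34,  1.85, 270),
]

def group_ids(rows):
    gid, ids, prev = 0, [], None
    for r in rows:
        if prev is not None and not (
            r[0] == prev[0]                  # same chromosome
            and r[1] - prev[1] < 500         # starts within 500 bp
            and (r[3] > 0) == (prev[3] > 0)  # same sign of meth.diff
        ):
            gid += 1                         # break: start a new group
        ids.append(gid)
        prev = r
    return ids

def collapse(rows):
    out = []
    for _, grp in groupby(zip(group_ids(rows), rows), key=lambda t: t[0]):
        g = [r for _, r in grp]
        cov = sum(r[5] for r in g)
        wmean = sum(r[4] * r[5] for r in g) / cov  # coverage-weighted mean_KO
        out.append((g[0][0], g[0][1], g[-1][2], round(wmean, 2)))
    return out

print(collapse(rows))
```

On the chr1 rows of the example data this yields the same regions as the desired output: rows 2 and 3 collapse to 855423-855427 with weighted mean_KO = 26.07, while the sign flip keeps row 4 separate. The same cumulative-break trick would work in data.table itself (grouping by a cumsum over the break conditions), avoiding the repeated passes of the WHILE loop.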
Thanks Alex! This step-by-step was exactly what I needed to get it to work with BEDOPS. I should be able to implement this now.