Merge the per-base signal (scores):
$ bedops --merge per-base-signal.bed > merged-pbs.bed
Invert the merged regions to get the "mirror" image. This "mirror" is the set of gaps in between merged stretches of signal.
Assuming hg38
, say:
$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' > hg38.bed
$ bedops --difference hg38.bed merged-pbs.bed > gaps-pbs.bed
This file gaps-pbs.bed
contains the gaps between merged stretches of signal, as measured over the bounds of reference genome hg38
.
Now map your TSS windows to those gaps with bedmap
, using the --echo-map-size
option to print out all the sizes of mapped elements (gaps). Pass this result to awk
to test if any of the gap sizes are larger than 500 bases.
If any such gap qualifies, just skip the current line and go to the next line. If no such gap is found, then the TSS window is "okay" and gets written to standard output:
$ bedmap --echo --echo-map-size --fraction-map 1 tss-windows.bed gaps-pbs.bed | awk -F'|' '{ flag=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if (a[i] >= 500) { flag=1; break; } } if (flag==0) { print $1; } }' > filtered-tss-windows.bed
Using --fraction-map 1
ensures that any overlapping gap will be entirely contained within the TSS window. Gaps which fall off the edge of the window are not considered. This setting can be adjusted to be more lenient, if desired.
The file filtered-tss-windows.bed
should now contain all TSS windows that do not entirely contain a gap greater than or equal to 500 bases in length.
The input file is a gzipped matrix with a json header, I don't think bedops will work for that.
I'm just explaining one way to do things, if your inputs can be set up correctly. You may need to convert files to sorted BED with
bigWigToWig
,wig2bed
,jq
,sort-bed
etc. to do set operations.Thanks very much guys!