Unless I'm missing something about the question, here are a couple possibilities:
Perhaps use a pseudochromosome name for all records:
$ bigWigToBedGraph signal.bw signal.bg
$ awk -vFS="\t" -vOFS="\t" '{ $5 = $4; $4 = $1; $1 = "chrPseudo"; print $0; }' signal.bg | sort-bed - > signal.pseudoChr.bed
$ echo -e "chrPseudo\t0\t248956422" > chrPseudo.extents.bed
$ bedmap --echo --kth 0.9999 chrPseudo.extents.bed signal.pseudoChr.bed > answer.bed
I'm assuming that the ChIP-seq input comes from hg38
, which is why I create an "extents" file that extends over the longest chromosome in the real hg38
assembly, i.e., chr1
. Whatever intervals come out of that awk
statement, they will never extend past the longest, real chromosome. If you're not using hg38
, you could change 248956422
to match the longest chromosome in your assembly.
That's a pretty convoluted approach.
Another option might be to just strip all the signal from signal.bg
into its own text file, and run that file through R.
$ bigWigToBedGraph signal.bw signal.bg
$ cut -f4 signal.bg > signal.txt
$ R
> d <- read.table("signal.txt")
> quantile(d$V1, c(0.9999))
...some answer...
Running ?quantile
in R should give more detail about that command.
Even though I'd love for everyone to use BEDOPS, using R or even Python numpy.quantile
would probably be faster than rejigging a bigWig into a resorted BED file with a fake chromosome name.
Hope this helps!
I just wanted to say that your first option worked beautifully! (I already had everything set up to use bedops so it was easy to adapt to). Thank you so much for responding to my question :)