In the case of your two snippets, NAN
s will be returned because there are no overlaps between elements in the reference and map sets.
More generally, however, the --mean
operator is simply applied on the score data in map elements, without consideration for the length of the overlapping elements.
Consider the following ad-hoc example reference.bed
:
chr1 100 200
chr1 200 300
chr1 300 400
And the example map.bed
with two elements, one which is 15 bases long, and another 30 bases:
chr1 120 135 . 100
chr1 150 180 . 50
We can use --mean
with --echo-map-score
to clearly show how the statistic is calculated:
$ bedmap --echo --mean --echo-map-score ref.bed map.bed
chr1 100 200|75.000000|100.000000;50.000000
chr1 200 300|NAN|
chr1 300 400|NAN|
There are two scores reported back over the reference element chr1:100-200
, which are 100
and 50
, referring back to the elements in the map set. The average ("arithmetic mean") of those two scores is 75
. The lengths of overlaps are not used to calculate the statistic.
The other two reference elements report a mean of NAN
, as there are no elements in map.bed
that overlap those reference elements. You can use --skip-unmapped
to limit the result to only include reporting statistics where there are overlaps:
$ bedmap --skip-unmapped --echo --mean ref.bed map.bed
chr1 100 200|75.000000
If you need to take overlap length into account, one approach is to use the --wmean
option to apply a weighted arithmetic mean ("weighted mean").
Weights are derived from the extent of overlap of a map element with the reference element, relative to the overlaps of other map elements.
Let's go back to the demo, adding the --wmean
operator:
$ bedmap --echo --mean --wmean --echo-map-score ref.bed map.bed
chr1 100 200|75.000000|66.666667|100.000000;50.000000
chr1 200 300|NAN|NAN|
chr1 300 400|NAN|NAN|
The first reference element returns a weighted mean of 66.667
. The other two reference elements return unweighted and weighted arithmetic means of NAN
s (as expected).
If we go back to the lengths of overlaps between map and reference elements, we have a 15-base overlap with a score of 100, and a 30-base overlap with a score of 50.
The 30-base overlap is twice as long as the 15-base overlap, so the 30-base score is given twice the "weight" of the 15-base score:
(100 + 2*50) / (1 + 2*(1)) = (100 + 50 + 50) / 3 = 66.667
See --help
for a brief description of this option, as well as other statistics (--median
, --stdev
, --min
, --max
, etc.).
The correct statistic to use will, of course, depend on your input sets, the signal you are measuring, and whether you need to account for the length of overlaps between reference and map sets.
Another option is to chop up the map and reference sets into bins of equal size, and to interpolate or "zero" signal in the map set, where there are gaps. Tools like bedops --chop
and awk
can help here. The goal of using bins of equal size, regardless of the original element lengths, would be to have the mean signal (or whatever statistic) over all reference elements be calculated from vectors of equal lengths.
These and other normalization approaches can be useful for making sure that a calculated statistic doesn't unnecessarily derive from bias to one or more subsets, but this depends what you're trying to measure, ultimately.
Thanks a lot for such detailed explanation :).