Question

BEDOPS bedmap behavior

0

Entering edit mode

6.1 years ago

Vitis ★ 2.6k

I'm trying to run the "bedmap --mean" function to generate mean coverage in genomic windows.

My "ref.bed" looks like this (10bp disjoined windows):

region1 1   11
region1 11  21
region1 21  31
region1 31  41
region1 41  51
region1 51  61
region1 61  71
region1 71  81
region1 81  91
region1 91  101

My "map.bed" looks like this (bp-level sequencing coverage score, the first few regions have zero coverage but non-zero later):

region1 1   2   0
region1 2   3   0
region1 3   4   0
region1 4   5   0
region1 5   6   0
region1 6   7   0
region1 7   8   0
region1 8   9   0
region1 9   10  0
region1 10  11  0

The "bedmap --mean" is supposed to calculate the mean coverage in those 10bp windows. Instead, I only got one value for the first window and the rest reported "NAN":

region1 1   11  0.000000
region1 11  21  NAN
region1 21  31  NAN
region1 31  41  NAN
region1 41  51  NAN
region1 51  61  NAN
region1 61  71  NAN
region1 71  81  NAN
region1 81  91  NAN

This is the command I ran:

bedmap --faster --mean --delim "\t" --echo ref.bed map.bed

Both bed files are sorted by "sort -k1,1 -k2n,2". I also tried using 0-based system for the bed files but that didn't work, either.

I'm pretty new to bedops so could any bedops experts help me a bit here?

genome • 2.2k views

ADD COMMENT • link updated 6.1 years ago by Alex Reynolds 36k • written 6.1 years ago by Vitis ★ 2.6k

score 3 · Accepted Answer · 2019-03-30

(I'll start by assuming that signal in your map file is really in the fifth column, per UCSC specification. You might need to add a dummy ID in the fourth column, per spec, to get any map operation described here working. You could use awk to insert that field.)

You get NAN values here, because your (five-column) map file has no signal over regions after position 11 on the chromosome/contig region1, but you have reference regions past this point, for which bedmap is trying to calculate a mean signal. This is a correct and expected result.

You have a few options, depending on what you want to do with the output:

If you only want reference regions in the output where there was signal that can map to those regions, add --skip-unmapped to the bedmap statement.

Another option is to fill in the map file with regions with dummy values where you have no signal, such as zeros, if it is meaningful to do so. Mapping on this modified map file will then be able to return a signal operation result (mean, etc.).

A third option is to post-process the result from bedmap, piping to sed to substitute NAN with zero or another placeholder:

$ bedmap ... | sed 's/NAN/0.0/' > answer.bed

The only other suggestion I have is to use sort-bed (in the same toolkit as bedmap) in place of GNU sort, as it is faster at sorting whole-genome scale files.

A future version on bedmap will include an option for substituting NANs with other numbers, but I don't know when that will be available. Piping to sed is easy enough to do and doesn't really affect performance at all.