BEDOPS bedmap
may be useful.
Take a look at bedmap --median
, which calculates median signal from elements that map over reference windows.
The documentation here explains in more detail: http://bedops.readthedocs.org/en/latest/content/reference/statistics/bedmap.html
Here's your original data (say it is called expression.txt
):
Gene rpkm chr start stop
AT1G01010 4.18954 Chr1 3631 5899
AT1G01020 10.22902 Chr1 5928 8737
AT1G01030 1.99064 Chr1 11649 13714
AT1G01040 34.81453 Chr1 23146 31227
AT1G01050 56.88482 Chr1 31170 33153
AT1G01060 1.24325 Chr1 33379 37871
You could turn it into a sorted BED file like so:
$ tail +2 expression.txt \
| awk '{print tolower($3)"\t"$4"\t"$5"\t"$1"\t"$2}' - \
| sort-bed - \
> expression.bed
The file expression.bed
is your signal or map file in a bedmap
operation. It contains the score signal in the fifth column, per UCSC specification.
Your sliding windows are then put into a second BED file, which acts as the reference file in a bedmap
operation.
Given the bounds of chromosomes, you can easily make sliding windows across each chromosome with bedops
.
Say you are working with genome build hg19
, which has the following extents:
You can generate non-overlapping, 20 kbase sliding windows with the --chop
option against the extents file, like so:
$ bedops --chop 20000 hg19.extents.bed > sliding_windows.bed
The file sliding_windows.bed
contains all the sliding windows going across each chromosome in hg19
, in sorted BED format.
Now that you have the reference and map files, you can do the mapping operation with bedmap
:
$ bedmap --echo --median sliding_windows.bed expression.bed > answer.bed
The file answer.bed
contains each sliding window region (via the --echo
operator), and the median expression signal over that window (via the --median
operator).
Since it makes no meaningful statistical sense to calculate a median over a sample size of 0, consider adding the --skip-unmapped
operator to leave out windows over which there is no expression signal:
$ bedmap --echo --median --skip-unmapped sliding_windows.bed expression.bed > answer.bed
The revised file answer.bed
contains each sliding window region (via the --echo
operator), and the median expression signal over that window (via the --median
operator), only where there is any overlapping signal across the window (via the --skip-unmapped
operator).
Looks to work well, but for the end file it doesn't seem to be able to calculate the median of the rpkm for each window (I got a median of 0 for each window) after following your steps. I think it has something to do with the intersect line, as the result from that command looks like this:
If a window does not intersect any gene, as in the lines in your comment, then the median is zero as it should be. Make sure chromosome names are the same in your input files. Here you have chr1, in your original post you have Chr1.