Yes, this is what bedmap is for, mapping or relating one set of BED elements to another:
$ bedmap --echo --echo-map-id --delim '\t' genes.bed windows.bed > answer.bed
The result:
$ more answer.bed
chr1 1400 1500 geneA sample1;sample2
chr1 1900 2000 geneB sample1;sample2
To explain the bedmap statement, the --echo
option prints each gene from genes.bed
. The --echo-map-id
option prints the ID value of any mapped (overlapping) elements from windows.bed
. The --delim
option separates the gene and ID result with a tab character.
If you need the list of window IDs on separate lines, you can pipe the output to awk
or other scripting tools to do this. Here is an untested script to get it into your desired format:
$ bedmap --echo --echo-map-id --delim '\t' genes.bed windows.bed \
| awk '{ \
n_ids = split($5, ids, ";"); \
for (idx = 1; idx <= n_ids; idx++) { \
print $1"\t"$2"\t"$3"\t"$4"\t"ids[idx]; \
} \
}' - \
| sort-bed - \
> split_answer.bed
The result:
$ more answer.bed
chr1 1400 1500 geneA sample1
chr1 1400 1500 geneA sample2
chr1 1900 2000 geneB sample1
chr1 1900 2000 geneB sample2
Note the use of sort-bed at the end of the pipeline, to ensure the output is a sorted BED file that you can use with other BEDOPS tools. If you don't care about the sort order of the output, you can remove the sort-bed call.
Piping avoids making intermediate files, which is time-consuming, and so it is a good way to deal with very big BED files. You might first test things out with a small subset of your input files to make sure the pipeline works, then run your full input files through the working pipeline.
thanks that's perfect !