Another way to do this is to use the BEDOPS bedmap
application to map genes to the domain defined outside your input regions.
First, pre-process your elements to define the bounds of flanking regions. For example, you might define flanking regions as +/- 1kb outside your input regions. Say you start with a BED file called myRegions.bed
:
$ more myRegions.bed
chr1 3044715 3044756
chr1 3095270 3095341
chr1 3100822 3100894
chr1 3167949 3168020
chr1 3205652 3205723
chr1 3684683 3684752
You could use a simple awk
script to generate your new flanking regions:
$ awk ' \
BEGIN { \
pad=1000;
} \
{ \
chr=$1; \
start=$2; \
stop=$3; \
print $1"\t"(start - pad)"\t"start"\n"$1"\t"stop"\t"(stop + pad); \
}' myRegions.bed \
| sort-bed - \
> myFlanks.bed
Note that we pass the output from awk
to BEDOPS sort-bed
, to ensure the output file (myFlanks.bed
) is a sorted BED file. Sorting prepares the output for use with BEDOPS tools and allows them to operate efficiently, compared with alternatives.
Now grab some genes and turn them into a BED file. We'll take Ensembl gene annotations for mm10
:
$ wget -O - ftp://ftp.ensembl.org/pub/mnt2/release-71/gtf/mus_musculus/Mus_musculus.GRCm38.71.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene \
> mm10.genes.bed
We have two sorted, BED-formatted inputs: our +/- 1kb flanking regions and our gene annotations. We're now ready to use bedmap
to do mapping:
$ bedmap --echo --echo-map-range myFlanks.bed mm10.genes.bed > answer.bed
The file answer.bed
is a sorted BED file that contains your flanking regions and a semi-colon-delimited list of gene ranges which overlap the flanking region:
[ flank-1 ] | [ gene-1-1 ] ; ... ; [ gene-1-i ]
[ flank-2 ] | [ gene-2-1 ] ; ... ; [ gene-2-j ]
...
[ flank-N ] | [ gene-N-1 ] ; ... ; [ gene-N-k ]
Where no genes overlap a flanking region, nothing is printed. The default overlap parameter is one or more bases. You can make this setting more stringent by adjusting overlap criteria.
If you want the entire gene record — and not just the range — use --echo-map
in place of --echo-map-range
. Other mapping echo
options are available, depending on what you want bedmap
to report.
NOTE: You might consider complications of double- or multiple-counting of genes, where flanking regions overlap. In that case, you could use a --merge
set operation with BEDOPS bedops
to merge overlapping flanking regions, before running bedmap
. This would ensure that you only count or label genes once:
$ awk ' \
BEGIN { \
pad=1000;
} \
{ \
chr=$1; \
start=$2; \
stop=$3; \
print $1"\t"(start - pad)"\t"start"\n"$1"\t"stop"\t"(stop + pad); \
}' myRegions.bed \
| sort-bed - \
| bedops --merge - \
> myMergedFlanks.bed
A second approach is to use BEDOPS closest-features
, which would search for and report the nearest gene to the input region. Let's first sort your regions:
$ sort-bed myRegions.bed > mySortedRegions.bed
Grab the Ensembl mm10
records as described above.
Now we apply closest-features
on the sorted regions, to look for the closest Ensembl annotations:
$ closest-features --closest mySortedRegions.bed mm10.genes.bed > answer.bed
The file answer.bed
will contain the region element and the nearest gene to that element, either upstream or downstream, with ties given to the upstream element:
[ region-1 ] | [ nearest-gene-1 ]
[ region-2 ] | [ nearest-gene-2 ]
...
[ region-N ] | [ nearest-gene-N ]
If you add the --delim '\t'
option to closest-features
, you'll get back a sorted BED file. This is useful in that standard output from BEDOPS tools can be piped to other downstream tools that can accept BED-formatted standard input.
All BEDOPS tools are designed to operate in larger pipelines in this manner, which helps reduce file I/O and provide added performance benefits, in addition to speed and memory enhancements in the BEDOPS toolkit.
Please clarify your question.