The BEDOPS suite includes the closest-features
tool, which finds the nearest feature ("gene" from a set of "genes") from a reference input (individual "region"). Overlaps are allowed. Both inputs are UCSC BED-formatted and lexicographically sorted.
As well as file input, the closest-features
command accepts UNIX standard input. In other words, you can keep passing results off to the next search. I'll explain below how this makes it easy to find N closest genes.
First, the command:
$ closest-features Regions.bed Genes.bed
will return the closest two genes in Genes.bed
for each element in Regions.bed
. The format of this output is, generally (see --help
for options):
[ region element #1 ] | [ upstream element ] | [ downstream element ]
[ region element #2 ] | [ upstream element ] | [ downstream element ]
...
[ region element #n ] | [ upstream element ] | [ downstream element ]
Note the use of the |
delimiter to split the elements. We can make use of this when passing results to standard UNIX utilities.
The following command yields the first closest gene downstream of the region element, using awk
to split the result by the |
delimiter and print out the third item (downstream element):
$ closest-features Regions.bed Genes.bed \
| awk -F'|' '{print $3}' -
For sake of demonstration, I'm sticking to downstream items. Note that we can complicate things here if we want an upstream element, by either printing the second item, or using the --dist
option with closest-features
. This option returns distance in bases, along with the nearest "hits" up- and downstream. By passing those distances to the awk
statement, a simple conditional on the distance value could print either the second item (upstream hit) or third item (downstream hit).
Note that the output from this command — the first nearest gene — is in BED format, because it is an element in Genes.bed
! This is really useful, because this means we can pass this back into closest-features
as standard input, and it will try to find the nearest hit to that result.
Because closest-features
takes in reference regions in standard input, these searches can be chained together on the command line. This BEDOPS utility follows the GNU convention of using the hyphen to denote standard input.
In other words, to search for the second closest gene upstream, we pass the results of the first search as a reference region for closest-features
to search, using a standard UNIX pipe:
$ closest-features Regions.bed Genes.bed \
| awk -F'|' '{print $3}' - \
| closest-features - Genes.bed \
| awk -F'|' '{print $3}' -
And we find the third closest feature, by passing that result to another search:
$ closest-features Regions.bed Genes.bed \
| awk -F'|' '{print $3}' - \
| closest-features - Genes.bed \
| awk -F'|' '{print $3}' -
| closest-features - Genes.bed \
| awk -F'|' '{print $3}' -
And so on and so on, piping N-1 times for N closest features (in this case, features that are downstream of the input region element).
The output of each Nth iteration of this command is a BED-formatted genomic coordinate that represents the Nth-closest element ("gene") to the region-of-interest.
Let's use these chained commands to build files containing the first through fifth closest features (from Genes.bed
) to elements in Regions.bed
:
$ closest-features Regions.bed Genes.bed | awk -F'|' {print $3} - > 1.bed
$ closest-features Regions.bed Genes.bed | awk -F'|' {print $3} - | closest-features - Genes.bed | awk -F'|' {print $3} - > 2.bed
$ closest-features ... > 3.bed
$ closest-features ... > 4.bed
$ closest-features ... > 5.bed
(You could write a script to construct and execute these chained commands, in order to save time.)
Once you have the individual results, it's a simple matter of concatenating the region element with each result. We can use a simple UNIX paste
command to do this, e.g.:
$ paste -d'|' Regions.bed 1.bed 2.bed 3.bed 4.bed 5.bed > Results.bed
Each line of Results.bed
should look something like:
[ region-1 ] | [ first hit ] | [ second hit ] | ... | [ fifth hit ]
[ region-2 ] | [ first hit ] | [ second hit ] | ... | [ fifth hit ]
...
[ region-n ] | [ first hit ] | [ second hit ] | ... | [ fifth hit ]
Note that the inputs to closest-features
should be lexicographically-sorted. BEDOPS provides a tool called sort-bed
which does this. You only need to sort once, if unsorted:
$ sort-bed Regions.bed > Regions.sorted.bed
$ sort-bed Genes.bed > Genes.sorted.bed
$ closest-features Regions.sorted.bed Genes.sorted.bed
...
Hi Alex,
I am in trouble with closest-features command of bedops. Could you please share a sample "Regions.bed" and "Genes.bed" files as reference? I have Chromosome, Start and End information and need to find out up-and-down stream genes. Thanks a lot for your input.
Bahar
Take a look at the bottom of this page for sample inputs and documentation: http://bedops.readthedocs.org/en/latest/content/reference/set-operations/closest-features.html
I've gone through the documentation but could not help!
I have a Genes list .bed file as follows:
and Regions .bed file as follows:
Command I use is:
and no result...
Can you please tell me where the error is?
Thanks a lot Alex.
I can try to replicate any problems here, but first I need a reproducible example. Can you post your files somewhere for me to download, with their filenames matching the command you are using, and please post the exact command you are using with these files?
Hi Alex,
I could fix the issue and get my result.
Thanks a lot...