If a lot of your data are in UCSC or three-column BED format, the BEDOPS suite contains tools that will help you with your work. These tools are quick and powerful, offering a few useful features:
- support for standard input and output streams (you can pass in the results from one analysis to another, using standard UNIX pipes)
- support for multiple BED inputs for set operations
- low memory overhead
As examples, we can go through some of your to-do list below:
To find regions (Regions.bed
) which overlap DNase hypersensitive sites (DHSes.bed
) by one or more bases, the bedops
tool performs this particular set operation:
$ bedops --element-of -1 Regions.bed DHSes.bed
The --element-of
option supports other overlap criteria, and the bedops
tool supports a variety of operations that include set complement, difference, intersection, merging, multiset union and others. See --help
or the documentation for a fuller listing.
Using bedextract
is a very fast way to perform bedops --element-of -1 A B
calculations, if your set of DHS elements do not contain nested elements (see the documentation for their definition):
$ bedextract DHSes.bed Regions.bed
(This tool uses the information in a sorted BED file to locate elements in logarithmic time, instead of linearly walking through the input. The speedup can be very significant, as a result. A lot of BED inputs can take advantage of the performance gain that bedextract
provides, so this is worth investigating.)
To find the closest genes (Genes.bed
) to a set of input regions (Regions.bed
), use closest-features
(documentation here):
$ closest-features Regions.bed Genes.bed
To get a delimited set of exons which overlap a set of regions, use bedmap
(documentation here):
$ bedmap --echo --echo-map Regions.bed Exons.bed
The bedmap
tool has options that allow refinement of overlap criteria, as well as reporting of other data that can be found in a typical UCSC BED file, such as score and ID information in the mapping input (in this example, the exons). This would be useful for data that contain scores (DHS density tracks, etc.) or IDs (exons, genes, SNPs, etc.).
Additionally, calculations from one tool can be piped into another BEDOPS tool, using standard UNIX pipes. UNIX pipes are a good way to "glue together" calculations, which minimizes time spent on scripting and script maintenance. Piping can also reduce file I/O, which can improve overall performance considerably.
As a further example, let's say you are interested in the set of nearest genes to BED elements that overlap DHSes. To get this answer, we can pipe the results of a bedops
operation into a closest-features
search:
$ bedops --element-of -1 Regions.bed DHSes.bed | closest-features - Genes.bed
(BEDOPS shares the GNU convention of using the hyphen to denote standard input.)
The bedops
tool supports multiple inputs. So if you have several sets of DHSes in separate BED files, you can find overlaps between input regions and all of them in one go:
$ bedops --element-of -1 Regions.bed DHSesFirstSet.bed DHSesSecondSet.bed ... DHSesLastSet.bed
The BEDOPS toolkit aims to have a low memory overhead and to perform quickly, taking advantage of the structure of lexicographically-sorted BED streams. The sort-bed utility will do this, if inputs are unsorted:
$ sort-bed Unsorted.bed > Sorted.bed
Other BEDOPS tools (bedops
, bedextract
, etc.) import and export sorted data, so sorting is only needed once at the beginning of analysis, if the inputs are unsorted or are of uncertain order. Some tools also include error checking options (--ec
), to help validate BED input.
Thanks for the detailed how-to. I will definitely give this a shot.