You can use the bedmap
tool in the BEDOPS suite to quickly and efficiently generate an answer, if your reference and annotation inputs are both sorted BED files.
(Sorting is defined by using the sort-bed
or bbms
utilities, which apply a lexicographical sort on chromosome and coordinate. Sorting allows BEDOPS tools to operate in constant memory and linear time, which is very useful for whole-genome inputs. The sort-bed
utility works on input of sizes up to system memory, which can be a limitation for whole-genome work. The bbms
tool therefore applies a lexicographical sort to whole-genome scale data on what would otherwise be a limitation on personal workstations, while use of GNU sort
in this way may require some extra tricks. BEDOPS v2 will merge the functionality of sort-bed
and bbms
into sort-bed
in a more transparent manner.)
The bedmap
tool allows you to collect annotations, scores, positional or full element data from "map" elements in one UCSC-standard BED file, which overlap an element in a "reference" BED file.
To demonstrate, let's say your annotations are formatted as a BED file called annotations.bed
, like so:
chr1 1000 1500 annotation-1
chr1 4000 4300 annotation-2
chr1 6200 6450 annotation-3
chr1 9120 9675 annotation-4
Here, we are following UCSC's standard definition of a BED file, with the first column representing the chromosome name, the second and third columns the genomic coordinates, and the fourth column the ID or name of the annotation.
Let's say you want to do an ad-hoc search through this file, to find annotations on the chromosome chr1
which are contained within the half-open range [2000, 7000)
. You could use bedmap
as follows:
$ echo -e "chr1\t2000\t7000\treference-1" | bedmap --echo --echo-map-id - annotations.bed
chr1 2000 7000 reference-1|annotation-2;annotation-3
The result indicates that annotation-2
and annotation-3
are contained within the BED coordinate range [2000, 7000)
on chromosome chr1
.
The echo -e
command passes bedmap
that range via standard input, which bedmap
consumes with the -
hyphen character. The bedmap
tool then maps annotations to that input range and shows you these results.
Results are a semi-colon-delimited list of annotations for that range, taken from the ID (or "name") column of annotations.bed
. This is because we used the --echo-map-id
operator, which outputs mapped IDs.
There are four --echo-map-*
operators that offer the entire mapped annotation (similar to bedTools), or various pieces of it (ID, genomic range, scores) for parsing convenience. Please refer to the documentation for more detail.
Instead of a trivial example that uses echo
and standard input, you can instead specify a BED file containing multiple ranges, each on their own line. For example, let's say we have a file called rangesOfInterest.bed
that contains two ranges (you can specify as many ranges as you like, each on their own line):
chr1 2000 7000 reference-1
chr1 5000 10000 reference-2
(This example only has four columns — chromosome name, start and stop coordinates, and the reference ID/name — but you can have additional columns, like score and strand data, and the operations will work identically as described.)
You can then use bedmap
as follows to get results containing the annotations of interest:
$ bedmap --echo --echo-map-id rangesOfInterest.bed annotations.bed
chr1 2000 7000 reference-1|annotation-2;annotation-3
chr1 5000 10000 reference-2|annotation-3;annotation-4
Further, if you want BED-formatted output, use the --delim
operator in conjunction with --echo
, as follows:
$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
This replaces the pipe character (|
) with a tab (\t
), so that the output is a minimal BED-formatted result:
$ bedmap --echo --echo-map-id --delim '\t' rangesOfInterest.bed annotations.bed
chr1 2000 7000 reference-1 annotation-2;annotation-3
chr1 5000 10000 reference-2 annotation-3;annotation-4
This trick is very useful if you want to pipe the results to another BEDOPS command or another utility that consumes tab-delimited data, like awk
, cut
, Perl or Python scripts, etc.
For example, if your output needs to be something like: "Chr, Start, End, Name and Strand" along with annotations in a sixth column, then you would pipe UCSC-standard BED data into cut
to strip the score column from your UCSC-standard BED input:
$ bedmap ... | cut -f1-4,6,7 - > answer.bed
Note that your output is no longer a standard BED file, but BEDOPS tools that operate on coordinates will work with these data.
If you do not have UCSC-standard BED reference input and instead need to rearrange columns, simply pipe results into awk
to print out your columns in a different order, e.g.:
$ bedmap ... | awk '{print $1"\t"$2"\t"$3"\t"$5"\t"$6"\t"$7}' - > answer.bed
Your particular awk
statement will depend on which columns you decide to extract and print.
(BEDOPS tools are modular and written to accept standard input (such as what was done above with echo -e
) and submit results to standard output, so it is trivially easy to perform further set or mapping operations with bedops
and bedmap
, respectively, or compress the output with starch
, or pipe to bedTools utilities, etc. in an extended pipeline. Piping standard output from one program to standard input of another program reduces overall I/O load on a computer and can improve the speed of calculations, in addition to the performance enhancements already in BEDOPS.)
Thanks. I posted an answer using bedtools.
Could you provide a specific example (with example data) of what you are trying to do?
Now, I have added the example....