One way to do this is with the BEDOPS bedmap
application and the --echo-map
operator.
First, you'd need to prepare your BED-like data (let's say it is a file called myData.bed
) to make it work with the chromosome naming scheme that Ensembl uses for its annotations:
$ sed -e 's/^Chr//' myData.bed > myFixedData.bed
This turns ChrN
record names into N
, which seems to follow Ensembl's naming convention. (You may want to use head
or similar on Ensembl records to see how chromosomes are named, and change this sed
statement accordingly.)
Secondly, grab your Ensembl annotations and turn them into BED files with the BEDOPS gtf2bed
conversion script. Let's say you're working with human data and want human annotations. The following command grabs Ensembl r71 annotations and turns them into a sorted BED file, ready for use with all BEDOPS tools:
$ wget -O - ftp://ftp.ensembl.org//pub/mnt2/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz \
| gunzip -c - \
| gtf2bed - \
> Homo_sapiens.GRCh37.71.bed
Finally, we're ready to map these Ensembl annotations to our regions of interest (myFixedData.bed
):
$ bedmap --echo --echo-map myFixedData.bed Homo_sapiens.GRCh37.71.bed > myAnnotatedData.bed
The file myAnnotatedData.bed
contains data in the following format:
[ myRegion-1 ] | [ annotation-1-1 ] ; ... ; [ annotation-1-i ]
...
[ myRegion-N ] | [ annotation-N-1 ] ; ... ; [ annotation-N-j ]
In other words, each region from 1..N
is printed, followed by a pipe character, and then a semi-colon-delimited list of human Ensembl annotations that overlap your region by one or more bases. Where there are no annotations, no data are printed for that region.
More information about --echo-map
and overlap options are in the bedmap
documentation.
If you just want certain fields from the annotations, you can post-process myAnnotatedData.bed
with a (for example) Perl, Python or awk
script, using the pipe, semi-colon and GTF delimiters to read through each annotation and pick out the fields of interest (e.g., transcript_id
, gene_id
, gene_biotype
etc.).
For example, you might try 'bedtools map' with the '-c' and '-o distinct' option.