To demonstrate the annotation process I will describe, we need some genes. Let's assume you're working with human data and that you visit the GENCODE site to grab GENCODE records in GTF format.
You can use BEDOPS gtf2bed
to convert this to a sorted UCSC BED-formatted file containing GENCODE genes. For example, here is how to get GENCODE v19 gene annotations for genome build hg19
:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
| gunzip -c - \
| gtf2bed - \
| grep "\tgene\t" \
> gencode.v19.annotation.gtf.genes.bed
Now that you have both data sets in BED format, you can now map your BED file to the GENCODE annotations with BEDOPS bedmap
, which returns genes that overlap your elements by one or more bases, e.g.:
$ bedmap --echo --echo-map-id-uniq elements.bed gencode.v19.annotation.gtf.genes.bed > answer.bed
The file answer.bed
contains the annotation results. This file is a sorted BED file that contains lines of the format:
[ transcript1 ] | [ gene1-overlapping-transcript1 ] ; ... ; [ geneA-overlapping-transcript1 ]
[ transcript2 ] | [ gene1-overlapping-transcript2 ] ; ... ; [ geneB-overlapping-transcript2 ]
...
[ transcriptN ] | [ gene1-overlapping-transcriptN ] ; ... ; [ geneZ-overlapping-transcriptN ]
The pipe character (|
) delimits BED-formatted transcript elements from the IDs or names of overlapping genes. The semi-colon (;
) delimits names of overlapping genes for a given transcript.
You can replace these delimiters if you need to with the --delim
and --multidelim
operators, but basically this lets you easily post-process the results with Perl, Python, awk
— whatever scripting or other language you prefer.
If you want the entire gene record and not just the gene name, use --echo-map
in place of --echo-map-id-uniq
. Other --echo-map-*
operators are available to recover different attributes of the mapped element.
The default criteria for overlap between transcript and mapped gene is one or more bases. If you want this to be more stringent, review the overlap criteria section of the BEDOPS bedmap
documentation.
Try annotatePeaks.pl from the HOMER suite.
annotatePeaks.pl $BED hg19 > $OUTPUT
The output off:
is:
Thank you :).
Do you have a question about this output?
It doesn't look like you installed HOMER or some of its dependencies appropriately. Go back to the HOMER installation page and follow it carefully.
I will reread the instructions for installation on ubuntu 14.04, 64-bit. Thank you :).
Read the output:
It can't find these scripts. They are probably in the same /bin directory as annotatePeaks.pl. If so, you need to add this directory to your $PATH.