Annotating Genomic Intervals
6
11
Entering edit mode
12.2 years ago
Stephen 2.8k

How can I annotate human genomic intervals (BED file) from a ChIP-seq experiment with information such as whether the interval overlaps with a gene(s)? Upstream of a gene? Overlaps with an exon? Intron? 5kb upstream/downstream of TSS? Intergenic? Does it overlap with a DNAse I hypersensitive site?

Surely bedtools can help me with this, but I'm looking for the best workflow / data sources to use for this that will require the least amount of scripting.

Thanks.

bedtools annotation r bioconductor • 15k views
ADD COMMENT
14
Entering edit mode
12.2 years ago

If a lot of your data are in UCSC or three-column BED format, the BEDOPS suite contains tools that will help you with your work. These tools are quick and powerful, offering a few useful features:

  • support for standard input and output streams (you can pass in the results from one analysis to another, using standard UNIX pipes)
  • support for multiple BED inputs for set operations
  • low memory overhead

As examples, we can go through some of your to-do list below:

To find regions (Regions.bed) which overlap DNase hypersensitive sites (DHSes.bed) by one or more bases, the bedops tool performs this particular set operation:

$ bedops --element-of -1 Regions.bed DHSes.bed

The --element-of option supports other overlap criteria, and the bedops tool supports a variety of operations that include set complement, difference, intersection, merging, multiset union and others. See --help or the documentation for a fuller listing.

Using bedextract is a very fast way to perform bedops --element-of -1 A B calculations, if your set of DHS elements do not contain nested elements (see the documentation for their definition):

$ bedextract DHSes.bed Regions.bed

(This tool uses the information in a sorted BED file to locate elements in logarithmic time, instead of linearly walking through the input. The speedup can be very significant, as a result. A lot of BED inputs can take advantage of the performance gain that bedextract provides, so this is worth investigating.)

To find the closest genes (Genes.bed) to a set of input regions (Regions.bed), use closest-features (documentation here):

$ closest-features Regions.bed Genes.bed

To get a delimited set of exons which overlap a set of regions, use bedmap (documentation here):

$ bedmap --echo --echo-map Regions.bed Exons.bed

The bedmap tool has options that allow refinement of overlap criteria, as well as reporting of other data that can be found in a typical UCSC BED file, such as score and ID information in the mapping input (in this example, the exons). This would be useful for data that contain scores (DHS density tracks, etc.) or IDs (exons, genes, SNPs, etc.).

Additionally, calculations from one tool can be piped into another BEDOPS tool, using standard UNIX pipes. UNIX pipes are a good way to "glue together" calculations, which minimizes time spent on scripting and script maintenance. Piping can also reduce file I/O, which can improve overall performance considerably.

As a further example, let's say you are interested in the set of nearest genes to BED elements that overlap DHSes. To get this answer, we can pipe the results of a bedops operation into a closest-features search:

$ bedops --element-of -1 Regions.bed DHSes.bed | closest-features - Genes.bed

(BEDOPS shares the GNU convention of using the hyphen to denote standard input.)

The bedops tool supports multiple inputs. So if you have several sets of DHSes in separate BED files, you can find overlaps between input regions and all of them in one go:

$ bedops --element-of -1 Regions.bed DHSesFirstSet.bed DHSesSecondSet.bed ... DHSesLastSet.bed

The BEDOPS toolkit aims to have a low memory overhead and to perform quickly, taking advantage of the structure of lexicographically-sorted BED streams. The sort-bed utility will do this, if inputs are unsorted:

$ sort-bed Unsorted.bed > Sorted.bed

Other BEDOPS tools (bedops, bedextract, etc.) import and export sorted data, so sorting is only needed once at the beginning of analysis, if the inputs are unsorted or are of uncertain order. Some tools also include error checking options (--ec), to help validate BED input.

ADD COMMENT
0
Entering edit mode

Thanks for the detailed how-to. I will definitely give this a shot.

ADD REPLY
7
Entering edit mode
12.2 years ago

If you're comfortable with R and Bioconductor, the ChIPpeakAnno package has some canned functionality to do this -- the annotatePeakInBatch function, in particular.

Check out Section 2: Examples of using ChIPPeakAnno, for an example.

ADD COMMENT
0
Entering edit mode

Thanks, very comfortable in R/BioC. Edited question to include R/BioC tags. Will take a look at this package. Thanks.

ADD REPLY
0
Entering edit mode

ChIPpeakAnno is not using strand information therefore introducing error. Use the more recent ChIPseeker R package which essentially does the same, or the perl program HOMER

ADD REPLY
2
Entering edit mode
12.2 years ago

Another option is Ensembl Region Report. API here.

ADD COMMENT
2
Entering edit mode
12.2 years ago

Bioconductor tools of interest include rtracklayer and GenomicFeatures to gather genomic feature information and GenomicRanges and IRanges for doing very fast overlap and range manipulation. In particular, you'll want to look at findOverlaps and range operations like reduce, for example. Combinations of these can be an awesome toolset for nearly any genomic range operations including filtering and modifying, all from the comfort of the R command line....

ADD COMMENT
1
Entering edit mode
12.2 years ago
vu.bush ▴ 10

I would use BedIntersect -- it's an awesome tool and can compute the overlap between two bed files. You can find it in the UCSC FTP directory

ADD COMMENT
1
Entering edit mode

Thanks. I couldn't seem to find BedIntersect. Is this different from the intersectBed function in the bedtools package? I'm aware of how to use intersectBed, but it's more so finding a consolidated data source and workflow to get at this annotation with the least amount of scripting (that has likely already been done by someone else). If BedIntersect has this kind of annotation data and functionality built-in, I'd love to know where to find it.

ADD REPLY
0
Entering edit mode
12.2 years ago

A Biodas server could be a way. I've just tested the Biodas server of the UCSC

$ curl -s "http://genome.ucsc.edu/cgi-bin/das/hg19/features?segment=1:100000,100100" | head -n 50


http://www.biodas.org/dtd/dasgff.dtd">
<DASGFF>
<GFF version="1.0" href="&lt;a href=" http:="" genome.ucsc.edu="" cgi-bin="" das="" hg19="" features"="" rel="nofollow">http://genome.ucsc.edu/cgi-bin/das/hg19/features">
<SEGMENT id="1" start="100000" stop="100100" version="1.00" label="1">
<FEATURE id="HIT000010175.chr1.73842.0" label="HIT000010175">
 <TYPE id="HInvGeneMrna" category="transcription" reference="no">HInvGeneMrna</TYPE>
 <METHOD id="id">BLAT</METHOD>
 <START>73843</START>
 <END>74474</END>
 <SCORE>977</SCORE>
 <ORIENTATION>+</ORIENTATION>
 <PHASE>-</PHASE>
 <GROUP id="HIT000010175.chr1.73842">

unfortunately , it is broken:

$ curl -s "http://genome.ucsc.edu/cgi-bin/das/hg19/features?segment=1:100000,100100" | xmllint --format -
-:402: parser error : Premature end of data in tag SEGMENT line 5

^
-:402: parser error : Premature end of data in tag GFF line 4

^
-:402: parser error : Premature end of data in tag DASGFF line 3

^

and it is poorly documented:

BIODAS : can i retrieve some metadata about a SEGMENT ?

but you can find some other DAS server elsewhere:

http://www.dasregistry.org/listSources.jsp

ADD COMMENT

Login before adding your answer.

Traffic: 1875 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6