Question

How To Determine Overlaps From Coordinates

0

Entering edit mode

11.5 years ago

k.nirmalraman ★ 1.1k

Hi,

Pretty naive.. I am starting a data analysis with two files of coordinates.

Read start and End Position (35 bp in length)
Gene start (may be) and end Position

I should now filter out for reads that overlap in the gene region. Over lap condition is like as shown in figure enter image description here

http://www.freeimagehosting.net/eebt2

I understand that it is simply possible to run two for loops and sort the overlapping and non-overlapping reads.... But any efficient method to do this would be appreciated.

Any pre-built packages in R?

rna seq read • 9.3k views

ADD COMMENT • link updated 11.5 years ago by Sandeep ▴ 260 • written 11.5 years ago by k.nirmalraman ★ 1.1k

0

Entering edit mode

have you heard of bedtools?

ADD REPLY • link 11.5 years ago by brentp 24k

0

Entering edit mode

Thanks for quick response... Looks like it is possible with bedtools.. will be using it for the first time! Any quick directions?

ADD REPLY • link 11.5 years ago by k.nirmalraman ★ 1.1k

score 6 · Answer 1 · 2012-11-05

Check the usage of intersectBed example-usage. This what you want. Just make two files in bed format, one the genes and other the input file for intersection.

Report the base-pair overlap between sequence alignments and genes.
$ intersectBed -a reads.bed -b genes.bed
Report whether each alignment overlaps one or more genes. If not, the alignment is not reported.
$ intersectBed -a reads.bed -b genes.bed -u
Report those alignments that overlap NO genes. Like "grep -v"
$ intersectBed -a reads.bed -b genes.bed -v
Report the number of genes that each alignment overlaps.
$ intersectBed -a reads.bed -b genes.bed -c
Report the entire, original alignment entry for each overlap with a gene.
$ intersectBed -a reads.bed -b genes.bed -wa
Report the entire, original gene entry for each overlap with a gene.
$ intersectBed -a reads.bed -b genes.bed -wb
Report the entire, original alignment and gene entries for each overlap.
$ intersectBed -a reads.bed -b genes.bed –wa -wb
Only report an overlap with a repeat if it spans at least 50% of the exon.
$ intersectBed -a exons.bed -b repeatMasker.bed –f 0.50
Only report an overlap if comprises 50% of the structural variant and 50% of the segmental duplication. Thus, it is reciprocally at least a 50% overlap.
$ intersectBed -a SV.bed -b segmentalDups.bed –f 0.50 -r
Read BED A from stdin. For example, find genes that overlap LINEs but not SINEs.
$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed -v
Retain only single-end BAM alignments that overlap exons.
$ intersectBed -abam reads.bam -b exons.bed > reads.touchingExons.bam
Retain only single-end BAM alignments that do not overlap simple sequence repeats.
$ intersectBed -abam reads.bam -b SSRs.bed -v > reads.noSSRs.bam

score 4 · Answer 2 · 2012-11-05

4

Entering edit mode

11.5 years ago

Zev.Kronenberg 12k

There are many tools available.

For [R]:

IRANGES:

http://www.bioconductor.org/packages/release/bioc/html/IRanges.html

ADD COMMENT • link 11.5 years ago by Zev.Kronenberg 12k

score 4 · Answer 3 · 2012-11-05

If you can put your data into UCSC BED format, the bedops tool in the BEDOPS suite will find overlaps between multiple (two or more) BED files.

There are a number of set operations available, but the --not-element-of operator may be most useful to you. For your application, for example, you might do the following:

$ bedops --not-element-of -1 Reads.bed Genes.bed

This reports all elements of Reads.bed which do not overlap ranges in Genes.bed by one or more bases. You can specify custom overlap criteria in either bases or percentage of length.

A couple advantages of bedops are that it supports multiple BED inputs and standard input. If you have separate files for genes, then you could do the following without an intermediate union:

$ bedops --not-element-of -1 Reads.bed UCSCGenes.bed GENCODEGenes.bed RefSeqGenes.bed ...

In the case of standard input support, you could very easily drop this into the middle of an extended processing pipeline and gain performance benefits from not having to generate intermediate data.

For example, a simple pipeline like this:

$ readGenerator foo bar baz | bedops --not-element-of -1 - Genes.bed > Answer.bed

is generally going to run somewhat faster than:

$ readGenerator foo bar baz > TempReads.bed 
$ bedops --not-element-of -1 TempReads.bed Genes.bed > Answer.bed
$ rm TempReads.bed

score 1 · Answer 4 · 2012-11-07

1

Entering edit mode

11.5 years ago

Sandeep ▴ 260

If command line options seem complicated, you could try using Galaxy web server.

There is an option to join and also to check the coverage of genomic intervals.

ADD COMMENT • link 11.5 years ago by Sandeep ▴ 260