How To Determine Overlaps From Coordinates
4
0
Entering edit mode
11.5 years ago
k.nirmalraman ★ 1.1k

Hi,

Pretty naive.. I am starting a data analysis with two files of coordinates.

  1. Read start and End Position (35 bp in length)
  2. Gene start (may be) and end Position

I should now filter out for reads that overlap in the gene region. Over lap condition is like as shown in figure enter image description here

http://www.freeimagehosting.net/eebt2

I understand that it is simply possible to run two for loops and sort the overlapping and non-overlapping reads.... But any efficient method to do this would be appreciated.

Any pre-built packages in R?

rna seq read • 9.3k views
ADD COMMENT
0
Entering edit mode

have you heard of bedtools?

ADD REPLY
0
Entering edit mode

Thanks for quick response... Looks like it is possible with bedtools.. will be using it for the first time! Any quick directions?

ADD REPLY
6
Entering edit mode
11.5 years ago

Check the usage of intersectBed example-usage. This what you want. Just make two files in bed format, one the genes and other the input file for intersection.

Report the base-pair overlap between sequence alignments and genes.
$ intersectBed -a reads.bed -b genes.bed
Report whether each alignment overlaps one or more genes. If not, the alignment is not reported.
$ intersectBed -a reads.bed -b genes.bed -u
Report those alignments that overlap NO genes. Like "grep -v"
$ intersectBed -a reads.bed -b genes.bed -v
Report the number of genes that each alignment overlaps.
$ intersectBed -a reads.bed -b genes.bed -c
Report the entire, original alignment entry for each overlap with a gene.
$ intersectBed -a reads.bed -b genes.bed -wa
Report the entire, original gene entry for each overlap with a gene.
$ intersectBed -a reads.bed -b genes.bed -wb
Report the entire, original alignment and gene entries for each overlap.
$ intersectBed -a reads.bed -b genes.bed –wa -wb
Only report an overlap with a repeat if it spans at least 50% of the exon.
$ intersectBed -a exons.bed -b repeatMasker.bed –f 0.50
Only report an overlap if comprises 50% of the structural variant and 50% of the segmental duplication. Thus, it is reciprocally at least a 50% overlap.
$ intersectBed -a SV.bed -b segmentalDups.bed –f 0.50 -r
Read BED A from stdin. For example, find genes that overlap LINEs but not SINEs.
$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed -v
Retain only single-end BAM alignments that overlap exons.
$ intersectBed -abam reads.bam -b exons.bed > reads.touchingExons.bam
Retain only single-end BAM alignments that do not overlap simple sequence repeats.
$ intersectBed -abam reads.bam -b SSRs.bed -v > reads.noSSRs.bam
ADD COMMENT
4
Entering edit mode
11.5 years ago

There are many tools available.

For [R]:

IRANGES:

http://www.bioconductor.org/packages/release/bioc/html/IRanges.html

ADD COMMENT
4
Entering edit mode
11.5 years ago

If you can put your data into UCSC BED format, the bedops tool in the BEDOPS suite will find overlaps between multiple (two or more) BED files.

There are a number of set operations available, but the --not-element-of operator may be most useful to you. For your application, for example, you might do the following:

$ bedops --not-element-of -1 Reads.bed Genes.bed

This reports all elements of Reads.bed which do not overlap ranges in Genes.bed by one or more bases. You can specify custom overlap criteria in either bases or percentage of length.

A couple advantages of bedops are that it supports multiple BED inputs and standard input. If you have separate files for genes, then you could do the following without an intermediate union:

$ bedops --not-element-of -1 Reads.bed UCSCGenes.bed GENCODEGenes.bed RefSeqGenes.bed ...

In the case of standard input support, you could very easily drop this into the middle of an extended processing pipeline and gain performance benefits from not having to generate intermediate data.

For example, a simple pipeline like this:

$ readGenerator foo bar baz | bedops --not-element-of -1 - Genes.bed > Answer.bed

is generally going to run somewhat faster than:

$ readGenerator foo bar baz > TempReads.bed 
$ bedops --not-element-of -1 TempReads.bed Genes.bed > Answer.bed
$ rm TempReads.bed
ADD COMMENT
1
Entering edit mode
11.5 years ago
Sandeep ▴ 260

If command line options seem complicated, you could try using Galaxy web server.

There is an option to join and also to check the coverage of genomic intervals.

ADD COMMENT

Login before adding your answer.

Traffic: 2583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6