Question

find overlapping sequences and report min start and max end positions

6.5 years ago

onkar ▴ 10

I have a FASTA/BED file in which some sequences overlap with each other.

For example:

1   57267   59067
1   63165   63758
1   63298   64137
1   67285   67596

Here you can see that "1 63298 64137" overlaps with "1 63165 63758". I want to obtain a file in which each group of overlapping regions is replaced by a single region with the minimum start position and the maximum end position of the group. Expected output:

1   57267   59067
1   63165   64137
1   67285   67596

This is just an example from that file. At some locations, 4-5 regions overlap.

Kindly help

RNA-Seq R next-gen sequence • 1.5k views

updated 6.5 years ago by GenoMax 146k • written 6.5 years ago by onkar ▴ 10

score 3 · Answer 1 · 2018-04-20

6.5 years ago

Nicolas Rosewick 11k

In R, use GenomicRanges and its reduce() function:

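library(GenomicRanges)  # Bioconductor package providing GRanges() and reduce()
# read the three-column BED file (chromosome, start, end)
a <- read.table("file.bed", sep="\t", as.is=TRUE, header=FALSE)
colnames(a) <- c("chr","start","end")
a.gr <- GRanges(a$chr, IRanges(a$start, a$end))
# use reduce() to merge overlapping ranges
res <- reduce(a.gr)
# write.table() takes the data first, then the output file name
write.table(as.data.frame(res), "out.bed", col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")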

Here is the result (the last two columns are the range width and the strand):

   1 57267 59067  1801      *
   1 63165 64137   973      *
   1 67285 67596   312      *

6.0 years ago by Nicolas Rosewick 11k

Thank you, Nicolas, for this suggestion. It worked.

I found one more solution, using bedtools, which was quite helpful:

bedtools merge -i input.bed -c 4 -o collapse > merged_unique.bed

Hope this will help others too.
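A note for anyone trying this on the three-column example from the question: bedtools merge expects its input sorted by chromosome and then by start, and `-c 4 -o collapse` operates on a fourth column, so it assumes the file has one. For a plain three-column file, `bedtools merge -i` on the sorted file should be enough. If bedtools is not available, the same idea can be sketched with sort and awk. This is only a minimal sketch under some assumptions: the file names are hypothetical, the input is tab-separated, and regions that merely touch are merged as well.

```shell
# Hypothetical file names; the sample data is the example from the question.
printf '1\t57267\t59067\n1\t63165\t63758\n1\t63298\t64137\n1\t67285\t67596\n' > input.bed

# Sort by chromosome, then numerically by start, so overlapping regions are adjacent.
sort -k1,1 -k2,2n input.bed > sorted.bed

# Sweep over the sorted regions: extend the current region while the next one
# starts at or before its end on the same chromosome; otherwise print it and
# start a new current region.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { chr = $1; cur_start = $2 + 0; cur_end = $3 + 0; next }
     $1 == chr && $2 + 0 <= cur_end { if ($3 + 0 > cur_end) cur_end = $3 + 0; next }
     { print chr, cur_start, cur_end; chr = $1; cur_start = $2 + 0; cur_end = $3 + 0 }
     END { if (NR > 0) print chr, cur_start, cur_end }' sorted.bed > merged.bed

cat merged.bed
```

On the example above this gives the three regions listed as the expected output in the question. Because the current region keeps growing as long as the next start falls inside it, chains of 4-5 overlapping regions collapse into one as well.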

6.5 years ago by onkar ▴ 10

As you tagged the question with R, I assumed the solution should be in R. But using bedtools is also good.

6.5 years ago by Nicolas Rosewick 11k