find overlapping sequences and report min start and max end positions
6.6 years ago
onkar ▴ 10

I have a fasta/bed file in which some sequences overlap with each other.

For example:

1   57267   59067
1   63165   63758
1   63298   64137
1   67285   67596

Here you can see that "1 63298 64137" overlaps with "1 63165 63758". I want to produce one file in which each group of overlapping regions is replaced by a single region with the minimum start position and the maximum end position. Expected output:

1   57267   59067
1   63165   64137
1   67285   67596

This is just an example from that file. At some locations, 4-5 regions overlap.

Kindly help.

RNA-Seq R next-gen sequence • 1.6k views
6.6 years ago

In R, use GenomicRanges and its reduce function:

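library(GenomicRanges)
# read the three-column BED file (chromosome, start, end)
a <- read.table("file.bed", sep="\t", as.is=TRUE, header=FALSE)
colnames(a) <- c("chr","start","end")
a.gr <- GRanges(a$chr, IRanges(a$start, a$end))
# use reduce() to merge overlapping ranges
res <- reduce(a.gr)
# write.table() takes the data first, then the output file name
write.table(as.data.frame(res), "out.bed", col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t")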

Here is the result:

   1 57267 59067  1801      *
   1 63165 64137   973      *
   1 67285 67596   312      *

Thank you, Nicolas, for this suggestion. It worked.

I found one more solution, which was quite helpful:

bedtools merge -i input.bed -c 4 -o collapse > merged_unique.bed

Hope this will help others too.
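
For anyone without bedtools installed, the same idea (sort, then sweep and keep the minimum start and maximum end of each overlapping group) can be sketched with sort and awk. This is only a rough sketch, not what bedtools does internally: the function name merge_bed is made up for illustration, and it assumes a tab-separated file whose first three columns are chromosome, start and end. Like the default bedtools merge, it also joins intervals that merely touch.

```shell
# merge_bed FILE: print merged intervals (chrom, min start, max end), tab-separated.
# Hypothetical helper; assumes columns 1-3 are chrom, start, end (tab-separated).
merge_bed() {
  # Sort by chromosome, then numerically by start, so overlaps become adjacent.
  sort -k1,1 -k2,2n "$1" | awk 'BEGIN { FS = OFS = "\t" }
    NF < 3 { next }                      # skip blank or malformed lines
    # First interval, new chromosome, or a gap: flush the open block, start a new one.
    !have || $1 != chrom || $2 + 0 > cur_end {
      if (have) print chrom, start, cur_end
      chrom = $1; start = $2; cur_end = $3 + 0; have = 1
      next
    }
    # Overlapping or touching interval: extend the end if it reaches further.
    $3 + 0 > cur_end { cur_end = $3 + 0 }
    END { if (have) print chrom, start, cur_end }'
}
```

Usage would be something like `merge_bed input.bed > merged.bed`. Two things worth knowing about the bedtools command above: bedtools merge expects its input sorted by chromosome and then by start, and `-c 4 -o collapse` only works if the file really has a fourth column to collapse.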


Since you tagged the question with the R tag, I assumed the solution should be in R. But using bedtools is also good.
