Question

split bed file with specific ranges

4.8 years ago

Mehmet ▴ 820

Dear all,

I have the BED file below. I want to split it into separate files based on the distance in bases (3 kb) between the start and end positions. For example, the lines from start position 12109 to end position 14678 should go into one file, because they fall within a 3 kb range. The line with start position 15573 and end position 15612 should go into another file, and so on.

Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences 
Sp_chr1 15573   15612 DNA Sequences 
Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences 
Sp_chr1 25346   25386 DNA Sequences 
Sp_chr1 26053   26093 DNA Sequences 
Sp_chr1 26129   26169 DNA Sequences 
Sp_chr1 27874   27913 DNA Sequences

sequence genome gene next-gen • 1.5k views

ADD COMMENT • link updated 4.7 years ago by Pierre Lindenbaum 164k • written 4.8 years ago by Mehmet ▴ 820

Unclear. Please add a representative output.

ADD REPLY • link 4.7 years ago by ATpoint 85k

Hi, the splitting condition is a span of 2999 bp, measured from the start position to the end.

The first file should include these:

Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences

The second file should include this:

Sp_chr1 15573   15612 DNA Sequences

The third file should include these:

Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences

ADD REPLY • link 4.7 years ago by Mehmet ▴ 820

The second file should include this: Sp_chr1 15573 15612 DNA Sequences

I don't get it: 15612 - 15573 = 39, so you could have added the next lines.

ADD REPLY • link 4.7 years ago by Pierre Lindenbaum 164k

Sorry for the confusion. 15573 plus 2999 equals 18572, but the next start position is 20498. Therefore, the lines below should be separated.

Sp_chr1 15573   15612 DNA Sequences

and

Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences

ADD REPLY • link 4.7 years ago by Mehmet ▴ 820

4.8 years ago

Pierre Lindenbaum 164k

I wrote http://lindenb.github.io/jvarkit/BedCluster.html

Run it with the option:

-S, --size
  number of bases max per bin.

Note, however, that a 'bin' stops being filled when its cumulative size is greater than the given size. Furthermore, the 4th column will be lost.

ADD COMMENT • link 4.8 years ago by Pierre Lindenbaum 164k

Hi,

I tried but it did not produce what I need.

ADD REPLY • link 4.7 years ago by Mehmet ▴ 820

score 2 · Accepted Answer · 2020-03-02

4.7 years ago

Pierre Lindenbaum 164k

Second answer, according to your output:

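rm -f tmp.*.bed && awk '
BEGIN { N = 0; PREV = -1 }
{
  B = int($2)
  if (PREV < 0) PREV = B   # first line sets the window start
  # start is more than 3000 bases past the window start: begin a new file
  if (B - PREV > 3000) { close(out); PREV = B; N++ }
  out = sprintf("tmp.%d.bed", N)
  printf("%s\n", $0) >> out
}' in.bed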
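
One thing the command above does not look at is the first column, so a window could in principle continue across a chromosome boundary. The following is only a sketch of a possible variant, not part of the original answer: it also starts a new file whenever the chromosome name changes. The threshold variable W, the variable names, and the Sp_chr2 line in the sample input are assumptions added for illustration. The other sample lines are taken from the question. W is set to 3000 to match the comparison used above. Set it to 2999 if the stricter limit mentioned in the question is wanted.

```shell
# Sample input: five lines from the question, plus one hypothetical
# Sp_chr2 line (not in the original data) to show the chromosome reset.
cat > in.bed <<'EOF'
Sp_chr1 12109 12149 DNA Sequences
Sp_chr1 14638 14678 DNA Sequences
Sp_chr1 15573 15612 DNA Sequences
Sp_chr1 20498 20538 DNA Sequences
Sp_chr1 21628 21668 DNA Sequences
Sp_chr2 21700 21740 DNA Sequences
EOF

rm -f tmp.*.bed
awk -v W=3000 '
{
    start = int($2)
    # Begin a new file on the first line, on a chromosome change, or when
    # this start lies more than W bases past the first start of the group.
    if (NR == 1 || $1 != chrom || start - anchor > W) {
        if (NR > 1) { close(out); n++ }
        chrom = $1
        anchor = start
        out = sprintf("tmp.%d.bed", n)
    }
    print $0 > out
}' in.bed
```

As with the original command, the input is assumed to be sorted by chromosome and start position. Each group is measured from the start of its first line, not from the previous line.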

ADD COMMENT • link 4.7 years ago by Pierre Lindenbaum 164k

Hi Pierre,

Thank you very much. This is what I needed.

ADD REPLY • link 4.7 years ago by Mehmet ▴ 820

Please validate the answer (green mark on the left) to close the question.

ADD REPLY • link 4.7 years ago by Pierre Lindenbaum 164k