split bed file with specific ranges
4.8 years ago
Mehmet ▴ 820

Dear all,

I have the BED file below. I want to split it into separate files based on base length (3 kb) between the start and the end position. For example, the records from start position 12109 to end position 14678 should go into one file, since they fall within a 3 kb range. The record with start position 15573 and end position 15612 should go into another file, and so on.

Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences 
Sp_chr1 15573   15612 DNA Sequences 
Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences 
Sp_chr1 25346   25386 DNA Sequences 
Sp_chr1 26053   26093 DNA Sequences 
Sp_chr1 26129   26169 DNA Sequences 
Sp_chr1 27874   27913 DNA Sequences
sequence genome gene next-gen • 1.5k views

Unclear. Please add a representative output.


Hi, the splitting condition is a length of 2999 bp from the start position to the end.

The first file should include these:

Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences

The second file should include this:

Sp_chr1 15573   15612 DNA Sequences

The third file should include these:

Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences

"The second file should include this: Sp_chr1 15573 15612 DNA Sequences"

I don't get it: 15612 - 15573 = 39, so you could have added the next lines.


Sorry for the confusion. 15573 plus 2999 equals 18572, but the next start position is 20498. Therefore, the records below should go into separate files.

Sp_chr1 15573   15612 DNA Sequences

and

Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences
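
The arithmetic in this reply can be checked with a few lines of shell. This is only a sketch: `window_end` is a hypothetical helper name, and 2999 is the span stated above.

```shell
# Sketch only: window_end is a hypothetical helper, not part of any tool.
# It returns the last position covered by a window opened at a given start.
window_end() {
    echo $(( $1 + ${2:-2999} ))
}

end=$(window_end 15573)        # 15573 + 2999 = 18572
if [ 20498 -gt "$end" ]; then
    echo "20498 is past $end, so it starts a new file"
fi
```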
4.8 years ago

Second answer, according to your output:

rm -f tmp.*.bed && awk 'BEGIN{N=0;PREV=-1;} {B=int($2);if(PREV<0) PREV=B; if(B-PREV>3000) {close(out);PREV=B;N++;} out=sprintf("tmp.%d.bed",N); printf("%s\n",$0) >> out;}'  in.bed
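
For anyone who wants to adapt the one-liner, here is a more readable sketch of the same idea wrapped in a shell function. The name `split_bed`, its prefix and window arguments, and the extra chromosome check are my own assumptions and not part of the original answer. It also assumes the input is sorted by chromosome and then by start. The default window of 3000 mirrors the one-liner; pass 2999 to follow the rule exactly as stated in the question.

```shell
#!/usr/bin/env bash
# Sketch only: split_bed and its arguments are illustrative names.
# Assumes the BED file is sorted by chromosome, then by start.
split_bed() {
    local bed="$1" prefix="${2:-tmp}" window="${3:-3000}"
    awk -v prefix="$prefix" -v window="$window" '
        NF < 3 { next }                 # skip blank or malformed lines
        {
            start = int($2)
            # Open a new output file for the first record, on a chromosome
            # change, or when this start is more than "window" bases past
            # the first start of the current file.
            if (out == "" || $1 != chrom || start - first > window) {
                if (out != "") close(out)
                out = sprintf("%s.%d.bed", prefix, n++)
                chrom = $1
                first = start
            }
            print > out                 # truncates on first write, then appends while open
        }
    ' "$bed"
}
```

Usage would look like `rm -f tmp.*.bed && split_bed in.bed tmp 3000`. On the example data this should give the same grouping as the one-liner, with files numbered from `tmp.0.bed`.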

Hi Pierre,

Thank you very much. This is what I needed.


Please validate the answer (green mark on the left) to close the question.

4.8 years ago

I wrote http://lindenb.github.io/jvarkit/BedCluster.html

Run it with the option:

-S, --size
  number of bases max per bin.

Note, however, that a 'bin' stops being filled once its cumulative size is greater than the given size. Furthermore, the 4th column will be lost.
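
To illustrate why cumulative-size binning differs from a fixed window measured from a start position, here is a rough awk sketch. It is not BedCluster's code and does not call BedCluster. `cumulative_bins` is a hypothetical name, and I am assuming that "cumulative size" means the sum of the interval lengths (end minus start) placed in a bin.

```shell
#!/usr/bin/env bash
# Sketch only: cumulative_bins is a hypothetical helper, not BedCluster.
# Assumption: "cumulative size" is the summed length (end - start) of a bin.
cumulative_bins() {
    local bed="$1" size="${2:-3000}"
    awk -v size="$size" '
        NF < 3 { next }                 # skip blank or malformed lines
        {
            printf "%d\t%s\n", bin, $0  # prefix each record with its bin number
            total += $3 - $2
            # The bin stops being filled once its summed length exceeds size.
            if (total > size) { bin++; total = 0 }
        }
    ' "$bed"
}
```

Under that assumption, the 15 example records add up to only 592 bases, so with a size of 3000 they would all land in one bin regardless of the gaps between them, which is not the grouping asked for in the question.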


Hi,

I tried it, but it did not produce what I need.
