How To Make Intervals In A Range Of Numbers?
6
3
Entering edit mode
12.0 years ago

I have a question related on how I could use a range given to split it in intervals of 100 length max. I'm looking for doing this in AWK, Shell Script, Perl or Python. Just to show a small example, I have a tab-delimited file that contains:

  • Chromosome Number (1 to 23)
  • Start range
  • End range
  • ID of the range
  • Number indicating how many times the range have to be splitted

This is how it looks:

chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6

And after taking each range, and split them in 100 intervals, the output should look like this. The last two columns are not important as much as obtain the ranges splitted, and I showed for this example to indicate the number of split and the length of the split, respectively:

chr1    37850    37950    1    100
chr1    37951    38051    2    100
chr1    38052    38152    3    100
chr1    38153    38253    4    100
chr1    38254    38354    5    100
chr1    38355    38455    6    100
chr1    38456    38536    7    80
chr1    820769    820869    1    100
chr1    820870    820970        2    100
chr1    820971    821071        3    100
chr1    821072    821172        4    100
chr1    821173    821273        5    100
chr1    821274    821374        6    100
chr1    821375    821475        7    100
chr1    821476    821576        8    100
chr1    821577    821677        9    100
chr1    821678    821778        10    100
chr1    821779    821857        11    78
chr1    5018483    5018583    1    100
chr1    5018584    5018684    2    100
chr1    5018685    5018785    3    100
chr1    5018786    5018886    4    100
chr1    5018887    5018987    5    100
chr1    5018988    5019041    6    53
perl python awk chip-seq • 14k views
ADD COMMENT
1
Entering edit mode

What is it you are trying to do? Summarize/bin data on genomic positions in windows/intervals? Then I suggest you use bedtools makewindows and then bedtools map to map you data on the genomic intervals

ADD REPLY
0
Entering edit mode

seems that we have a whitespace bug that we'll be fixing shortly

ADD REPLY
0
Entering edit mode

I found that tab delimited appears to not be recognized so I had to put the example in 4 space-delimited format

ADD REPLY
8
Entering edit mode
12.0 years ago

In Python:

fh = open('input_file.txt')
out = open('output_file.txt','w')

def chunk_range(x1, x2, numb):
    size = ((x2 - x1 + 1 ) / numb) + 1
    while x1 <= x2:
        next = x1 + size
        yield x1, min(next - 1, x2)
        x1 = next

for line in fh:
    line = line.split()
    chr = line[0]
    x1 = int(line[1])
    x2 = int(line[2])
    numb = int(line[4])
    n = 1
    for st, end in chunk_range(x1,x2,numb):
        out.write('%s\t%i\t%i\t%i\t%i\n' % (chr, st, end, n, end-st))
        n+=1

fh.close()
out.close()

input_file.txt

chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6

output_file.txt

chr1    37850       37948       1    98
chr1    37949       38047       2    98
chr1    38048       38146       3    98
chr1    38147       38245       4    98
chr1    38246       38344       5    98
chr1    38345       38443       6    98
chr1    38444       38536       7    92
chr1    820769      820868     1    99
chr1    820869      820968     2    99
chr1    820969      821068     3    99
chr1    821069      821168     4    99
chr1    821169      821268     5    99
chr1    821269      821368     6    99
chr1    821369      821468     7    99
chr1    821469      821568     8    99
chr1    821569      821668     9    99
chr1    821669      821768    10    99
chr1    821769      821857    11    88
chr1    5018483    5018576    1    93
chr1    5018577    5018670    2    93
chr1    5018671    5018764    3    93
chr1    5018765    5018858    4    93
chr1    5018859    5018952    5    93
chr1    5018953    5019041    6    88
ADD COMMENT
5
Entering edit mode
12.0 years ago

My solution using R, the input file according to me is the bedGraphs file (or a coverage file obtained from the read file in bed format)

# creating a test data set
data=data.frame(chrom=rep("chr1",2),start=c(37850,820769),end=c(38536,821857),tmp=rep("macsPeak",2),cov=c(7,11))

# or reading the whole file in, remove the head in next line to get it working and put hash in previous
#data=read.csv(commandArgs(TRUE)[1],sep='\t',header=FALSE)

# dividing the coverage (last column of peak file) into divisors of 100 for the whole file
tmp=sapply(data[,3]-data[,2],function(x)c(rep(100,floor(x/100)),x-floor(x/100)*100))

# splitting the file
covList=lapply(1:length(data[,1]),
   function(i){
     data.frame(chrom=rep(data[i,1],data[i,5]),
              start=data[i,2]+c(0,seq(100,(data[i,5]-1)*100,by=101)+1,tail(seq(100,(data[i,5]-1)*100,by=101)+1,1)+101),
              end=data[i,2]+c(0,seq(100,(data[i,5]-1)*100,by=101)+1,(tail(seq(100,(data[i,5]-1)*100,by=101)+1,1)+101)-(data[i,5]-1))+tmp[[i]],
              seq=seq(1,data[i,5]),reads=c(tmp[[i]][-length(tmp[[i]])],(tail(tmp[[i]],1)-data[i,5])+1))
     }
   )

# merging the data frames obtained in last step representing the each line of input file
final=do.call(rbind,lapply(1:length(covList),function(i)if(length(covList[[i]])>1)cbind(covList[[i]])))

# writing the file out
write.table(final,'bedGraph_expanded.tsv',sep='\t',row.names=F,quote=F)

Output

chrom  start    end seq reads
1   chr1  37850  37950   1   100
2   chr1  37951  38051   2   100
3   chr1  38052  38152   3   100
4   chr1  38153  38253   4   100
5   chr1  38254  38354   5   100
6   chr1  38355  38455   6   100
7   chr1  38456  38536   7    80
8   chr1 820769 820869   1   100
9   chr1 820870 820970   2   100
10  chr1 820971 821071   3   100
11  chr1 821072 821172   4   100
12  chr1 821173 821273   5   100
13  chr1 821274 821374   6   100
14  chr1 821375 821475   7   100
15  chr1 821476 821576   8   100
16  chr1 821577 821677   9   100
17  chr1 821678 821778  10   100
18  chr1 821779 821857  11    78

You can save the whole script and use as Rscript bedGraph2bed.R inputFile.bed

Cheers

ADD COMMENT
4
Entering edit mode
12.0 years ago
qiyunzhu ▴ 430

Here's my Perl solution:

my $input = "
chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6
";

foreach (split (/\n/, $input)){
    next unless $_;
    my ($chr, $start, $end, $id, $times) = split (/\s+/);
    my $i = 1;
    for ($i=1; $i<$times; $i++){
        print $chr."\t".($start+($i-1)*101)."\t".($start+$i*101)."\t$i\t100\n";
    }
    print $chr."\t".($start+($i-1)*101)."\t$end\t$i\t".($end-$start-($i-1)*101)."\n";
}

The output should be exactly the same as in your question.

And here's a Perl code to make the intervals even:

my $input = "
chr1    37850      38536      MACS_peak_1    7
chr1    820769     821857     MACS_peak_4    11
chr1    5018483    5019041    MACS_peak_9    6
";

foreach (split (/\n/, $input)){
    next unless $_;
    my ($chr, $start, $end, $id, $times) = split (/\s+/);
    my $range = int(($end-$start)/$times);
    foreach (1..$times-1){
        print $chr."\t".($start+$range*($_-1))."\t".($start+$range*$_-1)."\t$_\t".($range-1)."\n";
    }
    print $chr."\t".($start+$range*($times-1))."\t$end\t$times\t".($end-$start-$range*($times-1)-1)."\n";
}
ADD COMMENT
4
Entering edit mode
12.0 years ago

bedtools has a utility called makewindows that will do exactly this. It seems that the 5th column in your file is superfluous, as you are really just making 100bp windows from the original intervals. One note in your example is that your start coordinates are one higher than they should be as BED start coordinates are 0-based, whereas the end coordinates are 1-based. I hope this helps.

bedtools makewindows -b data.bed -w 100 -i winnum | awk -v OFS="\t" '{print $0,$3-$2}'
chr1    37850    37950    1    100
chr1    37950    38050    2    100
chr1    38050    38150    3    100
chr1    38150    38250    4    100
chr1    38250    38350    5    100
chr1    38350    38450    6    100
chr1    38450    38536    7    86
chr1    820769    820869    1    100
chr1    820869    820969    2    100
chr1    820969    821069    3    100
chr1    821069    821169    4    100
chr1    821169    821269    5    100
chr1    821269    821369    6    100
chr1    821369    821469    7    100
chr1    821469    821569    8    100
chr1    821569    821669    9    100
chr1    821669    821769    10    100
chr1    821769    821857    11    88
chr1    5018483    5018583    1    100
chr1    5018583    5018683    2    100
chr1    5018683    5018783    3    100
chr1    5018783    5018883    4    100
chr1    5018883    5018983    5    100
chr1    5018983    5019041    6    58

if you really meant to skip one base between each interval, you'll need to change the awk statement. But I think this is what you want. As a control, load your file into UCSC and see what the windows look like.

ADD COMMENT
0
Entering edit mode

The only problem is that the next line is not offset from the last by one. You may want to re-check the expected output

ADD REPLY
0
Entering edit mode

My hunch is that the OP didn't intend to create BED windows that skip a base between each window. If intentional, a very minor change to the awk statement will solve this. Updated to demonstrate.

ADD REPLY
0
Entering edit mode

It's sounds like you're probably correct - you see, I don't have any experience with BED files or bedtools. However, I don't think the minor change that you've made to your awk command would work as intended. Did you test the code?

ADD REPLY
0
Entering edit mode

Right you are. Removed the edit under the assumption that this is what they want.

ADD REPLY
3
Entering edit mode
12.0 years ago
steve ▴ 40

I have followed this from here:

http://stackoverflow.com/questions/13551174/how-to-make-intervals-in-a-range-of-numbers

Here's one way using awk:

awk -v num=100 '{ for (i=1;i<=$NF;i++) { print $1, $2, var=(i==$NF?$3:$2+=num), i, (var-$2==0?"100":var-$2); $2++ } }' file

Results:

chr1 37850 37950 1 100
chr1 37951 38051 2 100
chr1 38052 38152 3 100
chr1 38153 38253 4 100
chr1 38254 38354 5 100
chr1 38355 38455 6 100
chr1 38456 38536 7 80
chr1 820769 820869 1 100
chr1 820870 820970 2 100
chr1 820971 821071 3 100
chr1 821072 821172 4 100
chr1 821173 821273 5 100
chr1 821274 821374 6 100
chr1 821375 821475 7 100
chr1 821476 821576 8 100
chr1 821577 821677 9 100
chr1 821678 821778 10 100
chr1 821779 821857 11 78
chr1 5018483 5018583 1 100
chr1 5018584 5018684 2 100
chr1 5018685 5018785 3 100
chr1 5018786 5018886 4 100
chr1 5018887 5018987 5 100
chr1 5018988 5019041 6 53
ADD COMMENT
0
Entering edit mode

+1 Awk is really powerful :)

ADD REPLY
0
Entering edit mode
12.0 years ago

Thanks everyone for your help! it worked!!!! :)

ADD COMMENT
0
Entering edit mode

This isn't an answer, so you should put it as a comment in your original question and delete this "answer". Glad you found a solution to your problem!

ADD REPLY

Login before adding your answer.

Traffic: 1204 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6