Question

AWK: if value falls within a range, print sum of values

2

Entering edit mode

7.2 years ago

heso ▴ 40

I've got two sorted tab-delimited files:

input.txt includes chromosome, location, gene name, strandedness

 input.txt
    10      282035  282125  RNA1     -
    10      4134522 4134564 RNA1     -
    10      5299783 5299910 RNA2     -
    10      5900317 5900359 RNA1     -

ref2.txt includes read count, chromosome, location

ref.txt
          1 9       137792944
          1 9       137792945
          1 10      282074
          4 10      282095
          4 10      5900329

I want to print a sum on values IF certain criteria is met.

Namely:

IF ref$2==input$1

AND

ref$3 falls within a range of min==input$2 && max==input$3

Print input$0 and sum of ref$1 (as input$6) else print zero (as input$6) So the result should look like that:

10      282035  282125  RNA1     -  5
10      4134522 4134564 RNA1     -  0
10      5299783 5299910 RNA2     -  0
10      5900317 5900359 RNA1     -  4

This is what I came up with:

awk '
NR == FNR {min[NR]=$2; max[NR]=$3; chr[NR]=$1; next}
 {                
     for (id in min) 
         if (($2==chr[NR])&&(min[id] < $3 && $3 < max[id])) {
             print $0, sum+=$1
             break              
         }
}                                     
' input.txt ref.txt > output.txt

There's clearly something wrong here, since I don't get any output. Also, I'm still missing "else print zero".

Can somebody help me please?

awk • 4.2k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 36k • written 7.2 years ago by heso ▴ 40

score 3 · Answer 1 · 2017-09-11

3

Entering edit mode

7.2 years ago

Pierre Lindenbaum 164k

assuming tab is the delimiter and there is no leading spaces.

linux join is the tool for joining two files.

$ join -t $'\t' -1 1 -2 2 <(sort -t $'\t' -k1,1 input.txt) <(sort -t $'\t' -k2,2 ref2.txt) | uniq | awk -F '\t' '{line=sprintf("%s\t%s\t%s\t%s\t%s",$1,$2,$3,$4,$5);if($7>=$2 && $7<=$3) {a[line]+=int($6);} else {a[line]+=0;}} END {for(line in a) printf("%s\t%d\n",line,a[line]);}'

10  4134522 4134564 RNA1    -   0
10  282035  282125  RNA1    -   5
10  5299783 5299910 RNA2    -   0
10  5900317 5900359 RNA1    -   4

ADD COMMENT • link 7.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks for the reply! I don't seem to get it to work though (no output). I think if($7>=$2 && $7<=$3) should rather be if($8>=$2 && $8<=$3) but even that does not solve the issue...

ADD REPLY • link 7.2 years ago by heso ▴ 40

score 3 · Answer 2 · 2017-09-11

This is a mathematical operation on intervals. You could use a toolkit dedicated to this purpose, like BEDOPS, which includes a tool called bedmap that lets you map — or associate — one file to another and do numerical and other operations, where there are associations.

With BEDOPS installed, you could convert both files to sorted BED and use bedmap --sum:

Sort the first file:

$ sort-bed input.txt > input.bed

Convert the second file into a five-column BED file and sort it:

$ awk -v OFS="\t" '{ print $2, $3, ($3+1), ".", $1 }' ref.txt | sort-bed - > ref.bed

Map the two files:

$ bedmap --echo --sum input.bed ref.bed > answer.bed

The --echo option reports each interval from input.bed, while the --sum operation sums the score data in the fifth column of ref.bed, where its intervals overlap an interval from input.bed.

While awk is awesome — I post lots of answers that use it — there is a better toolkit for this. If you think of your files as intervals and put them into that "shape", you can do these sorts of overlap calculations very quickly and correctly with the right tools.