Question

Merge series and recognize end of it - AWK

0

Entering edit mode

7.2 years ago

Korsocius ▴ 270

Dear all,

could you help me please merge rows by coordinates in column $2. There are series of coordinates growing by one. I wan to output f.e. : 1st row merge to 4th row 9079811-9079814 and after that there is no series so merge it to another row etc.. for 3rd column in input I would like to count average.

I wrote some script, but this script merge all rows from first coordinate to last coordinate. no condition to series.

 awk -F'\t' -v OFS="\t" '{print $2,$4,$3,$1}' input | awk '!x[$2]{x[$2]=$1}y[$2]<$1{y[$2]=$1;}x[$2]>$1{x[$2]=$1} {sum+=$3} END{for(i in y)print $1,x[i],y[i],sum/NR,i}' | sort -V -k1,1 > output

INPUT:

chr12   9079811 29  A2M
chr12   9079812 29  A2M
chr12   9079813 29  A2M
chr12   9079814 28  A2M
chr12   9091202 5   A2M
chr12   9091203 5   A2M
chr12   9091204 5   A2M
chr12   9091390 15  A2M
chr12   9091391 15  A2M
chr12   9091392 13  A2M

OUTPUT:

chr12  9079811  9079814 28.75 A2M
chr12  9091202  9091204 5     A2M
chr12  9091390  9091392 14.3  A2M

AWK Merge • 1.7k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 36k • written 7.2 years ago by Korsocius ▴ 270

0

Entering edit mode

Why do you _need_ to use awk? Python is better suited, as your programming will be more readable.

ADD REPLY • link 7.2 years ago by Ram 45k

0

Entering edit mode

I think it won't be a problem to do it in python, but..I am beginner in this language :-(

ADD REPLY • link 7.2 years ago by Korsocius ▴ 270

0

Entering edit mode

All the more reason to use it - how else would you learn?

ADD REPLY • link 7.2 years ago by Ram 45k

score 1 · Answer 1 · 2018-01-26

You may want to use standard set and map operation toolkits, like BEDOPS.

BEDOPS will do a lot of the heavy lifting with set and map operations and reduce the likelihood of errors. It will almost definitely be faster than other tools or scripting languages.

First, convert the input to a sorted BED file, using awk and BEDOPS sort-bed:

$ awk -vOFS="\t" '{ print $1, $2, ($2+1), $4, $3; }' input.txt | sort-bed - > input.bed

Note that we reverse the order of score and ID fields. This is so that the resulting BED file follows UCSC convention, where the ID is in the fourth column, and the score data are in the fifth column.

Merge the intervals with BEDOPS bedops --merge:

$ bedops --merge input.bed > merge.bed

Map the input BED intervals to the merged intervals, using the --mean and --echo-map-id-uniq operators with BEDOPS bedmap to get back your answer:

$ bedmap --echo --mean --echo-map-id-uniq --delim '\t' merge.bed input.bed > answer.bed

Each line of answer.bed contains the merged interval, the mean of signal from input.bed over the merged interval, and the unique ID or IDs from input.bed over the merged interval.

The bedmap tool has lots of other score operators. So while you may want to use --mean here, you could also use --median or --min, --max, --sum, --stdev etc. depending on what you might need to do with signal down the road.

score 0 · Answer 2 · 2018-01-26

Hi Korsocius, please try this solution:

awk 'BEGIN {previousPos=9079810; nextPost=0; sum=0; linecount=0; startPos=0} {nextPos=$2; if (previousPos+1==nextPos) {previousPos=nextPos; sum+=$3; linecount+=1} else {print $1"\t"startPos"\t"previousPos"\t"sum/linecount"\t"$4; previousPos=nextPos; sum=0; sum+=$3; linecount=1}; if (linecount==1) startPos=$2} END {print $1"\t"startPos"\t"previousPos"\t"sum/linecount"\t"$4} ' input

chr12   9079811 9079814 28.75   A2M
chr12   9091202 9091204 5   A2M
chr12   9091390 9091392 14.3333 A2M