5.0 years ago
Hann
Hello,
I am trying to use awk to calculate the average of the values in column 7 across all rows that share the same value in column 2.
example data:
#CHROM wind_start wind_end CHROM POS POS relative_t_o relative_o_t
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 1250 1250 1 1
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 2215 2215 1 1
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 2278 2278 1 1
Dexi_CM05836_chr09A 10001 20000 Dexi_CM05836_chr09A 10452 10452 1 1.095238095
Dexi_CM05836_chr09A 40001 50000 Dexi_CM05836_chr09A 46251 46251 1 1.047619048
Dexi_CM05836_chr09A 40001 50000 Dexi_CM05836_chr09A 41892 41892 1 1
Dexi_CM05836_chr09A 110001 120000 Dexi_CM05836_chr09A 109479 109479 1 0.673992674
Dexi_CM05836_chr09A 140001 150000 Dexi_CM05836_chr09A 141093 141093 0.913043478 0.727272727
Dexi_CM05836_chr09A 140001 150000 Dexi_CM05836_chr09A 141446 141446 1 1
So all rows where the column wind_start has the value 1 should be averaged together on column 7, and likewise for every other wind_start value. The result would look like this:
wind_start average_relative_t_o
1 1
10001 1
40001 1
110001 1
140001 0.955
I started with this:
awk '{b[$2] total +=$7}END {print total/NR}} END { for (i in b) { print b[i],i } } ' file
I know that this will calculate the overall average of column 7:
awk '{ total += $7 } END { print total/NR }' file
Any help would be appreciated
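
The attempt above fails for a few reasons: `b[$2] total` has no separator between the two statements, there is an extra closing brace before the second `END`, and `total/NR` divides by every line in the file (header included) rather than by the number of rows in each group. A minimal sketch of the per-group version follows, assuming whitespace-separated input with a single header line starting with `#`. The file name `windows.txt` and the function name `window_means` are made up for illustration.

```shell
# Sample input copied from the question.
cat > windows.txt <<'EOF'
#CHROM wind_start wind_end CHROM POS POS relative_t_o relative_o_t
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 1250 1250 1 1
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 2215 2215 1 1
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 2278 2278 1 1
Dexi_CM05836_chr09A 10001 20000 Dexi_CM05836_chr09A 10452 10452 1 1.095238095
Dexi_CM05836_chr09A 40001 50000 Dexi_CM05836_chr09A 46251 46251 1 1.047619048
Dexi_CM05836_chr09A 40001 50000 Dexi_CM05836_chr09A 41892 41892 1 1
Dexi_CM05836_chr09A 110001 120000 Dexi_CM05836_chr09A 109479 109479 1 0.673992674
Dexi_CM05836_chr09A 140001 150000 Dexi_CM05836_chr09A 141093 141093 0.913043478 0.727272727
Dexi_CM05836_chr09A 140001 150000 Dexi_CM05836_chr09A 141446 141446 1 1
EOF

# window_means FILE: mean of column 7 for each distinct value of column 2.
window_means() {
  awk '
    /^#/ { next }                      # skip the header line
    { sum[$2] += $7; n[$2]++ }         # per-group running sum and row count
    END { for (k in sum) print k, sum[k] / n[k] }
  ' "$1" | sort -n                     # "for (k in sum)" has no defined order, so sort numerically
}

window_means windows.txt
```

Because the sums are kept in arrays keyed by column 2, the input does not need to be sorted for this to work. On the sample data the 140001 window comes out at about 0.9565, that is (0.913043478 + 1) / 2, so the 0.955 in the expected output above looks like a rough estimate.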
with datamash:
It's a nice way to do it, but for some reason the output jumps after 30 kb.
example:
I am not sure what is going on. Probably the input is not sorted properly. Could you please post a few records from around the place where the jump happens (e.g. 10 records before and 10 records after it, with identical 2nd column values)? @ haneenih7
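
To illustrate why sort order can matter: tools that group consecutive runs of equal keys only give one row per window if the input is sorted on that key first. A minimal awk sketch of this "sort, then group runs" approach follows. The shuffled sample file `unsorted_windows.txt` and the function name `sorted_window_means` are hypothetical, and it assumes the same column layout as the example data.

```shell
# Hypothetical sample: rows from the question, deliberately out of order.
cat > unsorted_windows.txt <<'EOF'
#CHROM wind_start wind_end CHROM POS POS relative_t_o relative_o_t
Dexi_CM05836_chr09A 140001 150000 Dexi_CM05836_chr09A 141093 141093 0.913043478 0.727272727
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 1250 1250 1 1
Dexi_CM05836_chr09A 140001 150000 Dexi_CM05836_chr09A 141446 141446 1 1
Dexi_CM05836_chr09A 1 10000 Dexi_CM05836_chr09A 2215 2215 1 1
EOF

# sorted_window_means FILE: drop the header, sort numerically on column 2,
# then print one mean of column 7 per run of identical column-2 values.
sorted_window_means() {
  grep -v '^#' "$1" | sort -k2,2n | awk '
    NR == 1 || $2 != key {             # a new run of keys starts here
      if (n) print key, sum / n        # flush the previous run, if any
      key = $2; sum = 0; n = 0
    }
    { sum += $7; n++ }
    END { if (n) print key, sum / n }  # flush the last run
  '
}

sorted_window_means unsorted_windows.txt
```

Without the `sort -k2,2n` step, the shuffled file would produce several partial rows for the same wind_start value instead of one row per window.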
If sorting is the issue, try this with tsv-utils: