I recently enquired about an AWK script to keep summing a column until it reaches a certain value then print that line.
The awk script I got from Alex Reynolds was very helpful and is shown below:
$ awk '
BEGIN {
    s = 0                # running total of column 4
}
{
    s += $4
    if (s >= 100) {      # threshold reached: print this line and stop
        print $0
        exit
    }
}' chr1.bedgraph
I would run this on chr1.bedgraph:
chr1.bedgraph
chr1 1000 2000 25
chr1 2000 3000 50
chr1 3000 4000 25
chr1 4000 5000 30
The awk script would then print the line where the running sum of $4 reaches 100:
chr1 3000 4000 25
I now want to replace "100" in the line "if (s >= 100)" with every value in values.txt (apart from $1).
values.txt
sample1_chr1 200 50 90
sample2_chr1 300 60 40
sample3_chr1 400 20 40
So the script would essentially use the numbers in values.txt: $2, $3 and $4 from line 1, then move on to line 2, line 3 and so on.
The output would be the line from chr1.bedgraph where the running sum reaches 200, then below that the line for 50, then the line for 90. It would then print the lines where the sum reaches 300, 60 and 40, and so on.
Any thoughts? I'm not a very experienced programmer and I have been trying to do this for a while now.
Many thanks
If it gets slightly more complicated (like your question now), I guess it is time to move away from awk to something like Python. It's not completely clear what you are trying to accomplish (and for what purpose).
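For what it's worth, here is a minimal Python sketch of the request as literally worded: for every threshold in values.txt, report the first line of chr1.bedgraph at which the running sum of column 4 reaches that threshold. It assumes whitespace-separated input, non-negative scores (so the running sum never decreases and a binary search is valid), and that the scan restarts from the top for each threshold. The function names are made up for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def load_bedgraph(lines):
    """Split bedgraph lines into fields and build the running sum of column 4."""
    rows = [line.split() for line in lines if line.strip()]
    cumulative = list(accumulate(float(row[3]) for row in rows))
    return rows, cumulative


def first_row_reaching(rows, cumulative, threshold):
    """Return the first row whose running sum is >= threshold, or None."""
    # Assumes non-negative scores, so `cumulative` is sorted.
    i = bisect_left(cumulative, threshold)
    return rows[i] if i < len(rows) else None


def scan(bedgraph_lines, values_lines):
    """For each sample line, look up every threshold (all columns after $1)."""
    rows, cumulative = load_bedgraph(bedgraph_lines)
    results = []
    for line in values_lines:
        fields = line.split()
        if not fields:
            continue
        sample, thresholds = fields[0], fields[1:]
        for t in thresholds:
            hit = first_row_reaching(rows, cumulative, float(t))
            results.append((sample, t, hit))
    return results


if __name__ == "__main__":
    # Example data copied from the question; real use would read the files,
    # e.g. scan(open("chr1.bedgraph"), open("values.txt")).
    bedgraph = [
        "chr1 1000 2000 25",
        "chr1 2000 3000 50",
        "chr1 3000 4000 25",
        "chr1 4000 5000 30",
    ]
    values = [
        "sample1_chr1 200 50 90",
        "sample2_chr1 300 60 40",
        "sample3_chr1 400 20 40",
    ]
    for sample, t, hit in scan(bedgraph, values):
        print(sample, t, " ".join(hit) if hit else "not reached")
```

With the example files the column sums to only 130, so the thresholds 200, 300 and 400 are reported as "not reached". The sketch also ignores the chromosome part of the sample name; matching samples to chromosomes or domains would need extra logic.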
You are right - I feel I am pushing the limits of awk and it's probably time to move over to Python. My apologies for not being clearer; the example I showed wasn't very well put together. I am basically obtaining "median" values over large peak domains in ChIP-seq, with the intention of showing movement of these genetic loci between individuals (sample1, sample2, sample3). I obtain a total read count across the domain and extract the point where the median read count value lies. Think of it as a centre of gravity (the point below which 50% of the reads lie). I then want to find the 5% and 95% points. My values.txt file contains 4 columns: sample_chr, 5%, 50%, 95%. If I had a script that used my 5%, 50% and 95% read count values to scan my predefined domains (like the awk script was doing), it would make the processing a lot faster. Is this a little clearer?
Note: my values.txt file does not contain the real values, which probably made things more confusing.
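If the thresholds are simply fractions of the total read count over a domain, they may not need to be precomputed at all. Below is a rough Python sketch that derives the 5%, 50% and 95% points directly from the bedgraph lines of one domain. It assumes column 4 holds non-negative read counts and that all lines passed in belong to a single domain; the function name and the `fractions` parameter are hypothetical, and this is one possible reading of the description, not a tested pipeline.

```python
from bisect import bisect_left
from itertools import accumulate


def percentile_rows(bedgraph_lines, fractions=(0.05, 0.50, 0.95)):
    """Map each fraction to the first row where the running sum of
    column 4 reaches that fraction of the domain's total."""
    rows = [line.split() for line in bedgraph_lines if line.strip()]
    cumulative = list(accumulate(float(row[3]) for row in rows))
    if not cumulative:
        return {}
    total = cumulative[-1]
    result = {}
    for f in fractions:
        # Binary search is valid because non-negative counts keep
        # the running sum sorted.
        i = bisect_left(cumulative, f * total)
        # Guard against floating-point rounding pushing i past the end.
        result[f] = rows[min(i, len(rows) - 1)]
    return result


if __name__ == "__main__":
    # Example data copied from the question (total of column 4 is 130).
    domain = [
        "chr1 1000 2000 25",
        "chr1 2000 3000 50",
        "chr1 3000 4000 25",
        "chr1 4000 5000 30",
    ]
    for fraction, row in percentile_rows(domain).items():
        print(f"{fraction:.0%}", " ".join(row))
```

On the four example lines the 5% point falls in the first interval, the 50% point in the second (running sum 75 against a target of 65), and the 95% point in the last. Running this once per sample and domain would give the positions to compare between individuals.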