Hi all,
I am working with an mpileup file (generated via samtools) composed of pool-seq WGS of multiple populations. I want to remove all positions with read coverage <5 across any population.
Mpileup files are formatted so that the second column refers to the position, and all other columns containing integers refer to coverage. As such, I want the script to ignore the second column, but remove any lines with coverage between 0-5. I have tried both of the following scripts with vastly different results. Can anyone explain?
Which script is correct and how do they differ?
Script 1
awk '{ for (i = 1; i <= NF; i++) { if (i != 2 && $i < 5) next } } 1' input.mpileup > filteredoutput.mpileup
The output file here contains only 348570 positions
Script 2
awk 'BEGIN {OFS="\t"} {
flag = 0
for (i = 1; i <= NF; i++) {
if (i != 2 && int($i) == $i && $i >= 0 && $i <= 5) {
flag = 1
break
}
}
if (flag == 0) {
print
}
}' input.mpileup > filteredoutput.mpileup
The output file here contains 418393380 positions
Thanks!
Hi Philipp,
Thanks for the answer.
Unfortunately, because I have 24 populations, I have multiple columns (4,7,etc) corresponding to read depth. I suppose I could list all columns to be searched, but I thought excluding column 2 would be easier?
Thanks, MK