Question

remove low coverage mpileup file

0

Entering edit mode

18 months ago

mk-dna • 0

Hi all,

I am working with an mpileup file (generated via samtools) composed of pool-seq WGS of multiple populations. I want to remove all positions with read coverage <5 across any population.

Mpileup files are formatted so that the second column refers to the position, and all other columns containing integers refer to coverage. As such, I want the script to ignore the second column, but remove any lines with coverage between 0-5. I have tried both of the following scripts with vastly different results. Can anyone explain?

Which script is correct and how do they differ?

Script 1

awk '{ for (i = 1; i <= NF; i++) { if (i != 2 && $i < 5) next } } 1' input.mpileup > filteredoutput.mpileup

The output file here contains only 348570 positions

Script 2

awk 'BEGIN {OFS="\t"} {
    flag = 0
    for (i = 1; i <= NF; i++) {
        if (i != 2 && int($i) == $i && $i >= 0 && $i <= 5) {
            flag = 1
            break
        }
    }
    if (flag == 0) {
        print
    }
}' input.mpileup > filteredoutput.mpileup

The output file here contains 418393380 positions

Thanks!

mpileup coverage pool-seq • 811 views

ADD COMMENT • link 18 months ago by mk-dna • 0

score 0 · Answer 1 · 2023-10-29

I don't really understand the logic behind both scripts, why do you have the $i operator at all? You're iterating over all columns of all rows, but you need to look only at DP?

Here's an example samtools-generated mpileup line with a depth of 4:

1       152715  T       4       ,,..    EEEA

I can't test Section 2 on that, I get a syntax error for that.

To filter by read-depth, all you have to do is

awk '{if ($4 > 5) print}' input > output

Here's the same for bcftools:

Here's an example mpileup line just made with bcftools mpileup v1.15, a small indel where reference A is replaced by AC

1       157103  .       A       AC      0       .       INDEL;IDV=4;IMF=1;DP=4;I16=0,0,3,1,0,0,160,6400,0,0,168,7056,0,0,95,2275;QS=0,1;VDB=0.014407;SGB=-0.556411;MQSBZ=0;FS=0;MQ0F=0       PL      140,12,0

So we have four reads (DP=4), where it's easiest to use bcftools directly:

bcftools view -i'DP>5'  input.mpileup > filteredoutput.mpileup