remove low coverage mpileup file
1
0
Entering edit mode
13 months ago
mk-dna • 0

Hi all,

I am working with an mpileup file (generated via samtools) composed of pool-seq WGS of multiple populations. I want to remove all positions with read coverage <5 across any population.

Mpileup files are formatted so that the second column refers to the position, and all other columns containing integers refer to coverage. As such, I want the script to ignore the second column, but remove any lines with coverage between 0-5. I have tried both of the following scripts with vastly different results. Can anyone explain?

Which script is correct and how do they differ?

Script 1

awk '{ for (i = 1; i <= NF; i++) { if (i != 2 && $i < 5) next } } 1' input.mpileup > filteredoutput.mpileup

The output file here contains only 348570 positions

Script 2

awk 'BEGIN {OFS="\t"} {
    flag = 0
    for (i = 1; i <= NF; i++) {
        if (i != 2 && int($i) == $i && $i >= 0 && $i <= 5) {
            flag = 1
            break
        }
    }
    if (flag == 0) {
        print
    }
}' input.mpileup > filteredoutput.mpileup

The output file here contains 418393380 positions

Thanks!

mpileup coverage pool-seq • 606 views
ADD COMMENT
0
Entering edit mode
13 months ago

I don't really understand the logic behind both scripts, why do you have the $i operator at all? You're iterating over all columns of all rows, but you need to look only at DP?

Here's an example samtools-generated mpileup line with a depth of 4:

1       152715  T       4       ,,..    EEEA

I can't test Section 2 on that, I get a syntax error for that.

To filter by read-depth, all you have to do is

awk '{if ($4 > 5) print}' input > output

Here's the same for bcftools:

Here's an example mpileup line just made with bcftools mpileup v1.15, a small indel where reference A is replaced by AC

1       157103  .       A       AC      0       .       INDEL;IDV=4;IMF=1;DP=4;I16=0,0,3,1,0,0,160,6400,0,0,168,7056,0,0,95,2275;QS=0,1;VDB=0.014407;SGB=-0.556411;MQSBZ=0;FS=0;MQ0F=0       PL      140,12,0

So we have four reads (DP=4), where it's easiest to use bcftools directly:

bcftools view -i'DP>5'  input.mpileup > filteredoutput.mpileup
ADD COMMENT
0
Entering edit mode

Hi Philipp,

Thanks for the answer.

Unfortunately, because I have 24 populations, I have multiple columns (4,7,etc) corresponding to read depth. I suppose I could list all columns to be searched, but I thought excluding column 2 would be easier?

Thanks, MK

ADD REPLY

Login before adding your answer.

Traffic: 1968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6