awk: cmd. line:1: fatal: division by zero attempted when calculating average depth when using subprocess
0
0
Entering edit mode
2.4 years ago
Manuel ▴ 50

I am analysing the average depth of a bam file I show the full code below.

When I used in the terminal this

 awk '{ total += $3; count++ } END { print total/count }' AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt

This gave me 131.3

However when I put this in a subprocess

 command = "awk '{ total += $3; count++ } END { print total/count }' AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt"

subprocess.Popen(command, shell=True)

I got this awk: cmd. line:1: fatal: division by zero attempted

Why??

The input file in the same in both cases

This is the full script

     #!/usr/bin/env python3



import subprocess
import sys


subprocess.Popen('module load samtools', shell=True)
# Take bam file to analysis
#bam = input()
bam = '220706_M00321_0141_000000000-KFRDV_validation8_WRGL4_1190863240.bam'
print(f'Analysing Bam file {bam}')

# Limite the bam file to ROI taken tyhe bed file used


def run(cmd) :
   proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
   proc.communicate()

cmd = f"samtools view  {bam} --target-file ../WRGL4_hg19_nonstandard-expanded_v04.bed -bo ROI.bam"
run(cmd)



print('BAM file limited to ROI created')

print('Getting average depth of all bases')


def run(coverage) :
   proc = subprocess.Popen(coverage, shell=True, stdout=subprocess.PIPE)
   proc.communicate()

coverage = f"samtools depth ROI.bam > AVERAGE_DEPTH.txt"
run(coverage)
print('AVERAGE_DEPTH.txt created')

print('Removing unknown chromosomes')


keep = ('1','2','3','4','5','6','7','8','9','X','Y')

with open('AVERAGE_DEPTH.txt', 'r') as inputFile, open('AVERAGE_DEPTH_ONLY_GOOD_CHR.txt', 'w') as outputFile:
    for line in inputFile.readlines():
        if  line.startswith (keep):
            outputFile.writelines(line)


print('AVERAGE_DEPTH_ONLY_GOOD_CHR.txt created')

subprocess.Popen("awk '$3 >= 21' AVERAGE_DEPTH_ONLY_GOOD_CHR.txt  > AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt", shell=True, stdout=subprocess.PIPE)

print('AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt created')

print('\n Average depth is:')

command = "awk '{ total += $3; count++ } END { print total/count }' AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt"

subprocess.Popen(command, shell=True)
average_depth awk • 1.3k views
ADD COMMENT
0
Entering edit mode

That error seems to imply an empty file, and your command works fine with a made-up test file for me. Are you sure you tested it with the same file contents your script is working with? If that's not it, maybe try with a simplified awk command for troubleshooting, like:

command = "awk 'NR <= 10' AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt"
subprocess.Popen(command, shell=True)

That will just print the first 10 lines, so, like running head. (But again, I expect your actual file is empty so this wouldn't show you anything.)

ADD REPLY
0
Entering edit mode

The file is not empty because I have delete every file create in previous run, I have run the script again and then I have done

head AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt

1       2160202 2
1       2160203 2
1       2160204 2
1       2160205 2
1       2160206 2
1       2160207 2
1       2160208 2
1       2160209 2
1       2160210 2
1       2160211 2

I have introduce the new line you proposed in your comment and the error is even more strange. I show you the full output

Analysing Bam file 220706_M00321_0141_000000000-KFRDV_validation8_WRGL4_1190863240.bam
BAM file limited to ROI created
Getting average depth of all bases
AVERAGE_DEPTH.txt created
Removing unknown chromosomes
AVERAGE_DEPTH_ONLY_GOOD_CHR.txt created
AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt created

 Average depth is:
[mdb1c20@cyan52 1190863240]$ 220706_M00321_0141_000000000-KFRDV_validation8_WRGL4_1190863240.bam
-bash: ./220706_M00321_0141_000000000-KFRDV_validation8_WRGL4_1190863240.bam: cannot execute binary file

Thanks for your time by the way

ADD REPLY
0
Entering edit mode

That is even weirder. I'm not sure what's going on there.

Maybe just run the problematic awk itself within Python, to narrow it down? You can explicitly split the arguments yourself as a list before calling the process rather than invoking a separate shell instance, which the Python docs claim is supposedly necessary when there are arguments:

On POSIX, if args is a string, the string is interpreted as the name or path of the program to execute. However, this can only be done if not passing arguments to the program.

Even though giving extra arguments in a string does seem to work as far as I can tell... but I always use a list, myself. I think that will make the situation simpler:

#!/usr/bin/env python3
import subprocess
command = ["awk", "NR <= 10", "AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt"]
subprocess.Popen(command)

But you could even drop awk altogether, really... you could take the average of your third column with few lines of plain Python and not need a subprocess at all.

#!/usr/bin/env python
with open("AVERAGE_DEPTH_GOOD_CHR_MORE_THAN_20.txt") as f_in:
    nums = [int(line.strip().split("\t")[2]) for line in f_in]
    print(sum(nums)/len(nums))
ADD REPLY

Login before adding your answer.

Traffic: 1943 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6