10.1 years ago
always_learning
Hi all,
I am using the pysam module in one of my scripts, which works on fairly large VCF files. The job does not complete every time and fails with a memory error, even though I tried running it on a larger and faster machine. Has anyone faced a similar issue with pysam and large files?
This is my Python script:
import sys
import os
import pysam

freq_dir_file = sys.argv[1]
vcf_dir_file = sys.argv[2]
snp_pos = []

# Print the VCF header lines
os.environ['vcf_file'] = vcf_dir_file
os.system("zcat $vcf_file | head -5000 | parallel --pipe grep '^#'")

data = open(freq_dir_file)
for line in data:
    if not line.startswith("CHROM") and not line.strip().split("\t")[0] == "NA":
        check = 0  # reset for every row; otherwise it stays 1 after the first match
        col = line.strip().split("\t")[4:]
        for i in col:
            val = i.strip().split(":")[1]
            num = float(val)
            #Comment this if its for Low frequency variants
            #if num > 0.005 and num < 0.050:
            #Comment this if its for coding region
            if num < 0.005 and not num == 0:
                check = 1
        if check == 1:
            chmpos = line.strip().split("\t")[0] + " " + line.strip().split("\t")[1]
            snp_pos.append(chmpos)
data.close()

# Fetch the VCF record at each selected position
tabixfile = pysam.Tabixfile(vcf_dir_file)
for i in snp_pos:
    (chrom, snp) = i.split(" ")[0], i.split(" ")[1]
    val = int(snp) - 1
    for vcf in tabixfile.fetch(str(chrom), val, int(snp)):
        print(vcf)
Are you sure this is a memory leak in pysam? Python itself isn't exactly the best with memory management, so if freq_dir_file is large then I could see snp_pos blowing up the available memory. Having said that, I've never looked at the underlying tabix C code, so perhaps there's an issue there.

Since I am working with 32 GB of RAM, it is highly unlikely that snp_pos is using up the whole system memory.

Anyone?
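
One way to take snp_pos out of the picture is to not build the list at all and yield each position as it is found. The sketch below is only an illustration: the function name rare_variant_positions, the low/high parameters, and the demo input are made up here, and it assumes the frequency file has the tab-separated layout the script implies (chromosome, position, two more columns, then ALLELE:FREQ fields).

```python
import io


def rare_variant_positions(lines, low=0.0, high=0.005):
    """Yield (chrom, pos) for rows where any allele frequency lies in (low, high)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header row, rows marked "NA", and rows with no frequency columns.
        if line.startswith("CHROM") or fields[0] == "NA" or len(fields) < 5:
            continue
        # Assumed layout: columns 5 onwards look like "ALLELE:FREQ".
        freqs = (float(col.split(":")[1]) for col in fields[4:])
        if any(low < f < high for f in freqs):
            yield fields[0], int(fields[1])


# Made-up demo input standing in for the real frequency file.
demo = io.StringIO(
    "CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:FREQ}\n"
    "1\t100\t2\t200\tA:0.998\tG:0.002\n"
    "1\t200\t2\t200\tC:0.5\tT:0.5\n"
)
hits = list(rare_variant_positions(demo))
print(hits)
```

In the real script the generator would be consumed directly, for example by looping over rare_variant_positions(open(freq_dir_file)) and calling tabixfile.fetch(chrom, pos - 1, pos) for each pair, so only one position is held in memory at a time. If memory still grows with this version, the list was not the cause.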
Do you know at which line in the code the memory leak occurs? And how many items do you expect in snp_pos? Oh, and can you give us the log of the error?

Try filing an issue.
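
To find out which line is allocating the memory, something like the following sketch may help. It uses the standard-library tracemalloc module, so it assumes Python 3.4 or later (the script above looks like Python 2). The helper top_allocations and the stand-in build_positions are hypothetical names for illustration, not part of pysam.

```python
import tracemalloc


def top_allocations(work, limit=3):
    """Run work() under tracemalloc and return its result plus the largest allocation sites."""
    tracemalloc.start()
    try:
        result = work()
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Statistics grouped by source line, largest total size first.
    return result, snapshot.statistics("lineno")[:limit]


def build_positions(n=100000):
    # Hypothetical stand-in for the loop that fills snp_pos.
    return ["1 %d" % i for i in range(n)]


positions, stats = top_allocations(build_positions)
for stat in stats:
    print(stat)
```

Note that tracemalloc only sees memory allocated through Python's own allocators. Memory allocated directly inside the tabix C code would not appear here, so if these numbers stay small while the process keeps growing, that points towards the C layer rather than snp_pos.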