Hey everybody. I have written a custom Python script (below), but it is too slow and I need pointers on where the code can be improved.
What this script is doing:
I have two files: the first is a large fastq file (single-end reads) and the second is a text file with all of the header lines from the fastq file that I want to keep. The text file and fastq file are in the same order, with some of the fastq reads not having a corresponding header line in the text file. The script walks down both files and writes the desired reads to a new file. It works, but it is too slow: my large fastq file is ~20 GB (~100 million reads), and based on a run on a small sample my current estimate is ~20 days for it to finish. As someone without much experience with big data, I'm sure there is something obvious I could be doing to speed this up; any help is appreciated.
#!/usr/bin/python3
import sys
from itertools import islice
# first input (sys.argv[1]) is the original fq file name
# second input (sys.argv[2]) is the quality threshold used (e.g. '30')
print("\nentering python script")
header_file_name = sys.argv[2] + "_good_read_headers.txt"
newfile_name = sys.argv[2] + "_good_reads.fastq"
print("...making new fastq file for: " + newfile_name)
file_iterator = 0
head_iterator = 0
while True:
    four_line_fq, header = [], []
    with open(sys.argv[1], 'r') as name_original:  # opening original fq file
        lines_gen = islice(name_original, file_iterator * 4, file_iterator * 4 + 4)
        for i in lines_gen:
            four_line_fq.append(i)
    with open(header_file_name, 'r') as name_header:  # opening good header names only file
        headers_gen = islice(name_header, head_iterator, head_iterator + 1)
        for i in headers_gen:
            header.append(i)
    if len(four_line_fq) == 0:  # if at end of original fq file
        print("...end of original fastq file reached, done writing")
        break
    if header and four_line_fq[0] == header[0]:  # guard in case the header file runs out first
        with open(newfile_name, 'a+') as new:  # append this record to the output file
            for i in four_line_fq:
                new.write(i)
        head_iterator += 1
    file_iterator += 1
print("...exiting python script\n")
Completely agree with Ram. Quickly jump to Biopython.
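For example, a minimal sketch of what the Biopython route could look like; the set-of-headers approach, the argv layout, and the assumption that each line of the header text file is a full fastq header (leading "@" included) are mine, not from the original post:

# hedged sketch: collect wanted read IDs in a set, then filter with Bio.SeqIO;
# assumes the whole list of wanted IDs fits in memory
import sys
from Bio import SeqIO

fastq_in, header_list, fastq_out = sys.argv[1], sys.argv[2], sys.argv[3]  # placeholder argument order

wanted = set()
with open(header_list) as handle:
    for line in handle:
        name = line.strip().lstrip("@")
        if name:
            wanted.add(name.split()[0])  # keep only the read ID, which is what record.id holds

kept = SeqIO.write(
    (rec for rec in SeqIO.parse(fastq_in, "fastq") if rec.id in wanted),
    fastq_out,
    "fastq",
)
print(kept, "reads written")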
Your strategy of re-opening the files for each slice is very time-consuming. You should open each input file once and then iterate through the lines. The same goes for the output file, which should be opened just once.
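Something like this single-pass sketch, which keeps your own assumptions (both files in the same order, every header in the text file also present in the fastq, well-formed 4-line records); the argv layout and names are just placeholders:

# hedged sketch: open each file once and walk both in a single pass
import sys

fastq_in, header_list, fastq_out = sys.argv[1], sys.argv[2], sys.argv[3]

with open(fastq_in) as fq, open(header_list) as heads, open(fastq_out, "w") as out:
    wanted = heads.readline()                  # next header we want to keep
    for header in fq:                          # first line of each 4-line record
        seq = next(fq)
        plus = next(fq)
        qual = next(fq)
        if wanted and header == wanted:        # exact match, trailing newline included
            out.write(header + seq + plus + qual)
            wanted = heads.readline()          # advance; returns "" once exhausted
    print("...done writing", fastq_out)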
Why are you writing a parser yourself? There are multiple tools that do this. IMO, people should stop writing stuff from scratch unless they have an extremely good reason to do so.
Although I agree that using well-tested and optimized modules is the way to go, a good reason to write your own parser would be to learn how to do the programming yourself rather than relying on other people's code...
For tasks like these that are widely performed on large files, I guess I prioritize getting stuff done over learning the efficiency intricacies. But that's my bias against re-inventing the wheel.