I wrote a Python script to trim any 3' A nucleotides from all reads in a fastq file (this is necessary for particular samples due to the library prep method). The script works, but it's very, very slow. Any ideas as to how to speed it up? I suspect the step where the trimmed read is appended to the output file might be slowing things down - the reason I append read-by-read is to avoid loading the whole fastq file into memory in one go.
from Bio import SeqIO
import re

for read in SeqIO.parse(infile, "fastq"):
    # keep trimming the 3' end until all As are gone
    while read.seq.endswith('A'):
        read = read[:-1]
    # need to update the read length in the "read description" field
    read.description = re.sub('length=[0-9]*', 'length=' + str(len(read)), read.description)
    # append the trimmed read to the output file
    with open(outfile, "a") as f:
        SeqIO.write(read, f, "fastq")
NOTE: the above code describes the part of the script that actually does the business - I can post the rest of it (containing argument parsing etc) if desired.
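One idea I've considered but haven't benchmarked yet: keeping a single output handle open for the whole loop, so the file isn't reopened and re-appended for every single read. A rough sketch of what I mean is below (infile and outfile still come from the argument-parsing code I haven't posted), and as far as I understand SeqIO.parse is a generator, so reads would still be streamed one at a time rather than the whole fastq being held in memory:

    from Bio import SeqIO
    import re

    # infile / outfile are set by the argument-parsing part of the script (not shown)
    with open(outfile, "w") as f:
        for read in SeqIO.parse(infile, "fastq"):
            # keep trimming the 3' end until all As are gone
            while read.seq.endswith('A'):
                read = read[:-1]
            # update the read length in the description field
            read.description = re.sub('length=[0-9]*', 'length=' + str(len(read)), read.description)
            # write through the single open handle instead of reopening the file
            SeqIO.write(read, f, "fastq")

(Note this opens the output in "w" mode once, rather than appending, so it would also overwrite any output left over from a previous run.) Would that alone be enough to fix the slowness, or is the bottleneck likely to be somewhere else?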
Thanks in advance!
What's the format of your fastq headers?