Question

Python script to trim 3' A nucleotides running slow

0

Entering edit mode

9.2 years ago

dr_bantz ▴ 110

I wrote a python script to trim any 3' nucleotides from all reads in a fastq file (this is necessary for particular samples due to the library prep method). The script works, but it's very very slow. Any ideas as to how to speed it up? I suspect the step where the trimmed read is appended to the ouput file might be slowing things down - the reason I do this is to avoid loading the whole fastq file into memory in one go.

for read in SeqIO.parse(infile, "fastq"):
    # keep trimming 3' end until all As are gone
    while read.seq.endswith('A'):
        read = read[:-1]
    # need to update the read length in the "read description" field
    read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
    # append read to output file
    with open(outfile, "a") as f:
        SeqIO.write(read, f, "fastq")

NOTE: the above code describes the part of the script that actually does the business - I can post the rest of it (containing argument parsing etc) if desired.

Thanks in advance!

python • 2.9k views

ADD COMMENT • link updated 9.2 years ago by John 13k • written 9.2 years ago by dr_bantz ▴ 110

0

Entering edit mode

What's the format of your fastq headers?

ADD REPLY • link 9.2 years ago by Damian Kao 16k

1

Entering edit mode

9.2 years ago

liangjiao.xue ▴ 100

Regular expression would be a good solution to avoid checking each nt.

import re  
pattern = re.compile(r"A+$") # put this pattern out side of the loop  

with open(outfile, "w") as f:  # avoid open files in the loop if not necessary   
   for read in SeqIO.parse(infile, "fastq"):    
        # match As the 3' end and trim them  
         m = pattern.search(read)    
         read = read[0:m.start()]    

        # need to update the read length in the "read description" field
        read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
        # append read to output file    
        SeqIO.write(read, f, "fastq")

ADD COMMENT • link 9.2 years ago by liangjiao.xue ▴ 100

1

Entering edit mode

9.2 years ago

John 13k

Although the other answers totally solve your issue, the reason it was slow was 100% because you open the file for appending for every read.

You won't find someone who dislikes unnecessary use of regex (and python modules like SeqIO for really basic things) more than I -- however, credit where credit is due, those guys are not the issue here at all. The only modification you really need to make was to go from:

for read in SeqIO.parse(infile, "fastq"):
    while read.seq.endswith('A'): read = read[:-1]
    read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
    with open(outfile, "a") as f:
        SeqIO.write(read, f, "fastq")

to

with open(outfile, "a") as f:
    for read in SeqIO.parse(infile, "fastq"):
        while read.seq.endswith('A'): read = read[:-1]
        read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
        SeqIO.write(read, f, "fastq")

ADD COMMENT • link 9.2 years ago by John 13k

score 3 · Accepted Answer · 2016-05-13

3

Entering edit mode

9.2 years ago

Damian Kao 16k

Slowdown is probably in the regex and the with open(outfile) statement.

I would declare an outfile outside of the loop and use the same file variable for writing.

I am not sure what your header format is, but I assume it is something like: @readID length=xx field1=xx field2=xx...

Here is a modified version of the code: (don't need biopython for parsing and writing fq files. It's not very hard to parse)

f = open(outFile,'a')
for header in fqFile:
    seq = fqFile.next().strip()
    fqFile.next()
    qual = fqFile.next().strip()

    while seq[-1] == 'A':
        seq = seq[:-1]

    x = header.split('length=')
    header = x[0] + 'length=' + str(len(seq)) + ' ' + ' '.join(x[1].split()[1:])

    f.write(header + '\n' + seq + '\n+\n' + qual)

f.close()

ADD COMMENT • link 9.2 years ago by Damian Kao 16k

0

Entering edit mode

Awesome, it's turbo-fast now! I just had to add an extra '\n' at the end of the f.write statement (and of course I added fqFile = open(in_file, 'r') before the for loop). Thanks a lot!

ADD REPLY • link 9.2 years ago by dr_bantz ▴ 110

0

Entering edit mode

If there's a lot of trimming going on, then allocating a new array each time you trim and A will be costly. Maybe keeping track of the trimming index would be more efficient? Something like:

for i, base in enumerate(seq[::-1]):
  if base != 'A':
    seq = seq[:i]

ADD REPLY • link 9.2 years ago by Matt Shirley 10k