Python script to trim 3' A nucleotides running slow
3
0
Entering edit mode
8.6 years ago
dr_bantz ▴ 110

I wrote a python script to trim any 3' nucleotides from all reads in a fastq file (this is necessary for particular samples due to the library prep method). The script works, but it's very very slow. Any ideas as to how to speed it up? I suspect the step where the trimmed read is appended to the ouput file might be slowing things down - the reason I do this is to avoid loading the whole fastq file into memory in one go.

for read in SeqIO.parse(infile, "fastq"):
    # keep trimming 3' end until all As are gone
    while read.seq.endswith('A'):
        read = read[:-1]
    # need to update the read length in the "read description" field
    read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
    # append read to output file
    with open(outfile, "a") as f:
        SeqIO.write(read, f, "fastq")

NOTE: the above code describes the part of the script that actually does the business - I can post the rest of it (containing argument parsing etc) if desired.

Thanks in advance!

python • 2.5k views
ADD COMMENT
0
Entering edit mode

What's the format of your fastq headers?

ADD REPLY
3
Entering edit mode
8.6 years ago

Slowdown is probably in the regex and the with open(outfile) statement.

I would declare an outfile outside of the loop and use the same file variable for writing.

I am not sure what your header format is, but I assume it is something like: @readID length=xx field1=xx field2=xx...

Here is a modified version of the code: (don't need biopython for parsing and writing fq files. It's not very hard to parse)

f = open(outFile,'a')
for header in fqFile:
    seq = fqFile.next().strip()
    fqFile.next()
    qual = fqFile.next().strip()

    while seq[-1] == 'A':
        seq = seq[:-1]

    x = header.split('length=')
    header = x[0] + 'length=' + str(len(seq)) + ' ' + ' '.join(x[1].split()[1:])

    f.write(header + '\n' + seq + '\n+\n' + qual)

f.close()
ADD COMMENT
0
Entering edit mode

Awesome, it's turbo-fast now! I just had to add an extra '\n' at the end of the f.write statement (and of course I added fqFile = open(in_file, 'r') before the for loop). Thanks a lot!

ADD REPLY
0
Entering edit mode

If there's a lot of trimming going on, then allocating a new array each time you trim and A will be costly. Maybe keeping track of the trimming index would be more efficient? Something like:

for i, base in enumerate(seq[::-1]):
  if base != 'A':
    seq = seq[:i]
ADD REPLY
1
Entering edit mode
8.6 years ago
liangjiao.xue ▴ 100

Regular expression would be a good solution to avoid checking each nt.

import re  
pattern = re.compile(r"A+$") # put this pattern out side of the loop  

with open(outfile, "w") as f:  # avoid open files in the loop if not necessary   
   for read in SeqIO.parse(infile, "fastq"):    
        # match As the 3' end and trim them  
         m = pattern.search(read)    
         read = read[0:m.start()]    

        # need to update the read length in the "read description" field
        read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
        # append read to output file    
        SeqIO.write(read, f, "fastq")
ADD COMMENT
1
Entering edit mode
8.6 years ago
John 13k

Although the other answers totally solve your issue, the reason it was slow was 100% because you open the file for appending for every read.

You won't find someone who dislikes unnecessary use of regex (and python modules like SeqIO for really basic things) more than I -- however, credit where credit is due, those guys are not the issue here at all. The only modification you really need to make was to go from:

for read in SeqIO.parse(infile, "fastq"):
    while read.seq.endswith('A'): read = read[:-1]
    read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
    with open(outfile, "a") as f:
        SeqIO.write(read, f, "fastq")

to

with open(outfile, "a") as f:
    for read in SeqIO.parse(infile, "fastq"):
        while read.seq.endswith('A'): read = read[:-1]
        read.description = re.sub('length=[0-9]*', 'length='+str(len(read)), read.description)
        SeqIO.write(read, f, "fastq")
ADD COMMENT

Login before adding your answer.

Traffic: 1627 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6