SeqIO.write() in biopython is extremely slow
2
0
Entering edit mode
9.0 years ago

Hi there!

I've got this strange issue with SeqIO.write() in Biopython. It is extremely slow in one of my script but not in my another script which is similar to that script. I timed the script at each stage so I knew where it is the SeqIO.write() problem.

The first script (fast) is to randomly select 500000 sequences from the fastq file (read 1). The second script is to find the matching sequences in another fastq file (read2). The first sequence took about 4 min in total for randomly selection from on average 900000 reads and to write in a file. But the second took almost 3 hours just to write in a file.

Please ignore the loop, it was just used to perform on all my files from all samples. Any help would be greatly appreciated. Thanks!

The 1st script (fast one):

#! /usr/bin/ python
#This script is to randomly select a pre-defined number of sequences from multiple fastq files (jackknifing) with python
from Bio import SeqIO
import glob
import random
import time
print time.time()
FNS=glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*")
for FN in FNS:
    SEQ_INDEX=SeqIO.index(FN,"fastq")
    SELECTED_SEQ_INDEX=random.sample(SEQ_INDEX,500000) # Randomly selected 500000 sequence index
    SELECTED_SEQ=(SEQ_INDEX[SELECTION] for SELECTION in SELECTED_SEQ_INDEX)
    FN_OUT=FN.replace(".fastq","_JACK.fastq")
    COUNT=SeqIO.write(SELECTED_SEQ,FN_OUT,"fastq")
print time.time()

The 2nd script (slow):

import time
import glob
from Bio import SeqIO
FNS_R2=glob.glob("/home/et/Desktop/PP15_META/01Geneious/*2.fastq")
FNS_R1=glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*_JACK.fastq")
N_FNS=len(FNS_R1)
for I in range(1,len(FNS_R1)):
    print time.time()
    SEL_R1=SeqIO.parse(FNS_R1[i],"fastq")
    COM_R1=[record.id.split("_")[0] for record in SEL_R1]
    print "Target number of sequences is: ", len(COM_R1)
    print time.time()
    SEL_R2 = (record for record in SeqIO.parse(FNS_R2[i],"fastq") if record.id.split("_")[0] in COM_R1)
    print time.time()
    FN_OUT=FNS_R2[i].replace(".fastq","_JACK.fastq")
    print "Start to write"
    COUNT=SeqIO.write(SEL_R2, FN_OUT, "fastq")
    print COUNT
    print time.time()
    break
Biopython • 4.6k views
ADD COMMENT
1
Entering edit mode
9.0 years ago
Peter 6.0k

If you want a reasonably fast Biopython based solution to the second task (picking out records from a FASTQ file given a list of IDs), try looking at the FASTQ part of this example:

https://github.com/peterjc/pico_galaxy/blob/master/tools/seq_select_by_id/seq_select_by_id.py

This uses SeqIO.index to scan the FASTQ file to get the identifiers and the file offsets, but never bothers to parse the FASTQ file into in-memory SeqRecord objects (which you would output using SeqIO.write). Instead it copies the raw record out of the source file into the destination file (using the get_raw method of the dictionary-like sequence file index).

ADD COMMENT
0
Entering edit mode
9.0 years ago
matted 7.8k

Biopython is convenient, but not super quick in my experience. It also keeps everything in memory, at least the way it seems that you're using it. Instead of writing your own script to do this this common task, I'd recommend using an existing and efficient tool like seqtk.

You can accomplish your core task in two lines with it:

seqtk sample input1.fq 500000 | seqtk seq -a - | grep -F ">" | cut -c 2- > readnames.txt
seqtk subseq input2.fq readnames.txt > output.fq
ADD COMMENT
0
Entering edit mode

Thanks! I'll try that. I was just curious why that happened. Two similar codes, but one is very slow. I tried to use a simple handle.write() to do the job for the slow script, but also very slow. I am start to guess it might not be the SeqIO.write() problem. I am using ubuntu in a VirtualBox.

ADD REPLY
1
Entering edit mode

I haven't looked too closely, so I might be wrong, but this line sticks out:

SEL_R2 = (record for record in SeqIO.parse(FNS_R2[I],"fastq") if record.id.split("_")[0] in COM_R1)

You're comparing every record in FNS_R2[I] to every record in COM_R1 in an inefficient (quadratic) way. If I'm right and this is the problem, you can work around it slightly by converting COM_R1 to a set first. Then your search (the in test) will be constant time instead of a linear one going through the whole list sequentially.

ADD REPLY
0
Entering edit mode

Thanks for replying. I understood the comparison issue. Since my problem is pretty simple so I did not optimize on that. And the problem of the script being slow was not due to that. I timed the script and it was just because of the SeqIO.write this single command.

Anyway, I usd seqtk and it works like a charm.

Thanks!

ADD REPLY
1
Entering edit mode

It is good that you've measured things, but you have probably been mislead. The SeqIO.write call would look slow because it includes evaluating the generator expression SEL_R2 which @matted pointed out as a possible bottleneck.

Probably the simplest change is to make COM_R1 into a Python set rather than a list, which speeds up membership testing enormously.

ADD REPLY
0
Entering edit mode

As written this is using iterators and generator expressions, so no, it does not keep everything in memory.

ADD REPLY

Login before adding your answer.

Traffic: 1494 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6