Before switching to Biopython, I thought there are similar indexing features in biopython as in bioperl. However, the biopython SeqIO.index_db() and SeqIO.index() methods are so inefficient that it's almost impossible to random access a segment of genomic sequence using biopython.
I tested the performance of biopython and bioperl in retrieving a 10bp sequence from Arabidopsis chromosome 3, here is my code:
import sys
import os.path
sys.path.append(os.environ['PYTHON_HOME'] + '/lib/python2.6/site-packages')
sys.path.append(os.environ['PYTHON_HOME'] + '/lib64/python2.6/site-packages')
from time import clock, time
from subprocess import Popen, PIPE
from Bio import SeqIO
def seqret(f_seq, id, beg, end, strand=1):
if not os.path.isfile(f_seq):
sys.exit(f_seq + " is not there")
f_idx = f_seq + ".idx"
seq_idx = SeqIO.index_db(f_idx, f_seq, "fasta")
if not id in seq_idx:
sys.exit("cannot find " + id + "in " + f_seq)
seq_obj = seq_idx[id].seq[beg-1:end]
if strand == -1 or strand == "-":
seq_obj = seq_obj.reverse_complement()
seq_str = seq_obj.tostring()
return seq_str
if __name__ == "__main__":
seq_db = "/project/youngn/zhoup/Data/misc3/spada.crp/Athaliana/01_genome/01_refseq.fa"
seq_id = "Chr3"
seq_beg = 100001
seq_end = 100010
c0, t0 = clock(), time()
print seqret(seq_db, seq_id, seq_beg, seq_end, 1)
print "process time [%.2fs], wall time [%.2fs]" % (time()-t0, clock()-c0)
c0, t0 = clock(), time()
f_perl = "/soft/bioperl/5.16.1/bin/bp_seqret.pl"
cmd = '%s -d %s -i %s -s %d -e %d' % (f_perl, seq_db, seq_id, seq_beg, seq_end)
p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
lines = p.stdout.readlines()
print lines
print "process time [%.2fs], wall time [%.2fs]" % (time()-t0, clock()-c0)
And here's the output I got:
AAAAGGAGCT
process time [1.65s], wall time [1.59s]
['>Chr3_100001-100010\n', 'AAAAGGAGCT\n']
process time [0.26s], wall time [0.00s]
So it seems biopython SeqIO.index_db() is much less efficient than bioperl, is it because it only index the sequences by IDs regardless of how large they are? I also noticed that the index file (.idx) created by biopython is much smaller than Bioperl index files (.index).
the perl script is part of BioPerl, you can check the code here (I'm to sure if it's the same version): http://www.gnu-darwin.org/www001/src/ports/biology/p5-bioperl-devel/work/bioperl-1.5.1/blib/script/bp_seqret.pl
Thanks for the answer, I guess the difference is Bioperl indexes sequences into chunks but Biopython does not. In this case, loading the entire chromosome into memory is not what I want in terms of random access.