Speed Of Efetch In Biopython
13.2 years ago
dustar1986 ▴ 380

Hi everyone,

I have a file with about 77,000 3'-UTR regions, and I used Entrez.efetch to get the sequence of each region. I find the speed slow (about 0.5 seconds per sequence).

My code is like this:

from Bio import Entrez, SeqIO
from Bio.SeqRecord import SeqRecord

Entrez.email = "A.N.Other@example.com"  # set once, not on every pass through the loop

f=open("utr3hg19.txt","r")        # open the file containing all human 3'-utr coordinates
                                  # each line contains information on one 3'-utr
                                  # columns 2,3,4,5 represent gi, strand, start, end
                                  # split by tab
data=f.readlines()
f.close()

f=open("utr3.fasta","w")          # sequences will be written into this file
for line in data[1:]:             # skip the header line
    temp=line.rstrip("\n").split("\t")
    handle = Entrez.efetch(db="nucleotide",
                           id=temp[1],
                           rettype="fasta",
                           strand=temp[2],
                           seq_start=int(temp[3]),   # column 4 = start
                           seq_stop=int(temp[4]))    # column 5 = end
    record = SeqIO.read(handle, "fasta")
    handle.close()
    r=SeqRecord(record.seq, id=line.strip(), description="")  # strip the newline so the header stays on one line
    SeqIO.write([r], f, "fasta")
f.close()

Is that due to my bad coding, or is it a network problem...? BTW, I run the code on a 12-core Linux server.

biopython sequence retrieval eutils • 6.0k views
13.2 years ago
Peter 6.0k

I think trying to make 77,000 calls to EFetch is in danger of breaching the NCBI usage guidelines. Make sure you do NOT run any script making more than 100 Entrez calls during USA office hours. Otherwise you may be banned by the NCBI.

For the sake of argument, let's suppose each query is instantaneous. You are still limited to 3 queries per second, so 77,000 calls will need over 7 hours (77,000 ÷ 3 ≈ 25,700 seconds)!
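If you do stay with per-region eFetch calls, a minimal throttled loop might look like this (the 0.34 s delay is my own choice to stay under that limit; recent Biopython releases also rate-limit Entrez calls themselves):

import time
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

def fetch_throttled(gi_list, delay=0.34):
    """Yield FASTA text for each GI, pausing to stay under 3 requests/second."""
    for gi in gi_list:
        handle = Entrez.efetch(db="nucleotide", id=gi,
                               rettype="fasta", retmode="text")
        text = handle.read()
        handle.close()
        time.sleep(delay)  # assumed pause; keeps the loop below the NCBI limit
        yield text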

In this situation, I would download the hg19 chromosomes (as FASTA or even GenBank) and extract the subsequences locally. OK, the initial download will take a little while, but the extraction script will be MUCH faster.
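To make that concrete, here is a rough sketch of the local approach (the "hg19.fa" file name and the "+"/"-" strand convention are my assumptions, not something from the question):

from Bio import SeqIO

# Load every chromosome into memory once (hg19 is ~3 GB of sequence;
# swap in SeqIO.index("hg19.fa", "fasta") if RAM is tight).
chroms = SeqIO.to_dict(SeqIO.parse("hg19.fa", "fasta"))

def extract_utr(chrom_id, start, end, strand):
    """Slice one 3'-UTR out of a chromosome (1-based, inclusive coordinates)."""
    seq = chroms[chrom_id].seq[start - 1:end]
    return seq.reverse_complement() if strand == "-" else seq

After the one-off download, every extraction is an in-memory slice, so 77,000 regions take seconds rather than hours.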

Thanks a lot, Peter. I used to search locally. Yesterday I suddenly wondered whether this could be done with Biopython via NCBI, in case I come across a species whose genome isn't stored locally (I'm too lazy to download a genome :( )

13.2 years ago
Leszek 4.2k

I recommend using ePost instead of eFetch. You submit all your GIs at once, literally one query instead of 77 thousand, and then you can get your results back in batches (100 sequences at a time, or more).

Once you have downloaded all your GI entries, you have to parse them and keep only the pieces you need (I believe you can write the Biopython code yourself). I've been using this method to download all proteins from a particular species, and I assure you it's very fast (several thousand sequences per minute). It strongly depends on the NCBI load at a given moment and your internet connection, of course ;)
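Something along these lines (a sketch only; the batch size, file names and GI column are assumptions carried over from the question):

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

with open("utr3hg19.txt") as f:
    gis = [line.split("\t")[1] for line in f.readlines()[1:]]  # GI column

# One ePost uploads the whole ID list to the NCBI History server...
post = Entrez.read(Entrez.epost("nucleotide", id=",".join(gis)))
webenv, query_key = post["WebEnv"], post["QueryKey"]

# ...then eFetch pulls the records back down in batches.
batch_size = 100
out = open("utr3_full.fasta", "w")
for start in range(0, len(gis), batch_size):
    handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text",
                           retstart=start, retmax=batch_size,
                           WebEnv=webenv, query_key=query_key)
    out.write(handle.read())
    handle.close()
out.close()

The full records land in utr3_full.fasta; slicing them down to the 3'-UTR coordinates then happens locally.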

Thanks, Leszek. ePost is great. I'm quite new to Biopython and should read more of its handbook. Sorry to trouble you.
