Hi everyone,
I have a file with about 77,000 3'-UTR regions, and I used Entrez.efetch to get the sequence of each region. I find the speed slow (about 0.5 s per sequence).
My code is like:
from Bio import Entrez, SeqIO
from Bio.SeqRecord import SeqRecord

Entrez.email = "A.N.Other@example.com"  # set once, not inside the loop

# utr3hg19.txt holds all human 3'-UTR coordinates, one per line,
# tab-separated; columns 2,3,4,5 are gi, strand, start, end
with open("utr3hg19.txt") as f:
    data = f.readlines()

# sequences will be written into this file
with open("utr3.fasta", "w") as out:
    for line in data[1:]:  # skip the header line
        temp = line.rstrip("\n").split("\t")
        handle = Entrez.efetch(db="nucleotide",
                               id=temp[1],
                               rettype="fasta",
                               strand=temp[2],
                               seq_start=int(temp[3]),  # column 4 = start
                               seq_stop=int(temp[4]))   # column 5 = end
        record = SeqIO.read(handle, "fasta")
        handle.close()
        # use a tab-free id so the FASTA header stays on one line
        r = SeqRecord(record.seq, id="%s:%s-%s" % (temp[1], temp[3], temp[4]),
                      description="")
        SeqIO.write([r], out, "fasta")
Is that due to my bad coding, or is it a network problem? BTW, I run the code on a 12-core Linux server.
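If it helps: the ~0.5 s is mostly the network round trip for each efetch call, not CPU, so extra cores won't speed it up; the usual fix is fewer, larger requests. Since hg19 3'-UTRs sit on only a couple of dozen chromosome records, one option is to fetch each gi once and slice the regions out locally. Here is a minimal sketch of the grouping/slicing side (pure Python; the 1-based inclusive coordinates, column order, and "2" meaning minus strand are assumptions taken from your description):

```python
from collections import defaultdict

_COMP = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    # reverse complement, used for minus-strand regions
    return seq.translate(_COMP)[::-1]

def slice_region(seq, start, end, strand):
    # start/end are 1-based inclusive, as in the UTR file;
    # strand "2" means minus strand (NCBI convention)
    sub = seq[start - 1:end]
    return revcomp(sub) if strand == "2" else sub

def group_by_gi(lines):
    # group (strand, start, end, original line) tuples by gi,
    # so each gi only needs to be fetched once
    regions = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        gi, strand = cols[1], cols[2]
        start, end = int(cols[3]), int(cols[4])
        regions[gi].append((strand, start, end, line.strip()))
    return regions
```

With the regions grouped this way, you would call `Entrez.efetch(db="nucleotide", id=gi, rettype="fasta", retmode="text")` once per gi, read it with `SeqIO.read`, and run `slice_region(str(record.seq), ...)` for every region of that gi. If your gis really are chromosome records, that is roughly 25 requests instead of 77,000.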
+1 and also see this post: Batch Fetching Fasta Sequences From Bed File
Thanks a lot, Peter. I used to search locally. Yesterday I suddenly wondered whether this could be done with Biopython from NCBI, in case I come across a species whose genome is not stored locally (I'm too lazy to download the genome. :( )