I have a multi-FASTA file with ~125 protein sequences, and I need to run a BLASTP search against the remote nr database. I tried using NcbiblastpCommandline, but the issue is that it only accepts files as input. Since my file has so many sequences, I get this error: ERROR: An error has occurred on the server, [blastsrv4.REAL]:Error: CPU usage limit was exceeded, resulting in SIGXCPU (24). Writing each sequence from the multi-FASTA file to a separate file, one at a time, works, but then the BLAST search becomes tremendously slow (~10 min/query on average, as opposed to ~1 min/query on the NCBI site).
from Bio import SeqIO
from Bio.Blast.Applications import NcbiblastpCommandline

blastp_results = []
for record in SeqIO.parse("AmpB_DEPs.fasta", "fasta"):
    # Write the current record to a temporary single-sequence FASTA file.
    with open("test.txt", "w") as handle:
        handle.write(">" + record.description + "\n" + str(record.seq) + "\n")
    # Search nr on the NCBI servers; remote=True adds the -remote switch.
    blastp_cline = NcbiblastpCommandline(query="test.txt", db="nr", remote=True,
                                         evalue=0.05,
                                         outfmt="7 sseqid evalue qcovs pident")
    res = blastp_cline()  # returns a (stdout, stderr) tuple
    blastp_results.append(res)
I also tried using NCBIWWW.qblast, but it doesn't seem to provide query coverage information in the output, something which is important for my study.
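For reference, query coverage can be approximated from the HSP coordinates that the XML output does report. The sketch below is not BLAST's own implementation: it assumes you have already collected the 1-based, inclusive query start and end of every HSP for one subject, plus the query length, and it merges overlapping ranges before dividing by the query length. The function name query_coverage is hypothetical, and the result may differ slightly from the rounded qcovs value that BLAST prints.

```python
def query_coverage(hsp_ranges, query_length):
    """Percent of query positions covered by at least one HSP.

    hsp_ranges: iterable of (start, end) query coordinates,
                1-based and inclusive (an assumption of this sketch).
    query_length: length of the query sequence in residues.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")

    # Normalise each range so start <= end, then sort by start position.
    intervals = sorted((min(s, e), max(s, e)) for s, e in hsp_ranges)

    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # No overlap with the running interval: count it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the running interval if this range reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1

    return 100.0 * covered / query_length


if __name__ == "__main__":
    # Two overlapping HSPs covering residues 1-100 of a 200-residue query.
    print(query_coverage([(1, 50), (40, 100)], 200))
```

Merging the ranges first matters because several HSPs against the same subject often overlap on the query; summing their lengths directly would overstate the coverage.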
Can somebody suggest a way to deal with this issue without compromising on the search space or the default parameters of BLAST? Suggestions on running BLAST from other languages such as Perl or R would also be appreciated.
Public resources are there for all to share in a fair manner. What you are trying to do exceeds the limits of what NCBI considers fair use. You can either be patient and wait for your results, or invest in a cloud computing environment and run the search on a VM with multiple cores and plenty of RAM to be done sooner. If you try to defeat the guards put in place by NCBI, you may get your IP banned.
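If you go the VM route, the whole multi-FASTA file can be passed to a local blastp in one run, keeping the same tabular output with qcovs. The following is a minimal sketch that only builds the command line; it assumes BLAST+ is installed on the VM and that a copy of nr has already been downloaded. The helper name build_local_blastp_command, the database path /data/blastdb/nr and the output file name are hypothetical placeholders.

```python
import shlex


def build_local_blastp_command(query_fasta, db_path, out_path,
                               threads=8, evalue=0.05):
    """Return the argv list for a local, multi-threaded blastp run."""
    return [
        "blastp",
        "-query", query_fasta,            # multi-FASTA input, all queries at once
        "-db", db_path,                   # local copy of nr (hypothetical path)
        "-evalue", str(evalue),
        "-outfmt", "7 sseqid evalue qcovs pident",
        "-num_threads", str(threads),     # use the VM's cores
        "-out", out_path,
    ]


if __name__ == "__main__":
    cmd = build_local_blastp_command(
        "AmpB_DEPs.fasta", "/data/blastdb/nr", "AmpB_DEPs_blastp.tsv"
    )
    # Print the command for inspection; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems with the multi-word -outfmt value when it is handed to subprocess.run.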