I have a Python script that runs BioPython's Web Blast function. We're using large fasta files, so the script breaks these files into smaller files and then blasts them. Below is a partial query:
The frustrating part is that some of the smaller files work and some don't, even though they appear to be in exactly the same format. Plus, the failed file works when submitted through the NCBI BLAST web page. Below is the error message I get when it is run from my script:
where MEGA_BLAST is a boolean. Does anyone have any idea why it would fail? The input string, as far as I can tell, is fine. I have no idea why this is occurring.
If you BLAST the sequences manually on the NCBI BLAST site they all get the following result:
"No significant similarity found. For reasons why,click here."
Below are common reasons that a BLAST search results in the "No significant similarity
found" message.
Short query sequences: Short alignments may have Expect values above the default
threshold, which is 10 on most pages, and, therefore, are not displayed. Try increasing
the Expect threshold (under 'Algorithm parameters'). Also, see the FAQ Submitting
primers or other short sequences.
So one possibility is that your alignments are too short, and that this condition is reported back to your script as error code 1.
EDIT:
I see that you're not only breaking up one file of fasta sequences into smaller files of complete fasta sequences, you are also chopping up the genes themselves
(e.g.)
From this example, 1, 2 and 5 can be found, the others cannot. Again, I don't know whether
"No significant similarity found. For reasons why, click here" in the NCBI web application shows up as error(1) in the Biopython BLAST call, but that would be my safest bet. Try keeping the sequences intact.
Because of short query sequences? No, that can't be it, considering I've tested the exact same program with a test fasta file, one that has some sequences that are only 3 characters in length, and it ran just fine.
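On the earlier suggestion to keep the sequences intact: a minimal sketch of splitting a large FASTA file into smaller files by whole records, rather than by cutting the sequences themselves. The helper name `batch_iterator`, the batch size and the file names are illustrative assumptions, not taken from the original script.

```python
from itertools import islice


def batch_iterator(iterator, batch_size):
    """Yield lists of up to batch_size items; no item is ever split."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage with Biopython (file names are placeholders):
# from Bio import SeqIO
# records = SeqIO.parse("large_input.fa", "fasta")
# for i, batch in enumerate(batch_iterator(records, 100)):
#     SeqIO.write(batch, "chunk_%i.fa" % i, "fasta")
```

Because the batching works on whole records, every chunk file contains only complete FASTA entries.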
Thanks for sharing the problem input file. This is my testing with the latest Biopython:
from Bio import SeqIO
from Bio.Blast import NCBIWWW
# Submit each record in the FASTA file to QBLAST, one query at a time
for record in SeqIO.parse("permutations91.fa", "fasta"):
    print("%s length %i" % (record.id, len(record)))
    result_handle = NCBIWWW.qblast("blastn", "nr", record.format("fasta"), megablast=False)
This is a simple silly script which calls QBLAST but ignores the results. The output:
No errors. I checked the last result and it looked like proper XML. Niek de Klein pointed out some other possible failure causes, but they don't seem to apply here. Potentially your issue is a simple network failure, and a try/except could be used to repeat the query?
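The try/except idea could look roughly like the sketch below. The helper `call_with_retries`, the retry count and the delay are assumptions made for illustration. It catches any `Exception` because the exact exception raised by the failing query is not shown in this thread.

```python
import time


def call_with_retries(func, *args, retries=3, delay=5.0, **kwargs):
    """Call func(*args, **kwargs); on failure wait and try again.

    Re-raises the last exception once all attempts have failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as err:  # assumption: any failure is worth a retry
            if attempt == retries:
                raise
            print("Attempt %i failed (%s); retrying in %.0f s" % (attempt, err, delay))
            time.sleep(delay)


# Hypothetical usage with the QBLAST call from the script above:
# from Bio.Blast import NCBIWWW
# result_handle = call_with_retries(
#     NCBIWWW.qblast, "blastn", "nr", record.format("fasta"), megablast=False
# )
```

If a given file fails on every attempt, the problem is more likely the query itself than the network.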
However, if you are using this script on large FASTA files, you would be much better off downloading the NR database and standalone BLAST and running this locally.
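For the local route, a rough sketch of calling the standalone BLAST+ `blastn` program from Python. It assumes BLAST+ is installed and on the PATH and that a nucleotide database has been downloaded and formatted locally. The helper names, the database name "nt" and the file names are placeholders.

```python
import subprocess


def build_blastn_command(query_fasta, db, out_xml, megablast=False):
    """Build the argument list for a standalone BLAST+ blastn run."""
    task = "megablast" if megablast else "blastn"
    return [
        "blastn",
        "-task", task,
        "-query", query_fasta,
        "-db", db,
        "-outfmt", "5",  # XML output, the same format QBLAST returns by default
        "-out", out_xml,
    ]


def run_local_blastn(query_fasta, db, out_xml, megablast=False):
    """Run blastn locally; raises CalledProcessError on a non-zero exit."""
    cmd = build_blastn_command(query_fasta, db, out_xml, megablast)
    subprocess.run(cmd, check=True)
    return out_xml


# Hypothetical usage (paths are placeholders):
# run_local_blastn("chunk_0.fa", "nt", "chunk_0.xml")
```

The XML written to the output file should be readable with `Bio.Blast.NCBIXML`, the same way as the QBLAST results.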
I had thought it could be a network problem, except the same program works as expected with a different fasta file. I think I will try the local blast though.
You should post one of the smaller files that fail to a location that we can download it from.
So where did you find that command line? blastn doesn't work with nr.
You're wrong Michael, QBLAST with "blastn" DOES work with nr - see below. It is probably treated as an alias for nt, given that the NCBI refers to it as "Nucleotide collection (nt/nr)" on the BLASTN website. It is surprising though, as NR normally means the protein database.
I've updated the Biopython Tutorial to use "nt" rather than "nr" to avoid the confusion - thanks for flagging this Michael: https://github.com/biopython/biopython/commit/60fed13c350ab8e3f2e79b69d490b0701a1b2540