Hi all
Context: I'm running a bidirectional best BLAST hit analysis between a transcriptome assembly (around 49,000 sequences) and the KEGG ag protein dataset. So I have to run a blastx and then a tblastn and compare the results.
I started by running the blastx and noticed after an hour or so that my output file was not growing. I thought this was strange, so I wrote a simple script to monitor my BLAST run and how long each sequence takes to be analyzed (the script simply copies one sequence to a temporary file, runs BLAST on it, and dumps the output in another file).
As I suspected, some sequences got "stuck" and did not return results even after ~6 hours of computation. The problem does not seem to be in the sequence itself, as using the NCBI BLAST service with these sequences and the same database gives results in seconds.
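For reference, a per-sequence monitor of the kind described above could look roughly like the Python sketch below. It is not the original script: the function names (`read_fasta`, `time_command_per_record`, `blastx_command`) and the timeout are assumptions made for illustration, and the BLAST output is captured in memory rather than written to a file. Each record is written to a temporary FASTA file, the command is run on it with a time limit, and the elapsed time and status are recorded.

```python
import subprocess
import tempfile
import time
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def time_command_per_record(fasta_path, build_command, timeout_s):
    """Run one command per FASTA record; return (id, seconds, status) tuples.

    build_command(query_path) must return the argument list to execute.
    Status is "ok", "timeout" (limit exceeded) or "error" (non-zero exit).
    """
    results = []
    for header, seq in read_fasta(fasta_path):
        seq_id = (header.split() or ["unnamed"])[0]
        with tempfile.TemporaryDirectory() as tmp:
            # Copy the single record to its own temporary query file.
            query = Path(tmp) / "query.fasta"
            query.write_text(f">{header}\n{seq}\n")
            start = time.monotonic()
            try:
                subprocess.run(build_command(str(query)), check=True,
                               capture_output=True, timeout=timeout_s)
                status = "ok"
            except subprocess.TimeoutExpired:
                status = "timeout"
            except subprocess.CalledProcessError:
                status = "error"
            results.append((seq_id, time.monotonic() - start, status))
    return results


def blastx_command(query_path):
    # Hypothetical helper: same database and core options as the blastx
    # call in this post, with -out omitted so results go to stdout.
    return ["blastx", "-query", query_path, "-db", "db/KEGG_agProteins",
            "-outfmt", "6", "-evalue", "1e-10",
            "-max_target_seqs", "1", "-max_hsps", "1"]
```

Calling `time_command_per_record("Out2RefSeq.fasta", blastx_command, timeout_s=600)` would then list which sequence IDs hit the time limit, so the stuck queries can be pulled out and examined separately.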
I'm currently running tblastn on a server and I have the same issue. The difference is that when I try to use the NCBI BLAST service, I get a 502 Bad Gateway error.
How can I solve this? Ideally, I would like to be able to rely on my local BLAST alone.
Here is the code I used:
##Blastx
makeblastdb -in KEGG_agProteins.fasta -out db/KEGG_agProteins -dbtype prot
blastx -query Out2RefSeq.fasta -db db/KEGG_agProteins -outfmt 6 -out output/BLASTxout_idio2_agKO.txt -max_target_seqs 1 -max_hsps 1 -use_sw_tback -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -num_threads 4
## tblastn
makeblastdb -in C:\Users\Faculdade\Desktop\Dissertação\Dados_Illumina\expFolhas_inOut_Joana\RefSeq_outsideLeaves\Out2RefSeq\Out2RefSeq.fasta -out db\OutLeafrefseq -dbtype nucl
tblastn -query KEGG_agProteins.fasta -db db/OutRefSeq -outfmt 6 -out output/tBLASTn_agKO_OuLeafRefSeq.txt -max_target_seqs 1 -max_hsps 1 -use_sw_tback -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -num_threads 4
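Once both searches finish, the comparison step could be done along the lines of the sketch below. It assumes the default 12-column `-outfmt 6` layout (query ID in column 1, subject ID in column 2, bit score in column 12) and takes the highest bit score as the best hit; the function names are hypothetical, and this is one common way to define reciprocal best hits rather than the only one.

```python
import csv


def best_hits(path):
    """Map each query ID to its highest-bitscore subject in an outfmt 6 file."""
    best = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 12:
                continue  # skip blank or malformed lines
            query, subject, bitscore = row[0], row[1], float(row[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(blastx_path, tblastn_path):
    """Return sorted (transcript, protein) pairs that are each other's best hit."""
    forward = best_hits(blastx_path)    # transcript -> protein
    reverse = best_hits(tblastn_path)   # protein -> transcript
    return sorted((t, p) for t, p in forward.items() if reverse.get(p) == t)
```

For the files named above, this would be called as `reciprocal_best_hits("output/BLASTxout_idio2_agKO.txt", "output/tBLASTn_agKO_OuLeafRefSeq.txt")`.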
I think this needs to be tracked down by investigating both the database and the sequence that "gets stuck".
I believe that the blast server that runs as a web service at NCBI is not the same code that you run at the command line. Both implement the same algorithm of course. Thus it is possible to have bugs and weird behaviours that affect only the command line version.
Put your BLAST database and the problematic sequences somewhere we can download them from and we'll test it out.
See if this link works: https://drive.google.com/drive/folders/17fIaYCkmk34thc7pPuS8z0sWHsIJg4h0?usp=sharing ("problematic_sequence.fasta" as the query and "KEGG_agProteins.fasta" as the database)