tBlastn taking an absurd amount of time
3
0
Entering edit mode
4.7 years ago

Hi all

Context: I´m running a bi-directional best blast hit with a transcriptomic assembly (around 49 000 sequences) and the KEGG ag protein dataset. So I have to run a blastx and then a tblastn and compare the results.

I started by running the blastx and I noticed after an hour or so that my output file was not growing. I thought this was strange so I made a simple script to monitor my blast and how much each sequence takes to be analyzed (this script simply copies a sequence to a temporary file, applies blast, and dumps the output in another file).

As I suspected some sequences got "stuck" and did not return results even after 6 ~ hours of computation. The problem does seem to be in the sequence itself, as using the NCBI Blast service with these sequences and the same database gives results in seconds.

I´m currently running tblastn in a server and I have the same issue. The difference is, when I try to use NCBI blast service I get a bad gateway error 502.

How can I solve this? Ideally, I would like to just need to run my local Blast

here is the code I used:

##Blastx

makeblastdb -in KEGG_agProteins.fasta -out db/KEGG_agProteins -dbtype prot

blastx -query Out2RefSeq.fasta -db db/KEGG_agProteins -outfmt 6 -out output/BLASTxout_idio2_agKO.txt -max_target_seqs 1 -max_hsps 1 -use_sw_tback -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -num_threads 4

## tblastn 

makeblastdb -in C:\Users\Faculdade\Desktop\Dissertação\Dados_Illumina\expFolhas_inOut_Joana\RefSeq_outsideLeaves\Out2RefSeq\Out2RefSeq.fasta -out db\OutLeafrefseq -dbtype nucl


tblastn -query KEGG_agProteins.fasta -db db/OutRefSeq -outfmt 6 -out output/tBLASTn_agKO_OuLeafRefSeq.txt -max_target_seqs 1 -max_hsps 1 -use_sw_tback -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -num_threads 4
BLAST Transcriptomics omics software error • 2.3k views
ADD COMMENT
0
Entering edit mode

I think this needs to be tracked down by investigating both the database and the sequence that "gets stuck".

I believe that the blast server that runs as a web service at NCBI is not the same code that you run at the command line. Both implement the same algorithm of course. Thus it is possible to have bugs and weird behaviours that affect only the command line version.

Put your blast database and the problematic sequences to some location we can download from and we'll test it out.

ADD REPLY
0
Entering edit mode

See if this link works: https://drive.google.com/drive/folders/17fIaYCkmk34thc7pPuS8z0sWHsIJg4h0?usp=sharing ("problematic_sequence.fasta as query and "KEGG_agProteins.fasta" as a database)

ADD REPLY
2
Entering edit mode
4.7 years ago

I got your data and ran it.

The problem seems to be caused by the -use_sw_tback option.

My blast (version 2.9.0) does not list that as a valid parameter, though, it will not complain if I use it. But then just like yours it will fail to produce alignments for these sequences.

ADD COMMENT
0
Entering edit mode

I have the same blast version as you and -use_sw_tback is listed as follows:

-use_sw_tback Compute locally optimal Smith-Waterman alignments?

To be honest, I´m not too sure why this option was picked and I did not know why it was a thing until two days ago. However, I´m using it as a previous analysis in my lab on another dataset also used it, and this way the results can be comparable. And since it says it calculates the local smith-waterman alignments I´m guessing it somehow incorporates this algorithm in the search itself or when reporting the score.

But indeed removing this option gives me results within seconds

ADD REPLY
2
Entering edit mode

Computing locally optimal Smith-Waterman alignments happens after the search is done, not during the search. So that option should not increase the search time at all. However, if you have long query sequences and/or lots of pairwise alignments, this option will take a while because it doesn't use any of the BLAST search heuristics.

Whether you can do without -use_sw_tback depends on what columns you are using from the BLAST output. It will almost certainly change start and end positions of pairwise alignments, but I am guessing that E-values are probably not recalculated and will stay the same. If E-values are all you need, you can test whether they change by running BLAST on the same sequence - something short that will finish quickly - with and without the -use_sw_tback option. Then compare the output and see whether anything that matters to you is different between the two runs.

ADD REPLY
1
Entering edit mode

It is an interesting case, taking just a single query sequence of 2Kb size:

time blastx -query test -db db/KEGG_agProteins -outfmt 6 | wc -l
85

produces 85 alignments and takes 0.2 seconds

time blastx -query test -db db/KEGG_agProteins -outfmt 6 -use_sw_tback | wc -l

already runs for two minutes. Will update on how long it took overall.

ADD REPLY
0
Entering edit mode

I tried to look online for documentation on the use_sw_tback but I could not find anything (not even in the blast+ manual). Do you mind telling me your reference for this info?

I´m interested in three columns in the output: the target sequence, the e-value and the bitscore. I´m running the blastx on my original input sequences (a subset which I know that work) without the -use_sw_tback to see if there is any difference. Since the bitscore is calculated based on the alignment I´m guessing it might change.

ADD REPLY
1
Entering edit mode

Here is a review that explains some of it:

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2001-2-10-reviews2002

Like it says: Like FASTA, BLAST does not allow gaps in the primary word-matching pass, but it does in the subsequent Smith-Waterman alignment stage. With regular BLAST functionality, SW alignments are calculated only on a subset of sequences that pass the initial short-word filter, and that improves the speed. Not absolutely sure, but with -use_sw_tback they are likely calculated for all sequences and that is what takes time. My guess is that both E-values and bit scores will be the same between the two runs, but it is possible that with -use_sw_tback you will recover a small additional fraction of significant hits. I suspect that the extra hits with -use_sw_tback will be less than 1%, but would love to hear the outcome if you actually compare the two comprehensively.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

oh indeed, accidentally I ran blastn not blastx when looking for help on parameters. Looks like I have that option as well.

ADD REPLY
1
Entering edit mode
4.7 years ago

Playing around a bit with the data I found a short subsequence that shows a surprising deterioration of runtime when using the -use_sw_tback parameter..

>L1OTR4035|c0_g1_i1
ggacaacagcagcatcagagatacaaggatgagtgagtaagagagcttcaagttcagcag

for this sequence, the runtime goes from 0.1 seconds to 4.4 minutes (about 3000 times slower). The resulting bitscores and E-values are different as well:

                                                                     Score        E
Sequences producing significant alignments:                          (Bits)     Value

ag:AAB26932 K05914 photinus-luciferin 4-monooxygenase (ATP-hydrol...  27.3       0.011
ag:CAA67248 K16117 pristinamycin I synthase 2 | (KEGG) snbC; pris...  22.7       0.44
ag:CAA67249 K16118 pristinamycin I synthase 3 and 4 | (KEGG) snbD...  22.3       0.61
ag:AAG02359 K16106 nonribosomal peptide synthetase protein BlmVI ...  21.9       0.74
ag:AAF62881 K16395 epothilone synthetase B | (KEGG) epoB; epothil...  21.9       0.91
ag:CAL48953 K22799 2-methyl-4-hydroxyquinoline prenyltransferase ...  21.2       1.6
ag:AAG02364 K16112 nonribosomal peptide synthetase protein BlmIV ...  20.4       2.9
ag:AAC80285 K16125 syringomycin synthetase protein SyrE | (KEGG) ...  20.4       3.4
ag:CAH55637 K21780 L-proline---[L-prolyl-carrier protein] ligase ...  20.0       4.0
ag:CAA67140 K16116 pristinamycin I synthetase 1 | (KEGG) snbA; pr...  20.0       4.5
ag:AFV52191 K21228 (R)-2-aza-beta-tyrosine adenylation enzyme [EC...  19.2       7.3
ag:AFB74617 K20621 N-methylcanadine 1-hydroxylase | (KEGG) CYP82Y...  19.2       8.7
ag:AAG02355 K16108 nonribosomal peptide synthetase protein BlmX |...  18.9       9.9

vs

ag:AAB26932 K05914 photinus-luciferin 4-monooxygenase (ATP-hydrol...  27.3       0.011
ag:AAG02355 K16108 nonribosomal peptide synthetase protein BlmX |...  23.5       0.26
ag:CAA67248 K16117 pristinamycin I synthase 2 | (KEGG) snbC; pris...  22.7       0.44
ag:CAA67249 K16118 pristinamycin I synthase 3 and 4 | (KEGG) snbD...  22.3       0.61
ag:AAG02359 K16106 nonribosomal peptide synthetase protein BlmVI ...  21.9       0.74
ag:AAF62881 K16395 epothilone synthetase B | (KEGG) epoB; epothil...  21.9       0.91
ag:CAL48953 K22799 2-methyl-4-hydroxyquinoline prenyltransferase ...  21.2       1.6
ag:AAG02364 K16112 nonribosomal peptide synthetase protein BlmIV ...  20.4       2.9
ag:AAC80285 K16125 syringomycin synthetase protein SyrE | (KEGG) ...  20.4       3.4
ag:CAH55637 K21780 L-proline---[L-prolyl-carrier protein] ligase ...  20.0       4.0
ag:CAA67140 K16116 pristinamycin I synthetase 1 | (KEGG) snbA; pr...  20.0       4.5
ag:AFV52191 K21228 (R)-2-aza-beta-tyrosine adenylation enzyme [EC...  19.2       7.3
ag:AFB74617 K20621 N-methylcanadine 1-hydroxylase | (KEGG) CYP82Y...  19.2       8.7
ADD COMMENT
0
Entering edit mode

Note that the only true difference in the output is for one sequence. It goes from:

ag:AAG02355 K16108 nonribosomal peptide synthetase protein BlmX |...  18.9       9.9

to

ag:AAG02355 K16108 nonribosomal peptide synthetase protein BlmX |...  23.5       0.26
ADD REPLY
1
Entering edit mode
4.7 years ago

Okay, adding a little bit more into Istvan response, I found differences in just 263 out of 49 000 sequences (when comparing hit sequence, bitscore, and e.value). As Mensur predicted, <1% of the results were affected and the blast finished in just a couple hours, compared with several days it took me on my first try.

Here are the first 5 or so different results. As you can see there are differences in e.values, bitscores, and even the target sequences, which was surprising to me.

results are differenet in the L1OTR710|c0_g2_i1
original file: ag:AAC78591 2.20e-74 266
modified_file ag:AAC78591 3.10e-67 244
the results are differenet in the L1OTR2341|c0_g1_i1
original file: ag:AHF22090 2.93e-14 71.2
modified_file ag:BAG68929 2.22e-12 47.4
the results are differenet in the L1OTR2341|c0_g2_i1
original file: ag:BAG68929 2.03e-13 51.2
modified_file ag:AHF22090 3.64e-19 64.7
the results are differenet in the L1OTR2364|c0_g1_i1
original file: ag:ACZ66247 8.86e-32 130
modified_file ag:ACZ66247 2.31e-31 128
the results are differenet in the L1OTR2953|c0_g1_i1
original file: ag:AAC78591 1.07e-38 149
modified_file ag:AAC15779 1.11e-38 149

Meanwhile, in my tblastn some sequences are really really big and some have even been running one day or more. I might need to remove the use_sw_tback option since it affects the results soo little and might be the only way to get results in a feasible amount of time.

Edit: I just found a papper evaluating the performance of blast with SW algorithm: https://pubmed.ncbi.nlm.nih.gov/18042555/ The conclusion sums it up pretty well:

Based on our results, the recommended parameters for the best detection of orthologs as reciprocal best hits is the combination of soft filtering with a Smith–Waterman final alignment (the -F ‘‘m S’’ -s T options in NCBI’s BLASTP). These options resulted in both the highest number of orthologs and the minimal error rates. However, most of the improvement can be achieved using soft filtering alone.

ADD COMMENT

Login before adding your answer.

Traffic: 2723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6