Question

Blastn, need help to increase speed

4.5 years ago

chiachoong_leong93 ▴ 20

Hi all, I am facing some difficulties in blasting my de novo assembled unigenes.

I have about 85000 unigenes and was planning to blast them against the preformatted nt database from the NCBI FTP site.

I used this command to download the database and check whether it is up to date:

  # update_blastdb.pl --passive --decompress nt

To blast my query (query3.fa, which has only 4 sequences) against the 90+ GB nt database, I used:

 # blastn -db nt -query query3.fa -task blastn -dust no -outfmt "6 delim=, qacc stitle sacc evalue bitscore qcovus pident" -max_target_seqs 1 -num_threads 4 -out results.txt

My CPU is an Intel i5-7300HQ, which has 4 cores and 4 threads, and I have 8 GB of RAM.

However, blasting just these 4 sequences took about 30 minutes, and my full set is about 85000 sequences. At this rate it would probably take about 1.5 years to blast all of them.
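As a rough check on that figure, the extrapolation can be sketched in shell arithmetic. This assumes run time scales linearly with the number of query sequences and ignores the one-off cost of loading the database, so it is only an order-of-magnitude estimate; the variable names are illustrative.

```shell
# Linear extrapolation from the 4-sequence test run (numbers taken from the post).
n_test=4          # sequences in the test query
minutes_test=30   # wall-clock minutes for the test run
n_total=85000     # sequences in the full set

total_minutes=$(( n_total * minutes_test / n_test ))
total_days=$(( total_minutes / 60 / 24 ))   # integer division, rounds down
echo "Estimated run time: ${total_days} days"
```

Under these assumptions the result is a bit over a year, the same order of magnitude as the estimate above.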

Is there no other way to speed this up other than using a more powerful CPU?

Would formatting my query file, or even using the FASTA version of the nt database, help?

This is what my query file looks like (I have already deleted a big portion of the sequence to show here)

>H42_1_(paired,_trimmed_pairs)_contig_1_consensus
CATCACCTCCAAGATCCGGCTTGTGAATTCAACTTGTCGCCCGGAGGCTTCCCAAATTCT
TAGACTGCGCGCCTGCCTAAGCCAGCTACCTAACAATATACCACTCTCATTGCACTCAAT
GATGTCTGCAGAGTCGGCGCGCTG

>H42_1_(paired,_trimmed_pairs)_contig_2_consensus
GCAGAACCGAGCTTCAAGCTCCAAGATCCGGCTTTTGAATTCAACTTGTCGCCTGGAGGC
TTCCCAAATTCTTAGACTGCGCGCCTGCCTGAGCCAGCTACTTAACAATATACCACCCCC
ATTGAACTCAATGATGTCTCAATCGAACGTGTAAGGCTTGGAGCTTGGAGCTTGAAGCTC
GGTTC

>H42_1_(paired,_trimmed_pairs)_contig_3_consensus
GAGGAATATGAATCCGGATAACAATATTACAATGATGCGATGTTTAACTGCTACTGCCTC
TTAACTATCAACGTCTACATAC

>H42_1_(paired,_trimmed_pairs)_contig_4_consensus
ACCGCCGGATGGGTCTGCAGAGAGGTTAACGAAAGTCGGTGCGGAGACGCCTTTCTCGCC
GCCGATA

Thank you very much in advance!

RNA-Seq blastn blast+ • 15k views

ADD COMMENT • link updated 15 months ago by Dunois ★ 2.9k • written 4.5 years ago by chiachoong_leong93 ▴ 20

Your contigs look rather short. I almost had to look up "unigene", but these are de novo assembled transcripts, correct? You might want to check whether you can improve the assembly to increase the length of the contigs and reduce their number. This might not save you that much time, but it would make your results more informative. Then you should ask whether running BlastN is very informative here, because you will only catch very similar sequences, and waiting 1.5 years for that is maybe not worth it :) I would prefer getting alignments at the amino-acid level, but then you need BlastX, or DIAMOND, against NR. With DIAMOND you might even be able to finish the job on your hardware; for Blast you need either a cluster, or at least a multi-core machine, or a large cloud instance (which will be expensive).

ADD REPLY • link 4.5 years ago by Michael 56k

Your contigs look rather short,

@Michael OP has said:

I have already deleted a big portion of the sequence to show here

ADD REPLY • link 4.5 years ago by GenoMax 153k

OK, sorry, I didn't get that. Still, it might be better to use BlastX, DIAMOND, or a pipeline like Trinotate.
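A translated search with DIAMOND could look roughly like the sketch below. The file names, database name, e-value cutoff, and block size are placeholders chosen for illustration, not values from this thread.

```shell
# Sketch only: all file and database names are hypothetical.
build_diamond_db() {
    # $1 = protein FASTA (for example a downloaded nr or Swiss-Prot file)
    # $2 = name of the DIAMOND database to create
    diamond makedb --in "$1" --db "$2"
}

run_diamond_blastx() {
    # $1 = nucleotide query FASTA, $2 = DIAMOND database, $3 = output file
    # --outfmt 6 gives BLAST-style tabular output.
    # --block-size is in billions of sequence letters; smaller values
    # lower memory use at the cost of speed (1 is an assumed starting point).
    diamond blastx --db "$2" --query "$1" --out "$3" \
        --outfmt 6 --max-target-seqs 1 --evalue 1e-5 \
        --threads 4 --block-size 1
}
```

Usage would be along the lines of `build_diamond_db nr.faa nr` followed by `run_diamond_blastx unigenes.fa nr hits.tsv`.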

ADD REPLY • link 4.5 years ago by Michael 56k

Have you tried reducing your query set by clustering it at some high identity threshold to see how many representatives remain?
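One way to try this is CD-HIT-EST, sketched below. The comment above does not name a tool, so the choice of program, the 95% threshold, and the file names are all assumptions.

```shell
# Sketch only: tool choice, threshold and file names are assumptions.
cluster_unigenes() {
    # $1 = input FASTA, $2 = output FASTA holding one representative per cluster
    # -c 0.95 : sequence identity threshold
    # -n 10   : word size, within the range suggested for thresholds of 0.95 to 1.0
    # -T 4    : threads; -M 6000 : memory limit in MB
    # -d 0    : keep sequence names up to the first space in the .clstr file
    cd-hit-est -i "$1" -o "$2" -c 0.95 -n 10 -T 4 -M 6000 -d 0
}

count_sequences() {
    # Count FASTA records by counting header lines.
    grep -c '^>' "$1"
}
```

Comparing `count_sequences unigenes.fa` with `count_sequences unigenes_nr95.fa` after running `cluster_unigenes unigenes.fa unigenes_nr95.fa` shows how many representatives remain.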

ADD REPLY • link 4.5 years ago by 5heikki 11k

score 4 · Answer 1 · 2021-02-22

With your hardware, specifically with your low memory, there is no way to make this substantially faster - see a recent discussion here on a similar topic. This will hold regardless of which program you use, because nt is a gigantic database and it will not fit into the memory you have, which translates into lots of disk swapping.

You already have many good suggestions, so I will add a couple that were not mentioned.

Use something like average nucleotide identity (ANI) to quickly compare your sequences with a collection of genomes. See an example here. It will not give you an answer on a per-sequence basis, but it will identify what collections of sequences are most similar to yours on a global level.
Use hashing algorithms for the comparison - see here and here for details.

To give you better suggestions - and there are other options - you would need to provide more details of what your sequences are and what is the minimum amount of information you are hoping to get. For example, there are other strategies to employ if you predict genes from your sequences and search with proteins instead.

score 2 · Answer 2 · 2021-02-22

So, I think there are a few steps you can take anyway:

Identify the right search strategy for your application, likely BlastN or BlastX. One could argue that it is required to use both, but if your resources are limited, you might get more from a BlastX run in this case, even though it might run for even longer.
If your search strategy is BlastX, then you can use DIAMOND or GhostX as a replacement. This is the only approach that will work on desktop hardware.
Even if you have enough resources, like a 50+ CPU cluster, and want to run NCBI Blast, it still pays off to optimize the search. Database size matters, so if you have a eukaryote you can at least throw out bacterial taxa (and vice versa), or restrict the database even further. Of course that also has its drawbacks, like not detecting contaminants.
For BlastN, you can further use the task "megablast", which will speed up your search but only find highly similar matches.
Using GNU parallel might further speed up your query over simply using -num_threads; see Truly Parallel Blasts With Blast+ for further links, but your mileage may vary.
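The megablast, taxon-restriction, and query-splitting ideas from the list above can be sketched as follows. The chunk size, file names, and the example taxonomy ID are assumptions; `-taxids` needs a version 5 BLAST database, and older BLAST+ releases may only accept species-level IDs, so check `blastn -help` for your version.

```shell
# Sketch only: chunk size, file names and the taxonomy ID are assumptions.
split_fasta() {
    # $1 = input FASTA, $2 = records per chunk, $3 = output prefix
    # Writes prefix_0001.fa, prefix_0002.fa, ... with at most $2 records each.
    awk -v size="$2" -v prefix="$3" '
        /^>/ {
            if (n % size == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%04d.fa", prefix, ++chunk)
            }
            n++
        }
        out != "" { print > out }
    ' "$1"
}

run_megablast() {
    # $1 = query FASTA, $2 = output file, remaining arguments are passed to blastn,
    # e.g. "-taxids 2759" to restrict the search to Eukaryota.
    query=$1; out=$2; shift 2
    blastn -db nt -query "$query" -task megablast \
        -outfmt "6 delim=, qacc stitle sacc evalue bitscore qcovus pident" \
        -max_target_seqs 1 -num_threads 4 -out "$out" "$@"
}

# Example (not run here): search chunks of 1000 sequences one after another.
#   split_fasta unigenes.fa 1000 chunk
#   for f in chunk_*.fa; do run_megablast "$f" "${f%.fa}.csv"; done
# With GNU parallel, single-threaded jobs could run side by side instead:
#   ls chunk_*.fa | parallel -j 4 'blastn -db nt -query {} -task megablast -num_threads 1 -outfmt 6 -out {.}.tsv'
```

Splitting the query also means an interrupted run can resume from the last finished chunk. On a machine where nt does not fit in memory, running several jobs at once is unlikely to help much.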

Finally, your estimate of 1.5 years to complete does not take into account the significant startup time required for loading the NT/NR Blast database, which is paid once per run rather than once per sequence. So in the end the whole search might be a bit more efficient than your estimate suggests, but it will definitely still take too long.

score 1 · Answer 3 · 2022-02-05

3.6 years ago

Michael 56k

More software intended to speed up Blast searches:

CrocoBLAST Paper Download
High Speed BlastN Paper Download (uses BWT and builds FMD index, not sure this will work with the whole NT database)

Thus, there are options for speeding up sequence searching. One idea, unlike what Blast does, is to index the database: suffix arrays or a BWT-based index can drastically speed up searches. However, the question then is: is it possible to create the index, and does it fit in memory? Because suffix arrays can be constructed in O(n) time and space, it should in principle be possible to create and use such an index on a server with ~1 TB of memory.

Another option is to use GPU-accelerated Blast, like G-BLASTN (Paper). If you have supported hardware, it could significantly speed up searches.

ADD COMMENT • link 3.6 years ago by Michael 56k

I'm surprised nobody's suggested MMseqs2. It can do nearly everything BLAST can but much faster (albeit at the cost of some loss in sensitivity).
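A minimal sketch of an MMseqs2 nucleotide search is shown below, together with the `easy-rbh` workflow that comes up later in this thread. File names, the temporary directory, and the thread count are placeholders.

```shell
# Sketch only: file names and the thread count are hypothetical.
run_mmseqs_search() {
    # $1 = query FASTA, $2 = target FASTA or MMseqs2 database,
    # $3 = output file (BLAST-style tabular), $4 = temporary directory
    # --search-type 3 requests a nucleotide-vs-nucleotide search.
    mmseqs easy-search "$1" "$2" "$3" "$4" --search-type 3 --threads 4
}

run_mmseqs_rbh() {
    # Reciprocal best hits between two FASTA files.
    # $1 = first FASTA, $2 = second FASTA, $3 = output file, $4 = temporary directory
    mmseqs easy-rbh "$1" "$2" "$3" "$4"
}
```

A call such as `run_mmseqs_search unigenes.fa targets.fa hits.m8 tmp` would write one line per hit to `hits.m8`.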

ADD REPLY • link 3.6 years ago by Dunois ★ 2.9k

NCBI is planning to use indexes created by MMseqs2 for web searches using nr (NCBI looking for testers for a new web-only (for now) clustered `nr` database). BLAST remains popular since NCBI provides free public infrastructure to run searches.

Note: No software is going to be able to address the hardware limitations evident in the original post. That includes MMseqs2.

Have you replaced blast with MMseqs2 in your own workflows?

ADD REPLY • link 3.6 years ago by GenoMax 153k

That's really awesome. I really hope they release data sets clustered at different thresholds à la UniRef from UniProt.

And I agree, a quad core laptop processor and 8 gigs of RAM just won't cut it. I just wanted to mention MMseqs2 here since nearly every other alternative has been covered here, and it's likely this thread will keep popping up in search results in the future.

Have you replaced blast with MMseqs2 in your own workflows?

I actually have. I haven't used BLAST in a long while now apart from the occasional quick search via their web interface. MMseqs2 is really convenient for me since I tend to do a lot of reciprocal searches, and it happens to offer a handy easy-rbh sub-command for this purpose.

ADD REPLY • link 3.6 years ago by Dunois ★ 2.9k

MMseqs2 is indeed a good acceleration solution, but its index files require ~6 TB of space....

ADD REPLY • link 15 months ago by m13113153781 • 0

What databases are you using that are creating such large index files?

ADD REPLY • link 15 months ago by Dunois ★ 2.9k

score 0 · Answer 4 · 2021-02-22

4.5 years ago

GenoMax 153k

Unfortunately there is no way to speed this up with the hardware you have. If you have 85K sequences you should find alternate hardware. If you are working with a specific species, then find the genome of a close relative to cut down on the search space.
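Searching against a single related genome could be sketched like this. The file and database names are placeholders, and the choice of `dc-megablast` (aimed at cross-species comparisons) is an assumption rather than something stated in this answer.

```shell
# Sketch only: file and database names are hypothetical.
build_relative_db() {
    # $1 = genome (or transcriptome) FASTA of the related species
    # $2 = name of the BLAST database to create
    makeblastdb -in "$1" -dbtype nucl -out "$2"
}

search_relative_db() {
    # $1 = query FASTA, $2 = database name, $3 = output file
    blastn -db "$2" -query "$1" -task dc-megablast \
        -outfmt 6 -max_target_seqs 1 -num_threads 4 -out "$3"
}
```

A database built from one genome is a tiny fraction of the size of nt, so it should fit comfortably in 8 GB of RAM.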

ADD COMMENT • link 4.5 years ago by GenoMax 153k