Question

Make tblastn run faster

0

4.3 years ago

maxime.policarpo ▴ 200

Hi everyone and happy new year.

I am currently trying to run tblastn with approximately 1000 protein sequences against a 35 Gb genome. However, the run time is very long even when using multiple threads, and the job usually does not finish correctly (it ends with a "Core dumped" message).

Would anyone have an idea how I could accelerate the tblastn search? I know that protein searches (blastp, blastx) can be accelerated using external software such as Diamond, but tblastn is not implemented in Diamond...

Thanks for any help provided !

Maxime

blast tblastn genome software • 1.9k views

ADD COMMENT • link updated 4.3 years ago by 6schulte ▴ 30 • written 4.3 years ago by maxime.policarpo ▴ 200

0

Maybe you are requesting more threads than your computer actually has available when multi-threading? I have had this issue on my VM. I thought I had allowed the VM to use 4 cores, but in reality it only had 3. This led to such error messages for me when I wanted to run four processes simultaneously...
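A quick way to check this before launching jobs is to compare the requested thread count with what the machine (or VM) really exposes. This is only a minimal sketch; the helper name `usable_threads` is made up for illustration, and it does not by itself explain or prevent a core dump.

```python
import os


def usable_threads(requested: int) -> int:
    """Cap a requested thread count at the CPUs this process may use."""
    try:
        # Linux only: honours the CPU affinity mask of the current process.
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total CPU count.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


if __name__ == "__main__":
    # Ask for 4 threads, but never more than the machine really offers.
    print(usable_threads(4))
```

The returned value could then be passed to the `-num_threads` option of tblastn.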

ADD REPLY • link 4.3 years ago by 6schulte ▴ 30

score 2 · Answer 1 · 2021-01-12

2

4.3 years ago

lieven.sterck 15k

Splitting the input file and/or the DB is likely the only suitable approach to speed this up. Also, don't use too many threads per job; that does not pay off. Something like ~4 threads per job is near the sweet spot.

If you split the DB as well, don't forget to set the theoretical DB size in the blast job, that way your e-values will still be comparable.
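As a rough sketch of what such a job could look like: the helper below only builds the tblastn command line for one query chunk against one DB chunk. The function name, the file names and the `full_db_size` value are placeholders; `-dbsize` is assumed here to be given the total length of the full, unsplit database, so check the BLAST+ documentation for the exact unit before relying on it.

```python
import shlex
from typing import List


def tblastn_command(query_fasta: str, db_chunk: str, out_path: str,
                    full_db_size: int, threads: int = 4) -> List[str]:
    """Build the argument list for one tblastn job on one DB chunk."""
    return [
        "tblastn",
        "-query", query_fasta,
        "-db", db_chunk,
        "-out", out_path,
        "-outfmt", "6",                   # tabular output, easy to concatenate
        "-num_threads", str(threads),     # ~4 threads per job, as suggested above
        "-dbsize", str(full_db_size),     # size of the WHOLE db, not of this chunk
    ]


if __name__ == "__main__":
    # Hypothetical file names and database size, for illustration only.
    cmd = tblastn_command("queries_01.faa", "genome_chunk_01",
                          "hits_01_vs_01.tsv", full_db_size=35_000_000_000)
    print(shlex.join(cmd))
```

Each command can be handed to `subprocess.run(cmd, check=True)` or written into a job script for a cluster scheduler; the tabular outputs of all jobs are then simply concatenated.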

And, wow, a 35 Gb genome, I can only think of a few species in that range ;). Good luck!!

ADD COMMENT • link 4.3 years ago by lieven.sterck 15k

0

Haha yeah, this is a lungfish genome that was recently added to the NCBI genome database (Neoceratodus forsteri).

I will try to split the genome fasta file into one fasta per scaffold and see if I can get something ...

Thanks for the tips and have a good day !

Max

ADD REPLY • link 4.3 years ago by maxime.policarpo ▴ 200

1

You don't have to split it up into a single fasta/scaffold per chunk (unless the sequences are very large). You're better off splitting them into chunks of roughly equal file size: that is more efficient, and you avoid the larger sequences running much longer than the shorter ones (== more manageable).
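One possible way to get roughly equal chunks is a greedy assignment: take the sequences from longest to shortest and always put the next one into the currently smallest chunk. The sketch below is just an illustration of that idea (the function names are made up); it only keeps sequence lengths in memory, not the sequences themselves, and writing the actual chunk files is left out.

```python
import heapq
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} from FASTA lines without storing sequences."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence id.
            current = (line[1:].split() or [""])[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def balanced_chunks(lengths: Dict[str, int], n_chunks: int) -> List[List[str]]:
    """Greedily assign sequence ids to n_chunks of roughly equal total length."""
    heap = [(0, i) for i in range(n_chunks)]  # (current total length, chunk index)
    heapq.heapify(heap)
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    # Longest sequences first, each into the chunk that is smallest so far.
    for seq_id, length in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(seq_id)
        heapq.heappush(heap, (total + length, idx))
    return chunks


if __name__ == "__main__":
    # "genome.fa" is a placeholder path.
    with open("genome.fa") as handle:
        for i, ids in enumerate(balanced_chunks(fasta_lengths(handle), 20)):
            print(i, len(ids))
```

A single scaffold that is longer than the target chunk size will still end up in a chunk of its own, which is the "unless the sequences are very large" case mentioned above.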

ADD REPLY • link 4.3 years ago by lieven.sterck 15k