I am currently trying to run tblastn with approximately 1000 protein sequences against a 35 Gb genome. However, the run time is very long even when using multiple threads, and the job usually does not finish correctly (it crashes with a "core dumped" message).
Would anyone have an idea how I could accelerate the tblastn search? I know that blastp and blastx searches can be accelerated using external software such as DIAMOND, but tblastn is not implemented in DIAMOND...
Maybe you are requesting more threads than your machine actually has available? I have had this issue on my VM: I thought I had allowed the VM to use 4 cores, but in reality it only had 3. That led to similar error messages for me when I tried to run four processes simultaneously...
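A quick way to check this before launching anything is to ask the OS how many CPUs the process may really use and cap the thread count accordingly. This is only a minimal sketch: the helper names usable_cpus and safe_thread_count are made up for illustration, and os.sched_getaffinity is only available on some platforms (Linux, for example), so there is a fallback to os.cpu_count.

```python
import os


def usable_cpus():
    """Return how many CPUs this process is actually allowed to run on."""
    # sched_getaffinity respects CPU pinning / container limits where supported.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: total CPUs reported by the OS (may be None on odd platforms).
    return os.cpu_count() or 1


def safe_thread_count(requested):
    """Clamp a requested thread count to the range 1..usable_cpus()."""
    return max(1, min(requested, usable_cpus()))


if __name__ == "__main__":
    print("usable CPUs:", usable_cpus())
    print("threads to request for one job:", safe_thread_count(4))
```

The clamped value is what you would then pass to -num_threads, keeping in mind that several jobs running at once share the same cores.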
Splitting the input file and/or the DB is likely the only suitable approach to speed this up. Also, don't use too many threads per job, as that does not pay off; something like ~4 threads per job is near the sweet spot.
If you split the DB as well, don't forget to set the theoretical DB size (that of the full, unsplit database) in each BLAST job, via the -dbsize option in BLAST+. That way your e-values will still be comparable across chunks.
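To make the splitting advice concrete, here is a rough sketch of how the per-chunk jobs could be generated, one command per (query chunk, DB chunk) pair, each with a modest thread count and the same -dbsize. The function names, file names and the size value are placeholders I made up; the tblastn options used (-query, -db, -out, -outfmt, -num_threads, -dbsize) are standard BLAST+ ones, but do check the BLAST+ manual for what value -dbsize should take for your full database.

```python
from pathlib import Path


def tblastn_command(query, db, out, dbsize, threads=4):
    """Build the argument list for one tblastn job on one query/DB chunk."""
    return [
        "tblastn",
        "-query", str(query),
        "-db", str(db),
        "-out", str(out),
        "-outfmt", "6",               # tabular output, easy to concatenate later
        "-num_threads", str(threads),
        "-dbsize", str(dbsize),       # size of the FULL database, not of this chunk
    ]


def all_jobs(query_chunks, db_chunks, dbsize, threads=4):
    """Return one command per (query chunk, DB chunk) combination."""
    jobs = []
    for query in query_chunks:
        for db in db_chunks:
            out = f"{Path(query).stem}_vs_{Path(db).name}.tsv"
            jobs.append(tblastn_command(query, db, out, dbsize, threads))
    return jobs


if __name__ == "__main__":
    FULL_DB_SIZE = 35_000_000_000  # placeholder value, assumption for illustration
    for cmd in all_jobs(["prot_01.fa", "prot_02.fa"], ["genome_a", "genome_b"], FULL_DB_SIZE):
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run or written into a job script for a cluster scheduler; the tabular outputs can simply be concatenated afterwards.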
And wow, a 35 Gb genome; I can only think of a few species in that range ;). Good luck!
You don't have to split it up into a single fasta/scaffold per chunk (unless the sequences are very large). You're better off splitting into chunks of roughly equal file size: it is more efficient, and you avoid jobs with the larger sequences running much longer than those with the shorter ones (i.e. it is more manageable).
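One possible way to get chunks of roughly equal size is sketched below: read the sequence lengths once, then hand out sequences from longest to shortest, always to the chunk that is currently smallest (a simple greedy heuristic, not necessarily the optimal split). The function names are made up for illustration. Only names and lengths are kept in memory, so for a genome this large you would write the actual chunk files in a second streaming pass using the returned assignment.

```python
import heapq


def fasta_lengths(lines):
    """Yield (name, sequence_length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            name, length = line[1:].strip(), 0
        elif line and name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def assign_chunks(lengths, n_chunks):
    """Map each sequence name to a chunk index so that chunk totals are balanced."""
    heap = [(0, i) for i in range(n_chunks)]  # (current total length, chunk index)
    heapq.heapify(heap)
    assignment = {}
    # Longest sequences first, each into the currently smallest chunk.
    for name, length in sorted(lengths, key=lambda r: r[1], reverse=True):
        total, idx = heapq.heappop(heap)
        assignment[name] = idx
        heapq.heappush(heap, (total + length, idx))
    return assignment


if __name__ == "__main__":
    demo = [">scaf1", "ACGTACGTAC", ">scaf2", "ACGTACGTA", ">scaf3", "AC", ">scaf4", "A"]
    print(assign_chunks(fasta_lengths(demo), n_chunks=2))
```

With a real file you would pass an open file handle to fasta_lengths instead of the demo list.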