Help speeding up HMMER's hmmsearch on a large FASTA file with GNU Parallel
3.3 years ago
O.rka ▴ 740

I've seen that HMMER can be sped up with GNU Parallel: Speed of hmmsearch

I have around 100,000 sequences and a HMMER database of around 300 HMM profiles.

I’m running everything at once but I’m wondering if it’ll be faster to split up the sequences and/or split up the jobs.

There are a few ways to do it if I'm working with 16 threads:

  1. Should I split the sequences into as many files as there are threads and run them in parallel with 1 thread each? For example, split the sequences into 16 files and then run 16 jobs with 1 thread each?

  2. Should I split into around 1000 files and then run hmmsearch on each of them one at a time, but with the full 16 threads?

  3. Should I split into 1000 files, then run 16 of them at a time, each with 1 thread?

What’s the best way to do it?

Are there any wrappers that do this for you?

If I wanted to do option 1, how would I do this?

For simplicity, instead of 16 threads, let's just use 4 threads.

I have my large FASTA file (in reality it's a lot of different FASTA files of different sizes that I cat together):

# The pipeline I'm working on uses seqkit as the fasta parsing and manipulation package
cat *.faa | seqkit split -p 4 -O proteins/

In proteins/ there will be 4 FASTA files:

stdin.part_001.fasta, stdin.part_002.fasta, stdin.part_003.fasta, stdin.part_004.fasta

Here are my marker databases:

DB_1=MarkerSet1/genes.hmm.gz
DB_2=MarkerSet2/genes.hmm.gz
DB_3=MarkerSet3/genes.hmm.gz

How would I run hmmsearch with GNU Parallel using each of these 4 fasta files that I split up?

I'm trying to adapt this code from a previous Biostars post to my situation with multiple files:

function hmmer() {
    # search one split file with a single CPU; the output lands next to the input file
    hmmsearch -Z 2000000 --domZ 89 --cpu 1 -o "$1.output.hmmsearch.txt" stockholm89.hmm "$1"
}

export -f hmmer
find /where/the/split/files/are/ -maxdepth 1 -type f -name "*specific2splitFiles" | parallel -j 20 hmmer {}

I'm expecting output like this, but I don't understand how to adapt the parallel command:

stdin.part_001.MarkerSet1.tblout
stdin.part_002.MarkerSet1.tblout
stdin.part_003.MarkerSet1.tblout
stdin.part_004.MarkerSet1.tblout
stdin.part_001.MarkerSet2.tblout
stdin.part_002.MarkerSet2.tblout
stdin.part_003.MarkerSet2.tblout
stdin.part_004.MarkerSet2.tblout
stdin.part_001.MarkerSet3.tblout
stdin.part_002.MarkerSet3.tblout
stdin.part_003.MarkerSet3.tblout
stdin.part_004.MarkerSet3.tblout
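
Here's my best guess at an adaptation so far, just as a sketch: it pairs every split FASTA file with every marker set using GNU Parallel's ::: input sources and runs 4 jobs at a time with 1 CPU each. The run_one function name and the -o /dev/null part are my own placeholders, and I'm not sure whether hmmsearch reads the gzipped profiles directly, so they might need to be gunzipped first.

# Sketch: pair every split FASTA file with every marker set (4 x 3 = 12 jobs),
# running 4 jobs at a time with 1 CPU each.
run_one() {
    faa="$1"                                 # e.g. proteins/stdin.part_001.fasta
    hmm="$2"                                 # e.g. MarkerSet1/genes.hmm.gz
    base=$(basename "$faa" .fasta)           # -> stdin.part_001
    marker=$(basename "$(dirname "$hmm")")   # -> MarkerSet1
    hmmsearch --cpu 1 --tblout "${base}.${marker}.tblout" -o /dev/null "$hmm" "$faa"
}
export -f run_one

parallel -j 4 run_one {1} {2} \
    ::: proteins/stdin.part_*.fasta \
    ::: MarkerSet1/genes.hmm.gz MarkerSet2/genes.hmm.gz MarkerSet3/genes.hmm.gz

Would something like that produce the file names I listed above?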
Tags: protein, parallel, hmmer, hmm, bash
18 months ago

Late to the party, but just for future reference:

hmmsearch, though indeed multi-threaded, will already max out at 2-3 CPUs: the process is very I/O intensive, so reading and writing will be the bottleneck (as Mensur Dlakic already pointed out). Anything above that will likely slow your runs down significantly rather than speeding them up.

So your best chance of speeding this up is to split your input into multiple chunks and run each chunk on 2 CPUs (meaning you can run 8 chunks at a time on your 16-CPU machine).

More info here: https://github.com/EddyRivasLab/hmmer/issues/161
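
As a sketch of what that could look like with GNU Parallel (the file and directory names here are placeholders, not taken from the question):

# Sketch: 8 concurrent hmmsearch jobs with 2 CPUs each on a 16-CPU machine.
# proteins/*.fasta (the input chunks) and profiles.hmm are placeholder names.
parallel -j 8 \
    'hmmsearch --cpu 2 --tblout {/.}.tblout -o /dev/null profiles.hmm {}' \
    ::: proteins/*.fasta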

3.3 years ago
Mensur Dlakic ★ 28k

I haven't tried this exact exercise, but from previous experience I don't think options 1 & 3 will work well. Both of them would have 16 independent search processes running, which also means 16 read/write processes. Unless you have an ultra-fast disk - and even if you do - that will most likely be the choking point.

Option 2 would run only one read/write process at a time, so there won't be a problem like the one above. But why split everything into 1000 chunks? hmmsearch is multithreaded, so simply use the maximum --cpu and run everything at once: all HMMs and all your sequences with a single command.
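
For illustration, a minimal sketch of that single-command approach using the names from the question (assuming hmmsearch accepts the gzipped profile; gunzip it first if not):

# Sketch: one hmmsearch over all sequences against one marker set, using all 16 CPUs.
cat *.faa > all_proteins.faa
hmmsearch --cpu 16 --tblout all_proteins.MarkerSet1.tblout -o /dev/null \
    MarkerSet1/genes.hmm.gz all_proteins.faa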

As always, you can take a small subset of your data, say 5,000-10,000 sequences, and try these options on your architecture.
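
For example, a quick subset could be drawn with seqkit (a sketch only; the all_proteins.faa name is a placeholder, and seqkit sample with -n picks a fixed number of random sequences):

# Sketch: sample 10,000 sequences for a quick benchmark of the three options.
seqkit sample -n 10000 all_proteins.faa > subset.faa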
