I've seen that HMMER can be sped up with GNU Parallel: Speed of hmmsearch
I have around 100,000 sequences and a HMMER database of around 300 HMM profiles.
I’m running everything at once but I’m wondering if it’ll be faster to split up the sequences and/or split up the jobs.
There are a few ways to do it if I'm working with 16 threads:
1. Should I split the sequences into as many files as there are threads and run them in parallel with 1 thread each? For example, split the sequences into 16 files and then run 16 jobs at 1 thread each?
2. Should I split into around 1000 files and then run hmmsearch on them one at a time, each with the full 16 threads?
3. Should I split into 1000 files and then run 16 of them at a time, each with 1 thread?
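To make the three options concrete, here is a small sketch that tabulates what each one means with my numbers. The function name `layout` is just something I made up for illustration, and the sequences-per-file column assumes the split is roughly even. It also ignores that every file is searched against each marker database.

```shell
# Summarise the three layouts: how many files, how many sequences per file
# (rounded up, assuming an even split), how many hmmsearch jobs run at once,
# and the --cpu value each job would get.
layout() {
    total=$1      # number of sequences (about 100,000)
    threads=$2    # available threads (16)
    chunks=$3     # number of small files for options 2 and 3 (about 1000)
    printf 'option\tfiles\tseqs_per_file\tjobs_at_once\tcpu_per_job\n'
    printf '1\t%d\t%d\t%d\t%d\n' "$threads" "$(( (total + threads - 1) / threads ))" "$threads" 1
    printf '2\t%d\t%d\t%d\t%d\n' "$chunks" "$(( (total + chunks - 1) / chunks ))" 1 "$threads"
    printf '3\t%d\t%d\t%d\t%d\n' "$chunks" "$(( (total + chunks - 1) / chunks ))" "$threads" 1
}

layout 100000 16 1000
```

All three layouts ask for 16 threads in total; they differ only in whether the parallelism comes from running many single-threaded jobs or from hmmsearch's own `--cpu` threading.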
What’s the best way to do it?
Are there any wrappers that do this for you?
If I wanted to do option 1, how would I do this?
For simplicity, instead of 16 threads, let's just use 4 threads.
I have my large fasta file (in reality it's a lot of different fasta files of different sizes that I cat together):
# The pipeline I'm working on uses seqkit as the fasta parsing and manipulation package
cat *.faa | seqkit split -p 4 -O proteins/
In proteins/ there will be 4 fasta files:
stdin.part_001.fasta, stdin.part_002.fasta, stdin.part_003.fasta, stdin.part_004.fasta
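As a sanity check on the split, something like the following could confirm that each part holds roughly the same number of sequences. This is only a sketch: `count_records` is a name I made up, and it simply counts header lines starting with `>`.

```shell
# Count FASTA records in each file by counting header lines (those starting with ">").
# Prints "<file><TAB><number of records>" for every file given as an argument.
count_records() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Example (assumes the split above has already been run):
# count_records proteins/stdin.part_*.fasta
```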
Here are my marker databases:
DB_1=MarkerSet1/genes.hmm.gz
DB_2=MarkerSet2/genes.hmm.gz
DB_3=MarkerSet3/genes.hmm.gz
How would I run hmmsearch with GNU Parallel using each of these 4 fasta files that I split up?
I'm trying to adapt this code here from a previous BioStars post to my situation with the multiple files:
function hmmer() {
    # $1 is one split fasta file; n is its file name without the directory
    n=$(basename "$1")
    hmmsearch -Z 2000000 --domZ 89 --cpu 1 -o "$1.output.hmmsearch.txt" stockholm89.hmm "$1"
}
export -f hmmer
find /where/the/split/files/are/ -maxdepth 1 -type f -name "*specific2splitFiles" | parallel -j 20 hmmer {}
I'm expecting an output like this but I don't understand how to adapt the parallel command:
stdin.part_001.MarkerSet1.tblout
stdin.part_002.MarkerSet1.tblout
stdin.part_003.MarkerSet1.tblout
stdin.part_004.MarkerSet1.tblout
stdin.part_001.MarkerSet2.tblout
stdin.part_002.MarkerSet2.tblout
stdin.part_003.MarkerSet2.tblout
stdin.part_004.MarkerSet2.tblout
stdin.part_001.MarkerSet3.tblout
stdin.part_002.MarkerSet3.tblout
stdin.part_003.MarkerSet3.tblout
stdin.part_004.MarkerSet3.tblout
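For reference, here is my rough attempt at generating one command per (database, fasta part) pair with the naming scheme above. It is a sketch under assumptions, not something I have verified end to end: `hmmer_jobs` is a name I made up, I'm assuming hmmsearch can read the gzipped genes.hmm.gz files directly, and it would break on paths containing spaces. The `--cpu`, `-o` and `--tblout` options are standard hmmsearch options.

```shell
# Marker databases from my question (space-separated list).
DBS="MarkerSet1/genes.hmm.gz MarkerSet2/genes.hmm.gz MarkerSet3/genes.hmm.gz"

# Hypothetical helper: print one hmmsearch command per (database, fasta part) pair.
# Output names follow <part>.<MarkerSet>.tblout; -o /dev/null discards the main report.
hmmer_jobs() {
    for db in $DBS; do
        marker=$(basename "$(dirname "$db")")     # MarkerSet1/genes.hmm.gz -> MarkerSet1
        for fasta in "$@"; do
            part=$(basename "$fasta" .fasta)      # proteins/stdin.part_001.fasta -> stdin.part_001
            echo "hmmsearch --cpu 1 -o /dev/null --tblout $part.$marker.tblout $db $fasta"
        done
    done
}

# Preview the job list without running anything:
# hmmer_jobs proteins/stdin.part_*.fasta
# Run 4 jobs at a time by piping the command lines into GNU Parallel:
# hmmer_jobs proteins/stdin.part_*.fasta | parallel -j 4
```

With 4 parts and 3 databases this prints 12 command lines. When GNU Parallel is given no command of its own, it runs each input line as a command, so the same list would also work for option 3 by changing the split size and the `-j` value.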