I'm running hmmsearch via HMMER 3.3 with a concatenated HMM file. Here are the first 1000 lines of my tigrfams.hmm.gz. It looks like the file is properly merged, with the following format:
[HMM]
//
[HMM]
//
I don't know why, but I'm only getting a hit for the first HMM, TIGR00001, and I have over a million protein sequences. I know there are definitely other hits, yet it's running quickly. I ran the same type of command on a concatenated PFAM db and a KOFAM db, and both worked :/ However, when I ran it on a concatenated PANTHER db, I got a similar result to my TIGRFAM run (only one hit).
Here is my command (it's an array job for Sun Grid Engine):
#!/bin/bash
###########
# Variables
###########
__INPUT=tigrfam_output/array_output/sequences/partition_${SGE_TASK_ID}.fasta.gz
__OUTPUT=tigrfam_output/array_output/output/hmmsearch.tigrfam_${SGE_TASK_ID}.tblout
__THREADS=1
#########
# Command
#########
time(hmmsearch -o /dev/null --tblout $__OUTPUT --cut_ga --cpu $__THREADS /usr/local/scratch/METAGENOMICS/jespinoz/db/tigrfam_db/tigrfams.hmm.gz $__INPUT)
What am I doing incorrectly?
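One way to double-check the merged file before searching is to count its records. The sketch below assumes the HMMER3 ASCII profile format, where each model has one NAME line and ends with a // line. It also counts GA lines, because the command above uses --cut_ga, which relies on gathering thresholds stored in each model. The function name count_hmm_records is hypothetical, not part of HMMER.

```shell
#!/bin/bash
# Hypothetical helper: summarise a gzipped, concatenated HMMER3 profile file.
# Assumes each model has exactly one NAME line and ends with a "//" line.
count_hmm_records() {
    local hmm_gz=$1
    gunzip -c "$hmm_gz" | awk '
        $1 == "NAME" { names++ }   # one per model
        $1 == "GA"   { ga++ }      # gathering thresholds, read by --cut_ga
        $1 == "//"   { ends++ }    # record terminator
        END { printf "models=%d terminators=%d with_GA=%d\n", names, ends, ga }
    '
}

# Example (the path is an assumption):
#   count_hmm_records tigrfams.hmm.gz
```

If with_GA turns out smaller than models, some profiles carry no gathering threshold, which may be worth looking into together with whatever the job wrote to its stderr log.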
What confuses me the most is that I used hmmsearch for the PFAM database and it seemed to work great. What could be different about the TIGRFAM database? They look the same when I open up the merged files.
Not sure what the problem is, other than that any HMM search will be slow on a large database when using a single thread. Also, nothing is wrong with the TIGRFAMs database either.
It may be a good idea to troubleshoot on a smaller database, as you will get the answer faster. I just ran hmmsearch with the whole TIGRFAMs database against PDB (~110K sequences). The output was as expected, with all HMMs having hits. That was using 3 CPUs, and it still took 32 minutes. I am sure you can do the math on how long it would take with a single CPU and a 100-200x larger database (assuming roughly linear scaling, 32 minutes on 3 CPUs is about 96 CPU-minutes, so on the order of 160-320 CPU-hours).
Thanks again, I'll run some test examples right now.
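A small test along the lines suggested above could be sketched as follows. The two helper names, first_n_fasta and count_queries_with_hits, are hypothetical. The second one assumes the standard --tblout layout, in which comment lines start with # and the query (model) name is the third whitespace-separated column.

```shell
#!/bin/bash
# Hypothetical helpers for a quick small-scale test; not part of HMMER.

# Print the first n records of a gzipped FASTA file to stdout.
first_n_fasta() {
    local fasta_gz=$1 n=$2
    gunzip -c "$fasta_gz" | awk -v n="$n" '
        /^>/ { seen++ }        # each header line starts a new record
        seen > n { exit }      # stop once n records have been printed
        { print }
    '
}

# Count distinct query models that appear in an hmmsearch --tblout file.
# Assumes comment lines start with "#" and column 3 is the query name.
count_queries_with_hits() {
    local tblout=$1
    awk '!/^#/ && NF { print $3 }' "$tblout" | sort -u | wc -l | tr -d ' '
}

# Example (file names are assumptions):
#   first_n_fasta partition_1.fasta.gz 1000 > test.fasta
#   hmmsearch -o /dev/null --tblout test.tblout --cut_ga tigrfams.hmm.gz test.fasta
#   count_queries_with_hits test.tblout
```

A count of 1 on a subset that should match several families would reproduce the problem in seconds rather than hours, and the same test can then be repeated with the PFAM file for comparison.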