So I'm trying to parse some HMMer3.1 results with the biopython SearchIO module and I can't seem to find a field where the parser stores the length of the HMM model? I can get the alignment length, however, I would like to pull out the length of the model. I've probably overlooked something in the documentation as this seems like it should be included.
I think the answer depends on whether your query is the model or the sequences. For hmmscan or hmmsearch, where the model is the query, the result.seq_len method should be what you want, whereas the hsp objects store the alignment results. There is a similar method in bioperl's Search::IO where $result->query_length gets the length of the query sequence/model. The HMMERv3 tests in biopython should help you find the right method and give you a way to test this.
I'm using hmmscan, and from there the
result.seq_len
is the query sequence (i.e. protein sequence). It seems odd to me that you wouldn't be able to get both lengths from either of the searches (i.e. 'hit' and query). I can certainly get the length of the alignment through the HSP hit - but that isn't telling me the length of the HMM model. I can get it with some parsing of the output of hmmfetch, but that feels rather hackish to me. If you use the--domtblout
for hmmscan, one of the column results is the HMM model length, however, it seems that the biopython parser for the domain table skips over that feature.Oh, I was wrong about hmmscan, that is part of the problem. Hmmscan does use the sequence as the query, but hmmsearch does not. Did you try to get the hit length (not the hsp)? That would be the model, not the sequence. I don't see another method to get this, but I'll look again.
This looks to be all there is for hit attributes:
If you look at the test for the hmmscan domtblout format you can see there is a
hit.seq_len
method that gives you the model length. This correlates with the info for each model in Pfam. There are a couple of residues difference in length on Pfam vs. biopython tests, but that is probably just version differences for the models. It seems like theseq_len
method is what you want.Okay, maybe you can only get that value if you use the
--domtblout
. At the moment I was using full output (text output) from hmmscan as I thought that would contain more information, however it seems you cannot parse out the model length or the model accession number if you don't use the--domtblout
of hmmscan. Thanks for the help @SES.I think you are correct, as I don't see that method for the text parser (but I didn't test either). Good to know.