Question

Biopython SearchIO HMMer 3 parsing, HMM model length?

0

Entering edit mode

9.3 years ago

Jon ▴ 360

So I'm trying to parse some HMMer3.1 results with the biopython SearchIO module and I can't seem to find a field where the parser stores the length of the HMM model? I can get the alignment length, however, I would like to pull out the length of the model. I've probably overlooked something in the documentation as this seems like it should be included.

biopython • 7.5k views

ADD COMMENT • link updated 14 months ago by jen ▴ 10 • written 9.3 years ago by Jon ▴ 360

0

Entering edit mode

I think the answer depends on whether your query is the model or the sequences. For hmmscan or hmmsearch, where the model is the query, the result.seq_len method should be what you want, whereas the hsp objects store the alignment results. There is a similar method in bioperl's Search::IO where $result->query_length gets the length of the query sequence/model. The HMMERv3 tests in biopython should help you find the right method and give you a way to test this.

ADD REPLY • link 9.3 years ago by SES 8.6k

0

Entering edit mode

I'm using hmmscan, and from there the result.seq_len is the query sequence (i.e. protein sequence). It seems odd to me that you wouldn't be able to get both lengths from either of the searches (i.e. 'hit' and query). I can certainly get the length of the alignment through the HSP hit - but that isn't telling me the length of the HMM model. I can get it with some parsing of the output of hmmfetch, but that feels rather hackish to me. If you use the --domtblout for hmmscan, one of the column results is the HMM model length, however, it seems that the biopython parser for the domain table skips over that feature.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Jon ▴ 360

0

Entering edit mode

Oh, I was wrong about hmmscan, that is part of the problem. Hmmscan does use the sequence as the query, but hmmsearch does not. Did you try to get the hit length (not the hsp)? That would be the model, not the sequence. I don't see another method to get this, but I'll look again.

ADD REPLY • link 9.3 years ago by SES 8.6k

0

Entering edit mode

This looks to be all there is for hit attributes:

bias*            hit-level bias
bitscore         hit-level score
description      hit sequence description
domain_exp_num*  expected number of domains in the hit (exp column)
domain_obs_num   observed number of domains in the hit (N column)
evalue           hit-level e-value
id               hit sequence ID
is_included*     boolean, whether the hit is in the inclusion threshold or not

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Jon ▴ 360

0

Entering edit mode

If you look at the test for the hmmscan domtblout format you can see there is a hit.seq_len method that gives you the model length. This correlates with the info for each model in Pfam. There are a couple of residues difference in length on Pfam vs. biopython tests, but that is probably just version differences for the models. It seems like the seq_len method is what you want.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by SES 8.6k

0

Entering edit mode

Okay, maybe you can only get that value if you use the --domtblout. At the moment I was using full output (text output) from hmmscan as I thought that would contain more information, however it seems you cannot parse out the model length or the model accession number if you don't use the --domtblout of hmmscan. Thanks for the help @SES.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Jon ▴ 360

0

Entering edit mode

I think you are correct, as I don't see that method for the text parser (but I didn't test either). Good to know.

ADD REPLY • link 9.3 years ago by SES 8.6k

Ram · Answer 1 · 2015-12-21

3

Entering edit mode

9.3 years ago

Jon ▴ 360

In case anybody else runs into this issue, here is the "answer". So you cannot get the HMM model accession or length from hmmscan output if you use the standard text output, you must use the --domtblout option in order to recoup this information, here is how you can parse those results with Biopython, I'm running hmmscan through the subprocess module and then parsing the output file.

import subprocess
from Bio import SearchIO

subprocess.call(['hmmscan', '--domtblout', 'output.txt', '/location/to/profile.hmm', 'input.proteins.fasta']
#now parse the output
with open('output.txt, 'rU') as input:
    for qresult in SearchIO.parse(input, 'hmmscan3-domtab'):
        query_id = qresult.id  #sequence ID from fasta
        query_len = qresult.seq_len
        hits = qresult.hits
        num_hits = len(hits)
        if num_hits > 0:
            for i in range(0,num_hits): 
                hit_evalue = hits[i].evalue #evalue
                hmm_name = hits[i].accession
                hmm_length = hits[i].seq_len
                print query_id, query_len, hit_evalue, hmm_name, hmm_length

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by Jon ▴ 360

0

Entering edit mode

Hi Jon,

Indeed that's the correct solution. You can only get the hit's length (HMM in this case) when using the domain table format. It should have been clearer in the documentation, I admit. I've just updated the documentation so future versions should have this corrected.

Thanks for posting the solution for now :).

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 9.3 years ago by bow ▴ 790

0

Entering edit mode

Hi,

i'm parsing my domtblout results using the same script, except i got an assert error : assert len(cols) == 23 for every last hit. Do you have any idea on how to fix it ?

ADD REPLY • link 7.9 years ago by CrLs ▴ 10

0

Entering edit mode

Hi, thank you for this- I was successfully able to parse the output file. How would I modify this script to add the target names to the output file? For example, I would like to parse out these IDs (I am also attaching a screenshot of my domtblout file below for reference):

66. alnta.divvy_RDRP_0.25_325-550. full
107. alnta. divvy_RDRP_0.25_1-310. full
107. alnta.divvy_RDRP_0.25_1-310.full
2046.alnta. divvy_RDRP_0.25_1-1279. full
2078.alnta.divvy_RDRP_0.25_1-1174.full
2078.alnta.divvy_RDRP_0.25_1-1174.full
1796.alnta.divvy_RDRP_0.25_950-1061.full

Essentially, I'd like the first 'column' of the domtblout file to be included in the parsed output. I've been messing around with the code for a while now and cannot seem to call those IDs properly and accidentally include other undesired info in the output. Thanks in advance!

My domtblout file from hmmscan

ADD REPLY • link 14 months ago by jen ▴ 10

0

Entering edit mode

I figured it out! Just needed to add target_name = hits[i].id

ADD REPLY • link 14 months ago by jen ▴ 10