Parsing HMMER output with BioPerl
1
0
Entering edit mode
10.2 years ago
rwn ▴ 610

Dear all,

I'm attempting to parse the standard output from HMMER 'hmmsearch' using BioPerl's Bio::SearchIO module. I want to access some basic information for each HSP, such as %-identity and %-query coverage, but am running into some difficulties. Here is a MWE:

#!/usr/bin/perl -w

use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO->new( -file => $ARGV[0], -format => 'hmmer' );
while ( my $result = $in -> next_result ) {
    ## this is a Bio::Search::Result::HMMERResult object:
    print "Query name: ".$result->query_name."\n";
    ## this should be a Bio::Search::Hsp::HMMERHSP object (but it doesn't behave like one):
    while( my $hit = $result->next_hit ) {
        while( my $hsp = $hit->next_hsp ) {            
            my $percent_id = sprintf("%.2f", $hsp->percent_identity);
            my $percent_q_coverage = sprintf("%.2f", ((($hsp->length('query')/($result->query_length)))*100));
            print "\tHit name: ".$hit->name()."\n";
            print "\tIdentity: ".$percent_id."\%\n";
            print "\tFraction identical: ".$hsp->frac_identical('total')."\n";
            print "\tNumber conserved residues: ".$hsp->num_conserved."\n";
            print "\tQuery length: ".$result->query_length."\n";
            print "\tAlignment length: ".$hsp->length('query')."\n";
            print "\tQuery coverage: $percent_q_coverage\%\n";

            ## schema:
            if ( ($percent_id >= 80) and ($percent_q_coverage >= 80) ) {
                print "\t** Result: PRESENT\n\n";
            } elsif ( ($percent_id >= 80) and ($percent_q_coverage < 80) ) {
                print "\t** Result: TRUNCATED\n\n";
            } else {
                print "\t** Result: ABSENT\n\n";
            }
        }
    }
}

However, this does not return the expected values for % identity, or for other stats like the proportion of identical sites or number of conserved sites (it returns 0 values in all cases), although it does return the correct query length information. A typical output looks something like:

Query name: querySeq_X
    Hit name: hitSeq_Y
    Identity: 0%
    Fraction identical: 0
    Number conserved residues: 0
    Query length: 906
    Alignment length: 906
    Query coverage: 100.00%
    ** Result: ABSENT

Even when I know that querySeq_X and hitSeq_Y are identical (ie., identity should be 100%).

I suspect it's because my script is not accessing the HSP alignments themselves...? I've searched across CPAN etc., but while there are a number of HMMER-related modules available there is not much in the way of working examples for me to figure out what's going on. How can I parse the alignments themselves to access these kinds of data?

Any help would be much appreciated!

Cheers.

HMMER Bioperl • 3.9k views
ADD COMMENT
0
Entering edit mode

I don't see anything wrong with your code, which indicates that there may be an issue with the input. Please tell us the version of BioPerl and HMMER you used because that may be a factor.

ADD REPLY
0
Entering edit mode

Hi SES, I'm using HMMER 3.1b1 and Bioperl 1.6.901 I believe. I think perhaps Michael has a point below - I know that HMMs don't produce an alignment score per se (rather, a sequence score), but I'm surprised you can't get info such as the % identity etc as it clearly does output alignments for each domain hit (with a identical/conserved midline etc.).

A lot of the SearchIO::HowTo documentation talks about these domain alignments being the equivalent (from a parsing ability point of view) to a BLAST HSP, and so should be accessible using standard Bio::SearchIO syntax as my basic script. I also found some of the methods (eg. my @h_ident = $hsp->seq_inds('query', 'identical');) work but others do not. This makes me suspect it is a bug in the BioPerl code.....

ADD REPLY
0
Entering edit mode
10.2 years ago
Michael 55k

I think that %-identity, number of identical bases, etc. do not apply when using a HMM for sequence search.

The closest you could get is the optimal emission probability for the aligned sequence. In an HMM there are not only exact matches and mismatches, but emission probabilities for each symbol in each state.

ADD COMMENT

Login before adding your answer.

Traffic: 1731 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6