Question

Conflicting consensus sequences from different tools (bug in hmmemit?)

1

Entering edit mode

8.2 years ago

Joe 22k

I'm in the process of creating concensus sequences, and my specific aim is to remove any ambiguous characters (for reasons that are unimportant to the question).

I couldn't find a simple tool that would do what I want satisfactorily, so I started writing my own (until someone reminded me about HMMs but more on that is a second).

I have the following MSA as a AlignIO object:

>>> print(msa)
SingleLetterAlphabet() alignment with 16 rows and 149 columns
MSTTPEQIAVEYPIPTYRFVVSLGDEQIPFNSVSGLDISHDVIE...QAA PAU_02775
MSTTPEQIAVEYPIPTYRFVVSIGDEQIPFNSVSGLDISHDVIE...QAA PLT_01696
MSTTPEQIAVEYPIPTYRFVVSIGDEQVPFNSVSGLDISHDVIE...QAA PAK_02606
MSTTPEQIAVEYPIPTYRFVVSIGDEKVPFNSVSGLDISHDVIE...QAA PLT_01736
MTTTT----VDYPIPAYRFVVSVGDEQIPFNNVSGLDITYDVIE...QAA PAK_01896
MATTT----VDYPIPAYRFVVSVGDEQIPFNSVSGLDITYDVIE...QAA PAU_02074
MSVTTEQIAVDYPIPTYRFVVSVGDEQIPFNNVSGLDITYDVIE...QAA PLT_02424
MTITPEQIAVDYPIPAYRFVVSVGDEKIPFNNVSGLDVHYDVIE...QAP PLT_01716
MAITPEQIAVEYPIPTYRFVVSVGDEQIPFNNVSGLDVHYDVIE...QAA PLT_01758
MSTSTSQIAVEYPIPVYRFIVSIGDDQIPFNSVSGLDINYDTIE...QAV PAK_03203
MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIE...QAV PAU_03392
MSITQEQIAAEYPIPSYRFMVSIGDVQVPFNSVSGLDRKYEVIE...QVP PAK_02014
MSITQEQIAAEYPIPSYRFMVSIGDVQVPFNSVSGLDRKYEVIE...QVP PAU_02206
MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIE...EFH PAK_01787
MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIE...EFH PAU_01961
MSTTVDQIAVQYPIPTYRFVVTVGDEQMSFQSVSGLDISYDTIE...EFH PLT_02568
                      *

(Note the asterisk I've added for the moment).

If I build a concensus with hmmemit -c ("majority-rule consensus sequence"), I get the following sequence:

>hmmemit-consensus    *
MSTTAEQIAVEYPIPTYRFVVSVGDEQIPFNSVSGLDISYDVIEYKDGVGNYYKMPGQRQ
AINITLRKGVFSGDTKLFDWINSIQLNQVEKKDISISLTNEAGTEILLTWSVANAFPTSL
TSPSFDATSNEVAVQEISLTADRVTIQAA

At column 23 (22 if zero based) (the asterisk), hmmemit places a Valine (V). My own tool however, places an Isoleucine (L). Part of my process is to use collections.Counter to score the string, and in that column, I occurs 8 times, and V 7 times. Why is hmmemit choosing a lower frequency amino acid?

Basic process of my script (relevant parts only):

def enumerate_string(string):
    """Returns the most common characters of a string. Multiple characters are returned if there are equally frequent characters."""

    from collections import Counter
    counts = Counter(string)

    keys = []
    for key, value in counts.iteritems():
        if value == max(counts.values()):
            keys.append(key)

    return keys

>>> msa = AlignIO.read('myalignment.fasta', 'fasta')
>>> msa_summary = AlignInfo.SummaryInfo(msa)

 >>> Counter(msa_summary.get_column(22))
Counter({'I': 8, 'V': 7, 'L': 1})
>>> enumerate_string(msa_summary.get_column(22))
['I']

This clearly shows that Isoleucine is the most common, so why doesn't hmmemit choose it?

hmm consensus bipython python • 2.7k views

ADD COMMENT • link updated 8.2 years ago by cryptogenomicon ▴ 160 • written 8.2 years ago by Joe 22k

score 4 · Accepted Answer · 2017-08-30

4

Entering edit mode

8.2 years ago

cryptogenomicon ▴ 160

Because hmmemit is emitting the maximum likelihood sequence, according to the profile HMM's parameterization.

("majority-rule" in hmmemit means w.r.t. the profile HMM, not w.r.t. the MSA that the HMM was built from. hmmemit takes the HMM as input, not the MSA.)

ADD COMMENT • link 8.2 years ago by cryptogenomicon ▴ 160

0

Entering edit mode

Thanks Sean, that makes a lot of sense. I figured it was probably taking in to consideration positions in the alignment other than just the column in question, but evidently misunderstood what the majority rule was exactly doing.

Just to add a small follow up question, it's important for what I'm doing that there are no ambiguous positions in the final consensus sequence. Does hmmemit always call an amino acid/base or will it also output ambiguous positions?