I'm in the process of creating concensus sequences, and my specific aim is to remove any ambiguous characters (for reasons that are unimportant to the question).
I couldn't find a simple tool that would do what I want satisfactorily, so I started writing my own (until someone reminded me about HMMs but more on that is a second).
I have the following MSA as a AlignIO object:
>>> print(msa)
SingleLetterAlphabet() alignment with 16 rows and 149 columns
MSTTPEQIAVEYPIPTYRFVVSLGDEQIPFNSVSGLDISHDVIE...QAA PAU_02775
MSTTPEQIAVEYPIPTYRFVVSIGDEQIPFNSVSGLDISHDVIE...QAA PLT_01696
MSTTPEQIAVEYPIPTYRFVVSIGDEQVPFNSVSGLDISHDVIE...QAA PAK_02606
MSTTPEQIAVEYPIPTYRFVVSIGDEKVPFNSVSGLDISHDVIE...QAA PLT_01736
MTTTT----VDYPIPAYRFVVSVGDEQIPFNNVSGLDITYDVIE...QAA PAK_01896
MATTT----VDYPIPAYRFVVSVGDEQIPFNSVSGLDITYDVIE...QAA PAU_02074
MSVTTEQIAVDYPIPTYRFVVSVGDEQIPFNNVSGLDITYDVIE...QAA PLT_02424
MTITPEQIAVDYPIPAYRFVVSVGDEKIPFNNVSGLDVHYDVIE...QAP PLT_01716
MAITPEQIAVEYPIPTYRFVVSVGDEQIPFNNVSGLDVHYDVIE...QAA PLT_01758
MSTSTSQIAVEYPIPVYRFIVSIGDDQIPFNSVSGLDINYDTIE...QAV PAK_03203
MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIE...QAV PAU_03392
MSITQEQIAAEYPIPSYRFMVSIGDVQVPFNSVSGLDRKYEVIE...QVP PAK_02014
MSITQEQIAAEYPIPSYRFMVSIGDVQVPFNSVSGLDRKYEVIE...QVP PAU_02206
MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIE...EFH PAK_01787
MSTTADQIAVQYPIPTYRFVVTIGDEQMCFQSVSGLDISYDTIE...EFH PAU_01961
MSTTVDQIAVQYPIPTYRFVVTVGDEQMSFQSVSGLDISYDTIE...EFH PLT_02568
*
(Note the asterisk I've added for the moment).
If I build a concensus with hmmemit -c
("majority-rule consensus sequence"), I get the following sequence:
>hmmemit-consensus *
MSTTAEQIAVEYPIPTYRFVVSVGDEQIPFNSVSGLDISYDVIEYKDGVGNYYKMPGQRQ
AINITLRKGVFSGDTKLFDWINSIQLNQVEKKDISISLTNEAGTEILLTWSVANAFPTSL
TSPSFDATSNEVAVQEISLTADRVTIQAA
At column 23 (22 if zero based) (the asterisk), hmmemit
places a Valine (V
). My own tool however, places an Isoleucine (L
).
Part of my process is to use collections.Counter
to score the string, and in that column, I
occurs 8 times, and V
7 times. Why is hmmemit
choosing a lower frequency amino acid?
Basic process of my script (relevant parts only):
def enumerate_string(string):
"""Returns the most common characters of a string. Multiple characters are returned if there are equally frequent characters."""
from collections import Counter
counts = Counter(string)
keys = []
for key, value in counts.iteritems():
if value == max(counts.values()):
keys.append(key)
return keys
>>> msa = AlignIO.read('myalignment.fasta', 'fasta')
>>> msa_summary = AlignInfo.SummaryInfo(msa)
>>> Counter(msa_summary.get_column(22))
Counter({'I': 8, 'V': 7, 'L': 1})
>>> enumerate_string(msa_summary.get_column(22))
['I']
This clearly shows that Isoleucine is the most common, so why doesn't hmmemit
choose it?
Thanks Sean, that makes a lot of sense. I figured it was probably taking in to consideration positions in the alignment other than just the column in question, but evidently misunderstood what the majority rule was exactly doing.
Just to add a small follow up question, it's important for what I'm doing that there are no ambiguous positions in the final consensus sequence. Does
hmmemit
always call an amino acid/base or will it also output ambiguous positions?hmmemit will only output canonical residues (20 aa or 4 nt).