Question

Hmmbuild: How To Choose The Best Alignment For Hmm Model

7

13.7 years ago

Leszek 4.2k

Hi,

I wonder whether it's better to remove weakly aligned parts of proteins from an MSA or to keep them when building an HMM. Case: let's say I have a bunch of homologs and I want to build an HMM (hidden Markov model) to detect their homologs in distinct species. Questions:

Shall I use all available homologs, or is there some reasonable limit (min: 5 or 15? max: 50, 100, 200)? I keep in mind that the alignment gets worse the more sequences are incorporated, and MSA software has its limitations as well.
Which MSA program would you recommend? Personally, I like MUSCLE a lot, but I'm aware that MAFFT and T-Coffee perform better (though they are slower).
Or shall I run several aligners and use a consistency-based alignment (M-Coffee)?
Shall I trim badly aligned fragments (trimAl or Gblocks)?

Cheers,

hmm hmmer alignment • 9.2k views

updated 6.2 years ago by michau ▴ 60 • written 13.7 years ago by Leszek 4.2k

0

Hi Jarretinha. Is it possible to obtain a seed or an alignment for a specific subfamily? I'm looking for F1/Fo ATP synthase subunit C (atpE, atpH, atp9, atp5G), and Pfam only has a seed for all ATP synthases (they are non-orthologous, so I won't use them in my phylogenetic analysis).

6.2 years ago by michau ▴ 60

score 6 · Answer 1 · 2011-03-05

Unless your protein is something new, the best way to proceed is to look for manually pre-aligned seeds in places like Pfam or Khader Shameer's 3PFDB. With a seed in hand, you don't need to align from scratch, and you can use it to search for homologs at a specified distance. Remember that programs like hmmbuild down-weight closely related sequences and up-weight distant ones during HMM building. So you should select mostly distant candidates. Keep a few close ones just to sustain a little more homology signal.
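
To make the down-weighting concrete, here is a minimal sketch of Henikoff-style position-based sequence weighting, the family of scheme that hmmbuild's default `--wpb` option refers to. This is not HMMER's actual code: the function name, the toy alignment and the choice to simply skip gap characters are assumptions made for illustration.

```python
from collections import Counter


def position_based_weights(alignment, gap_chars="-."):
    """Return Henikoff-style position-based weights, normalized to sum to 1.

    `alignment` is a list of equal-length aligned sequences (strings).
    Gap characters are skipped, which is a simplifying assumption.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        residues = [row[col] for row in alignment]
        counts = Counter(r for r in residues if r not in gap_chars)
        if not counts:
            continue  # an all-gap column contributes nothing
        k = len(counts)  # number of distinct residue types in this column
        for i, r in enumerate(residues):
            if r not in gap_chars:
                # A residue shared by many sequences earns each of them less.
                weights[i] += 1.0 / (k * counts[r])

    total = sum(weights)
    return [w / total for w in weights] if total else weights


# Hypothetical toy alignment: two identical sequences and one divergent one.
toy_alignment = ["MKVLA", "MKVLA", "MRILS"]
weights = position_based_weights(toy_alignment)
for seq, w in zip(toy_alignment, weights):
    print(seq, round(w, 3))
```

In this toy case the two identical sequences share their weight while the divergent one gets the largest share, which is why piling up near-identical homologs adds little to the model.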

The main question is: which alignment tool should I use? Well, if you have a seed, then most aligners will return pretty much the same result after a realignment. But aligning from scratch can be painful. I recommend PRANK, which is phylogeny-aware, or ClustalW with the iterate-each-step option turned on (very, very, very slow but way more reliable).

Anyway, you should consider the Pfam approach, i.e., searching piece by piece ((sub)domains, motifs, etc.). I've used this to reannotate selenoproteins in Kinetoplastida and was able to find things that even the JGI crew missed.

score 0 · Answer 2 · 2011-03-04

If you have too many homologs, selecting the best representatives (seeds) of the family will be useful. Build the seed alignment from them, then use it as a guide to build the complete alignment (for example with hmmalign, which aligns the remaining sequences to an HMM built from the seed).
Clustal would be my personal preference if you have just one or two data sets. If you have many, MUSCLE is a good one to try.
Instead of using multiple aligners, add structural information if it is available. Check whether the domains and motifs within the sequences are aligning well, and if possible edit the alignments manually to improve them.
Trimming can be tried in poorly aligned regions such as the N- and C-termini, or in regions with large intermittent loops.
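
As a rough illustration of trimming, the sketch below drops alignment columns in which the fraction of gaps exceeds a threshold. It is a simplified stand-in and not the algorithm of trimAl or Gblocks; the function name, the 0.5 threshold and the toy alignment are assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Drop columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` is a list of equal-length aligned sequences (strings).
    Returns the trimmed sequences in the same order.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    n = len(alignment)
    keep = []
    for col in range(length):
        gaps = sum(1 for row in alignment if row[col] in gap_chars)
        if gaps / n <= max_gap_fraction:
            keep.append(col)
    return ["".join(row[c] for c in keep) for row in alignment]


# Hypothetical toy alignment with ragged N- and C-termini.
toy = ["--MKVLA-", "-AMKVLAG", "--MRILS-", "--MKILA-"]
trimmed = trim_gappy_columns(toy)
print(trimmed)
```

Here only the well-populated core columns survive; the ragged ends, where a single sequence extends past the others, are removed.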

For all of this you need thorough information about the protein architecture in order to achieve a better alignment.
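
The seed-then-extend workflow from this answer could be sketched with the HMMER command-line tools roughly as follows. The script only assembles and prints the commands; actually running them requires HMMER to be installed. All file names (`seed.sto`, `family.hmm`, `homologs.fasta`, `full.sto`) are hypothetical placeholders.

```python
import shlex


def hmmbuild_cmd(hmm_out, seed_msa):
    """Build a profile HMM from the seed alignment."""
    return ["hmmbuild", hmm_out, seed_msa]


def hmmalign_cmd(hmm, seqs, out_msa, seed_msa=None):
    """Align the remaining sequences to the profile.

    If `seed_msa` is given, --mapali merges the seed alignment (the one the
    HMM was built from) into the output alignment.
    """
    cmd = ["hmmalign", "-o", out_msa]
    if seed_msa is not None:
        cmd += ["--mapali", seed_msa]
    return cmd + [hmm, seqs]


# Hypothetical file names, used only for illustration.
commands = [
    hmmbuild_cmd("family.hmm", "seed.sto"),
    hmmalign_cmd("family.hmm", "homologs.fasta", "full.sto", seed_msa="seed.sto"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

To execute a step instead of printing it, each list can be passed to `subprocess.run(cmd, check=True)`.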