Question

Should I use HMMs from Pfam or build them from scratch?

1

4.1 years ago

garfield320 ▴ 20

Sorry if this is a really simple question. I'm just starting to teach myself how to build a profile HMM and am feeling pretty swamped by all the concepts and jargon.

I'm trying to decide whether I should use raw HMMs available from Pfam, TIGRFAMs, etc. or build one myself.

Say I'm really interested in the soil microbiome. I searched for PF00246 in the Pfam 33.1 database and looked at the phylogenetic tree on the "Trees" tab. The tree included a wide range of organisms: humans, mice, cows, fruit flies, etc. But I only want to include soil yeast in my tree, so that I don't have to look at very distantly related organisms. I think I have two options: 1) Download the raw HMM from the Pfam database and make a "full alignment" by searching the soil microbiome database. In this case the alignment would end up containing only sequences from the soil microbiome, so the other organisms represented in the raw HMM wouldn't matter. 2) Build a new raw HMM from a new seed alignment that only includes organisms in the soil microbiome, then use that HMM to search the soil microbiome database. Which option would be better?

An additional question: if the second option is better, I'm not sure I understand how people decide which sequences are reliable enough to include in a seed alignment. For instance, how were the seed alignments in these Pfam entries created? Do the curators try to include as many different organisms as possible? Or is there some sort of algorithm that "scores" how reliable each alignment is?

pfam hmm • 1.5k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 4.1 years ago by garfield320 ▴ 20

score 1 · Answer 1 · 2020-10-13

1

4.1 years ago

Mensur Dlakic ★ 28k

HMMs in Pfam are built and updated by professional curators who do this for a living. Whether you should rebuild them depends on your goals and on how comfortable you are building curated alignments. Based on what you have told us about yourself, it seems you should use the HMMs that are already available. Once you download the HMM, you can build an alignment for only the select group of sequences that interest you. For that task you will need a program called hmmalign, which is part of the HMMER package. From that alignment you can make a tree that contains only your species of interest.
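A rough sketch of that workflow, under stated assumptions rather than as a definitive recipe: the snippet below only assembles the HMMER command lines, so it can be read and run without HMMER installed. All file names (`PF00246.hmm`, `soil_proteins.fasta`, and so on) are hypothetical placeholders. Using `--cut_ga` (the gathering thresholds stored in Pfam HMM files) and `--trim` is an assumption about sensible defaults, not a requirement. Pulling the hit sequences out of the database between the two steps (for example with `esl-sfetch` from Easel) is left out.

```python
import shlex


def hmmsearch_cmd(hmm, seqdb, tblout):
    """Command to search a sequence database with a profile HMM."""
    # --cut_ga applies the gathering thresholds stored in Pfam HMM files;
    # --tblout writes a per-sequence hit table that is easy to parse.
    return ["hmmsearch", "--cut_ga", "--tblout", tblout, hmm, seqdb]


def hmmalign_cmd(hmm, hits_fasta, out_sto):
    """Command to align selected sequences to the same profile HMM."""
    # --trim drops unaligned flanking residues; -o names the output file,
    # which hmmalign writes in Stockholm format by default.
    return ["hmmalign", "--trim", "-o", out_sto, hmm, hits_fasta]


# Hypothetical file names, for illustration only.
search = hmmsearch_cmd("PF00246.hmm", "soil_proteins.fasta", "soil_hits.tbl")
align = hmmalign_cmd("PF00246.hmm", "soil_hits.fasta", "soil_hits.sto")

for cmd in (search, align):
    print(shlex.join(cmd))
```

The resulting Stockholm alignment contains only the sequences you supplied, so a tree built from it is limited to your organisms of interest even though the HMM itself was trained on a much broader seed.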

Sequences to be included in alignments are found by database searches, which often involve multiple passes. The alignments are built from the identified sequences and are often adjusted manually by curators who know how to do it.
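To make "multiple passes" a bit more concrete, here is a minimal sketch of a generic build/search/realign loop with HMMER. It is an illustration only and not a description of Pfam's actual curation pipeline. The function name, the file naming scheme, and the `1e-5` E-value cutoff are all assumptions. The code only lists the commands, and it assumes a separate step (not shown) extracts each round's hits into a FASTA file. Manual adjustment of the alignment, which is the part curators are good at, cannot be captured here.

```python
def iterative_round_cmds(seed_sto, seqdb, rounds=2, prefix="round"):
    """List HMMER commands for a simple build/search/realign loop."""
    cmds = []
    msa = seed_sto  # the first round starts from the seed alignment
    for i in range(1, rounds + 1):
        hmm = f"{prefix}{i}.hmm"
        tbl = f"{prefix}{i}.tbl"
        hits = f"{prefix}{i}_hits.fasta"  # produced by a separate extraction step
        new_msa = f"{prefix}{i}.sto"
        # Build a profile from the current alignment.
        cmds.append(["hmmbuild", hmm, msa])
        # Search the database with that profile (cutoff is an assumption).
        cmds.append(["hmmsearch", "-E", "1e-5", "--tblout", tbl, hmm, seqdb])
        # Align the extracted hits to the profile; this feeds the next round.
        cmds.append(["hmmalign", "-o", new_msa, hmm, hits])
        msa = new_msa
    return cmds


# Hypothetical inputs, for illustration only.
plan = iterative_round_cmds("seed.sto", "soil_proteins.fasta")
for cmd in plan:
    print(" ".join(cmd))
```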

ADD COMMENT • link 4.1 years ago by Mensur Dlakic ★ 28k

0

Thanks for pointing me to the hmmalign program! Just out of curiosity: if I were, in theory, able to create good-quality seed alignments and build a new raw HMM from them, would searching the soil microbiome database with that HMM give me better results than searching the same database with the already-available HMM?

ADD REPLY • link 4.1 years ago by garfield320 ▴ 20

0

It is impossible to give a general answer. I am sure there are HMMs in Pfam that already detect all members of a given family, and no amount of model rebuilding will add anything new. But there are probably some HMMs that can be improved, in particular when it comes to detecting subsets of large protein families. Yours could be in that category, because the number of identified proteins in NCBI is more than 75 thousand. Still, it becomes a question of a potentially small gain versus a large effort. For example, if the existing HMM already finds 98% of sequences and you can build one that detects 98.5% after careful curation, will that really matter? It is very unlikely that anyone other than the Pfam curators can build a model that identifies a significantly higher number of sequences, because they know the process better than most.
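If you do want to measure the gain rather than guess at it, one option is to search the same database with both models and compare the hit lists. The sketch below assumes both searches were run with `hmmsearch --tblout`, where the first column is the target name and the fifth is the full-sequence E-value. The function names and the `1e-5` cutoff are assumptions, and the input lines are made-up toy rows (truncated to the first few columns), not real results.

```python
def read_tblout_hits(lines, max_evalue=1e-5):
    """Collect target names from hmmsearch --tblout lines under an E-value cutoff."""
    hits = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target = fields[0]
        evalue = float(fields[4])  # full-sequence E-value column
        if evalue <= max_evalue:
            hits.add(target)
    return hits


def compare_hits(pfam_hits, custom_hits):
    """Count hits shared by both models and hits unique to each."""
    return {
        "shared": len(pfam_hits & custom_hits),
        "pfam_only": len(pfam_hits - custom_hits),
        "custom_only": len(custom_hits - pfam_hits),
    }


# Toy rows, invented for illustration: target, acc, query, acc, E-value, score
pfam_tbl = [
    "# target name  accession  query name  accession  E-value  score",
    "seq1 - model_pfam - 1e-40 140.2",
    "seq2 - model_pfam - 3e-12 48.9",
    "seq3 - model_pfam - 2e-08 35.1",
    "seq5 - model_pfam - 0.5 8.0",
]
custom_tbl = [
    "seq2 - model_custom - 1e-15 58.3",
    "seq3 - model_custom - 4e-10 41.7",
    "seq4 - model_custom - 6e-07 30.2",
]

summary = compare_hits(read_tblout_hits(pfam_tbl), read_tblout_hits(custom_tbl))
print(summary)
```

With real tables you would pass an open file handle instead of a list. If the "custom_only" count stays small relative to "shared", the rebuilt model is probably not worth the curation effort.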

ADD REPLY • link 4.1 years ago by Mensur Dlakic ★ 28k

0

I'm still a bit confused by what you're saying. I'm sure the curators know what they're doing and are good at it, but the curated lists still include sequences from many different representative kingdoms. That was why I had assumed those lists wouldn't contain exhaustive information on a specific phylum or class, and that adding more sequences from that phylum or class to the seed alignment and building a new HMM might work better. Are you saying that the curators include enough representative sequences down to the phylum/class level that HMMs built from those seed alignments can still identify many sequences when run against the database? Or that, even without an HMM built from a seed restricted to a subset of organisms, the existing HMM can still find a very high number of sequences because of the quality of the curation process?

ADD REPLY • link 4.1 years ago by garfield320 ▴ 20