Sorry if this is a really simple question. I'm just starting to teach myself how to build a profile HMM and am feeling pretty swamped by all the concepts and jargon.
I'm trying to decide whether I should use raw HMMs available from Pfam, TIGRFAMs, etc. or build one myself.
Say I'm really interested in the soil microbiome. I searched for PF00246 in the Pfam 33.1 database and looked at the phylogenetic tree on the "Trees" tab. The tree included a wide range of organisms - humans, mice, cows, fruit flies, etc. But I only want to include soil yeast in my tree, so that I don't need to look at really distantly related organisms. I think I have two options. 1) Download the raw HMM from the Pfam database and make a "full alignment" by searching the soil microbiome database. In this case every sequence I find would come from the soil microbiome, so the other organisms represented in the raw HMM wouldn't matter. 2) Build a new raw HMM from a new seed alignment that only includes organisms in the soil microbiome, then use that HMM to search the soil microbiome database. Which option would be better?
- An additional question - if the second option is better, I'm not sure if I understand how people determine which sequences are reliable enough to include as seed alignments. For instance, how were the seed alignments in these Pfam entries created? Do they try to include as many different organisms as possible? Or is there some sort of algorithm to "score" how reliable each alignment is?
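For concreteness, the two options above can be sketched as HMMER command lines. This is only a minimal sketch: it assumes HMMER 3 (`hmmbuild`, `hmmsearch`) is installed, and every file name and both helper functions are hypothetical placeholders. The code only assembles the commands; it does not run them.

```python
import shlex


def option1_cmds(pfam_hmm, soil_db, out_tbl):
    """Option 1: search the soil database with the existing Pfam model."""
    # hmmsearch [options] <hmmfile> <seqdb>; --tblout writes a per-sequence hit table
    return [["hmmsearch", "--tblout", out_tbl, pfam_hmm, soil_db]]


def option2_cmds(seed_msa, new_hmm, soil_db, out_tbl):
    """Option 2: build a model from your own seed alignment, then search."""
    return [
        # hmmbuild [options] <hmmfile_out> <msafile>
        ["hmmbuild", new_hmm, seed_msa],
        ["hmmsearch", "--tblout", out_tbl, new_hmm, soil_db],
    ]


if __name__ == "__main__":
    # All paths below are made-up examples, not real files.
    commands = option1_cmds("PF00246.hmm", "soil_proteins.fasta", "pfam_hits.tbl")
    commands += option2_cmds(
        "soil_seed.sto", "soil_custom.hmm", "soil_proteins.fasta", "custom_hits.tbl"
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

The only difference between the two routes is where the model comes from; the search step against the soil database is identical, which makes the resulting hit tables directly comparable.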
Thanks for pointing me to the hmmalign program! Just out of curiosity, if I were theoretically able to create good-quality seed alignments to build a new raw HMM, would searching that specific HMM profile against the soil microbiome database give me better results than searching the already-available HMM profile against the same soil microbiome database?

Impossible to give a general answer. I am sure that there are HMMs in Pfam that already detect all members of a given family, and there is nothing new to be added no matter how much model rebuilding is done. But there are probably some HMMs that can be improved, in particular when it comes to detecting subsets of large protein families. Yours could be in that category because the number of identified proteins in NCBI is >75 thousand. Still, it becomes a question of potentially small gain vs large effort. For example, if the existing HMM already finds 98% of sequences and you can build one that will detect 98.5% after careful curation, will that really matter? It is very unlikely that anyone outside of the Pfam curators can build a model that will identify a significantly higher number of sequences, because they know the process better than most.
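One way to judge the "small gain vs large effort" point on your own data is to run both models against the same database and compare the hit lists. Below is a minimal sketch, assuming both searches were saved with `hmmsearch --tblout`, where the first whitespace-separated column is the target name and the fifth is the full-sequence E-value. The function names and the 1e-5 cutoff are illustrative assumptions, not recommended settings.

```python
def read_tblout_hits(lines, max_evalue=1e-5):
    """Collect target names from hmmsearch --tblout lines below an E-value cutoff."""
    hits = set()
    for line in lines:
        # Skip blank lines and the '#' header/footer lines of the table.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target = fields[0]          # column 1: target (sequence) name
        evalue = float(fields[4])   # column 5: full-sequence E-value
        if evalue <= max_evalue:    # cutoff is an arbitrary example value
            hits.add(target)
    return hits


def compare_hits(pfam_hits, custom_hits):
    """Split hits into those found by both models and those unique to each."""
    return {
        "shared": pfam_hits & custom_hits,
        "pfam_only": pfam_hits - custom_hits,
        "custom_only": custom_hits - pfam_hits,
    }
```

In practice you would pass open file handles, e.g. `read_tblout_hits(open("pfam_hits.tbl"))` (a hypothetical file name). If `custom_only` is empty or tiny, the rebuilt model is probably not worth the curation effort; any sequences that do land there still need checking, since they could be false positives rather than genuinely new family members.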
I'm still a bit confused about what you're saying. I'm sure the curators know what they're doing and are good at it, but the curated lists still include sequences from many different kingdoms. That was why I had assumed those lists wouldn't contain exhaustive information on a specific phylum/class, and that adding more sequences from that phylum/class to the seed alignment and creating a new HMM might be better. Are you saying that the curators do include enough representative sequences down to the phylum/class level, so that HMMs created from those seed alignments would still be able to identify many sequences when run against the database? Or that, even without a seed drawn from that subset of organisms, the HMM can still find a very high number of sequences because of the quality of the curation process?