Dear community,
I'm interested in retrieving deep homologs for a number of genes that belong to a protein superfamily (let's say, GPCRs). One of the strategies I tried was to run HMMER searches, using either an alignment or an HMM built from an alignment. From what I have read, many people use specific protein domains to decide which of the proteins found are true matches. In my case, my proteins don't have a specific domain characterizing them, and share a common domain with the rest of the family (e.g. the 7TM domain). Therefore, although my search returns a good number of good matches (proteins previously identified in the database as homologs of my query genes), a number of other proteins from the family appear too, which makes it hard to determine whether the uncharacterized proteins in my search are true matches or not. I tried to improve this approach by using different domain architectures, but I'm still retrieving false matches. I also tried adjusting the E-value and bit score thresholds, and using a different kind of search (e.g. iterative search), but I haven't found a fully satisfactory way to tackle the issue.
Any thoughts?
Thank you!
The sentence "In my case, my proteins don't have a specific domain characterizing them, and share a common domain with the rest of the family (e.g. the 7TM domain)." is confusing. Are you trying to find a "subfamily"? By the way, I think you will have to build phylogenetic trees in that case.
Well, let's say that we have a big receptor family, like GPCRs, which contains receptors for a broad array of neurotransmitters. I'm interested in a specific subgroup (receptors of a specific neurotransmitter), and these receptors don't have a specific domain characterizing them other than the 7TM (7-transmembrane) domain, which is common to all GPCRs. I thought about building phylogenetic trees, but the number of GPCRs and species matching a given HMM query is far too large to build a tree, so I'm trying to improve my filtering (either by adjusting the search thresholds or by improving my query) in order to reduce the set of putative proteins to a more manageable size. I know that there are several strategies for identifying proteins below the domain level, such as fingerprints or InterPro protein family predictions, but I think those can still introduce errors (e.g. misassigned proteins), which is why I'm asking whether there is a different strategy that I'm not aware of yet and that works better.
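To make the threshold filtering above concrete, here is a minimal sketch that parses the per-sequence table written by `hmmsearch --tblout` and keeps only hits passing an E-value and a bit score cutoff. The function names, the cutoff values and the sample table are all hypothetical and only illustrate the idea; suitable cutoffs would have to be calibrated against known members of the subgroup.

```python
def parse_tblout(text):
    """Parse the per-sequence table from `hmmsearch --tblout`.

    Each data line has 18 whitespace-separated columns followed by a
    free-text description; lines starting with '#' are comments.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 18)
        hits.append({
            "target": fields[0],
            "query": fields[2],
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits


def filter_hits(hits, max_evalue=1e-10, min_score=50.0):
    """Keep hits that pass both cutoffs (cutoff values are assumptions)."""
    return [h for h in hits
            if h["evalue"] <= max_evalue and h["score"] >= min_score]


# Made-up table for illustration only, not real search output.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias ...
seqA - query_hmm - 1e-80 270.5 0.1 2e-80 269.9 0.1 1.0 1 0 0 1 1 1 1 hypothetical receptor A
seqB - query_hmm - 3e-12 45.2 0.3 5e-12 44.8 0.3 1.1 1 0 0 1 1 1 1 hypothetical receptor B
seqC - query_hmm - 1e-05 25.0 1.2 2e-05 24.1 1.2 1.3 1 0 0 1 1 1 0 unrelated 7TM protein
"""

if __name__ == "__main__":
    kept = filter_hits(parse_tblout(SAMPLE_TBLOUT))
    print([h["target"] for h in kept])
```

Cutoffs like these only shrink the candidate list. They do not by themselves separate the subgroup from the rest of the family, so the surviving hits would still need a tree or a residue-level check.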
I think checking the bootstrap values of phylogenetic trees and checking the conservation of important residues in alignments are essential for such a detailed analysis. (There are some automatic approaches such as OrthoMCL (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403725/), but as I haven't tried this software, I can't say whether it would give the expected result. Just FYI.)
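As a rough sketch of the residue-conservation check, the snippet below takes an alignment (as a plain dict of name to aligned sequence), maps positions of a reference sequence to alignment columns, and reports which candidates carry an allowed residue at every key position. The function names, the toy sequences and the key positions are hypothetical; the real positions would come from what is known about the subgroup of interest.

```python
def column_of_reference_position(ref_aligned, ref_pos):
    """Map a 1-based ungapped reference position to a 0-based alignment column."""
    seen = 0
    for col, ch in enumerate(ref_aligned):
        if ch not in "-.":
            seen += 1
            if seen == ref_pos:
                return col
    raise ValueError(f"reference has fewer than {ref_pos} residues")


def check_key_residues(alignment, ref_name, key_residues):
    """Return {candidate: True/False} for conservation of all key residues.

    alignment:    {name: aligned sequence}, all of equal length
    key_residues: {reference position (1-based): set of allowed amino acids}
    """
    ref = alignment[ref_name]
    cols = {pos: column_of_reference_position(ref, pos) for pos in key_residues}
    report = {}
    for name, seq in alignment.items():
        if name == ref_name:
            continue
        report[name] = all(seq[cols[pos]].upper() in allowed
                           for pos, allowed in key_residues.items())
    return report


# Toy alignment and key positions, invented for illustration.
TOY_ALIGNMENT = {
    "reference": "MK-DRYLA",
    "cand_a":    "MKTDRYLA",
    "cand_b":    "MK-ERHLA",
}
KEY_RESIDUES = {3: {"D", "E"}, 4: {"R"}, 5: {"Y"}}

if __name__ == "__main__":
    print(check_key_residues(TOY_ALIGNMENT, "reference", KEY_RESIDUES))
```

A check like this is cheap enough to run on every hit before committing to a tree, but it assumes the alignment is trustworthy around the key positions.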