Question

hmmer identical target and query proteins

3.7 years ago

mnsp088 ▴ 100

Hi all, I'm running about 50 proteins of a particular species (sub_proteins.fa) against all proteins for that species (proteome.fa) using HMMER. I did this with phmmer using the following command:

phmmer --tblout results.table sub_proteins.fa proteome.fa

However, I noticed that in the resulting file (results.table) there are multiple lines where the same protein is being compared to itself (I guess because the 50 query proteins are also found in the proteome file). See below for an example:

# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- 
YLL_767           -          YLL_767           -           8.1e-177  580.6   1.4    1e-175  582.3   1.4   1.0   1   0   0   1   1   1   1

Is there any way to prevent the HMMER programs from doing this? I am concerned because, after this step, I am going to combine the query and significant targets into a model that will then be used to search against other species' proteomes. Will having redundant proteins affect my results?

Any help is greatly appreciated!

hmmer hmm hmmsearch • 1.0k views


score 2 · Accepted Answer · 2021-03-23

There is no way to prevent the program from detecting identical matches. In most instances that's the exact purpose of these programs!
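That said, if the self-hits are only a nuisance in the table, they can be dropped after the search. Below is a minimal Python sketch, assuming the --tblout layout shown in the question (target name in the first whitespace-separated column, query name in the third). The function name drop_self_hits and the second data row (YLL_123, with made-up scores) are hypothetical and only there for illustration.

```python
def drop_self_hits(lines):
    """Keep comment lines and hits whose target name differs from the query name.

    Assumes phmmer --tblout layout: column 1 = target name, column 3 = query name.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header and other comment lines pass through
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line, nothing to compare
        target, query = fields[0], fields[2]
        if target != query:
            kept.append(line)
    return kept


# Small in-memory example; the YLL_123 row is invented for illustration only.
sample = [
    "# target name accession query name accession E-value score bias",
    "YLL_767 - YLL_767 - 8.1e-177 580.6 1.4 1e-175 582.3 1.4 1.0 1 0 0 1 1 1 1",
    "YLL_123 - YLL_767 - 2.0e-30 100.0 0.5 3e-30 99.5 0.5 1.0 1 0 0 1 1 1 1",
]
filtered = drop_self_hits(sample)
for row in filtered:
    print(row)
```

To apply it to a real table, read results.table with open(...).read().splitlines() and pass the resulting list to the function. Matching is on names only, so two different entries with identical sequences would both be kept.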

As to whether that will affect your downstream results, it depends on what exactly you mean by "I am going to be combining the query and significant targets together into a model". If you mean that you plan to build a hidden Markov model from your query plus the identified sequences, then it doesn't matter that some sequences in your alignment are identical. The model-building procedure automatically down-weights identical (and even very similar) sequences, so the net effect will be as if you did not have duplicates.

If you want to convince yourself that this is the case, make an alignment of two identical sequences, or 10 identical sequences for that matter. When you run hmmbuild on that alignment, it will print a summary showing how many sequences are in the alignment (nseq), followed by the effective number of sequences (eff_nseq). No matter how many identical sequences you put in your alignment, eff_nseq will be 1 at most, and more likely smaller than 1. That means that even though the alignment has X sequences, the HMM-building program will act as if it has seen only one of them. Even for large alignments of non-identical sequences, say 2000 of them, eff_nseq will rarely go into double digits unless your sequences are truly diverse.
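One way to set up that experiment is to generate the test alignment with a few lines of Python. This is only a sketch: the helper name identical_stockholm, the output file names, and the peptide sequence are arbitrary choices made for illustration, and it assumes hmmbuild reads the Stockholm alignment format.

```python
def identical_stockholm(sequence, copies):
    """Return a Stockholm-format alignment holding `copies` identical sequences."""
    lines = ["# STOCKHOLM 1.0", ""]
    for i in range(1, copies + 1):
        # Each row needs a unique name; the residues are the same in every row.
        lines.append(f"copy{i:<6} {sequence}")
    lines.append("//")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Arbitrary made-up peptide, used only to have something to align.
    peptide = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
    with open("identical.sto", "w") as handle:
        handle.write(identical_stockholm(peptide, 10))
    # Then, on the command line:
    #   hmmbuild identical.hmm identical.sto
    # and compare the nseq and eff_nseq columns in the printed summary.
```

Changing the number of copies and rerunning hmmbuild shows whether eff_nseq moves with nseq.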