How are the observed frequencies obtained via HHblits?
8 weeks ago
Tze Jet • 0

I recently read a paper on predicting DNA-binding residues (DRNApred), and I would like to know how the 20 observed frequencies for each amino acid are obtained via HHblits. From my understanding, you use HHblits to generate an MSA from the input sequence against your database, then obtain the 20 observed frequencies by processing the output MSA file. The search would be run with a command such as:

hhblits -i input_sequence.fasta -d /path/to/nr -o output.hhr -oa3m output.a3m
  1. Is this chain of thought correct?
  2. Are there any ready-made tools that convert the MSA file to an observed frequency matrix?
  3. In this sense, is it similar to a PSSM matrix generated via PSI-BLAST?
DRNApred Protein HHBlits HMM • 264 views
8 weeks ago
Mensur Dlakic ★ 28k

All profile-based search programs (PSI-BLAST, HHblits, jackhmmer) generate alignment column frequencies on the fly, as that is how they iterate to find more homologs. Yet this part of the search process is built in and is usually not exposed as a separate step or output of the pipeline.

I have solved this by building HHM (not HMM) files from HHblits alignments, which can be done using hhmake from the same suite. Alignment frequencies can then be extracted from the resulting HHM file. It is not trivial and requires some programming, but it can be done. The link below explains the file format in which the frequencies are saved.

https://github.com/soedinglab/hh-suite/wiki#hhsearchhhblits-model-format-hhm-format
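
As a rough illustration of the extraction step, here is a minimal Python sketch. It assumes the layout described at the link above: an "HMM" line listing the 20 amino acids, two transition header lines, and then one emission line per match column whose 20 values are stored as -1000 * log2(frequency), with "*" standing for a frequency of zero. The function names are mine, not part of HH-suite, and the parser is untested against every HH-suite version.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the emission lines


def decode(field):
    """Convert one HHM field to a probability ('*' encodes zero)."""
    return 0.0 if field == "*" else 2.0 ** (-int(field) / 1000.0)


def parse_hhm_frequencies(lines):
    """Return a list of (residue, {amino_acid: frequency}), one per match column."""
    it = iter(lines)
    # Skip the header and sequence section up to the "HMM  A C D ..." line.
    for line in it:
        fields = line.split()
        if fields[:1] == ["HMM"] and len(fields) == 21:
            break
    else:
        raise ValueError("no HMM section found")
    next(it)  # transition labels (M->M, M->I, ...)
    next(it)  # transitions of the begin state

    columns = []
    for line in it:
        if line.startswith("//"):  # end of the model
            break
        fields = line.split()
        # Emission lines carry a residue, a column index and 20 values
        # (plus a trailing index); transition lines have only 10 fields
        # and separator lines are blank, so both are skipped here.
        if len(fields) < 22:
            continue
        freqs = [decode(f) for f in fields[2:22]]
        columns.append((fields[0], dict(zip(AMINO_ACIDS, freqs))))
    return columns
```

With a file produced by hhmake, something like `parse_hhm_frequencies(open("output.hhm"))` should return one frequency dictionary per alignment column (the file name is only a placeholder).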

Since these are floating-point numbers rather than the integer scores in half-bit units found in a PSSM, you would need to convert and round the output if the goal is to get PSSM-style output. I don't suggest you do that, as it involves a loss of precision. Instead, I suggest you adapt the downstream programs to use the more accurate floating-point numbers.
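
To give a feel for the precision loss, here is a small Python sketch. It assumes a PSSM-style score is simply the log-odds ratio of the observed frequency to a background frequency, expressed in half-bit units and rounded to an integer. Real PSI-BLAST scores also involve pseudo-counts and scaling, so treat this only as an approximation; the function names and the background value of 0.05 are made up for illustration.

```python
import math


def to_half_bit_score(freq, background):
    """Round a log-odds ratio to an integer score in half-bit units (freq must be > 0)."""
    return round(2.0 * math.log2(freq / background))


def from_half_bit_score(score, background):
    """Recover the frequency implied by an integer half-bit score."""
    return background * 2.0 ** (score / 2.0)


observed = 0.3      # hypothetical column frequency
background = 0.05   # hypothetical background frequency
score = to_half_bit_score(observed, background)
recovered = from_half_bit_score(score, background)
print(score, recovered)  # the recovered value no longer equals 0.3
```

The rounding step maps a whole range of frequencies onto the same integer, which is why working with the floating-point values directly keeps more information.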

Beware that column frequencies extracted from HHM files do not have pseudo-counts added to them, so many residue frequencies will be zero. That is dealt with on the fly by hhsearch and other programs that use HHMs, but you will have to come up with your own solution, as residue frequencies generally should not be zero. One way to do it is to use the -cs option of hhmake to add context-specific pseudo-counts to the resulting HHM file.
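
If you would rather handle the zeros yourself after extraction, one simple option is to mix each column with background frequencies. The Python sketch below does that. Note that this is plain background mixing, not the context-specific pseudo-counts that HH-suite computes, and the function name, the uniform background and the mixing weight `tau` are assumptions chosen for illustration.

```python
def add_pseudocounts(freqs, background, tau=0.1):
    """Mix observed frequencies with a background distribution.

    freqs and background are equal-length lists of probabilities;
    tau is the weight given to the background (0 = no pseudo-counts).
    """
    mixed = [(1.0 - tau) * f + tau * b for f, b in zip(freqs, background)]
    total = sum(mixed)
    # Renormalise so the column still sums to one.
    return [m / total for m in mixed]


# A fully conserved column: one residue at 1.0, the other 19 at zero.
column = [1.0] + [0.0] * 19
uniform = [1.0 / 20] * 20  # assumed background; real ones are not uniform
smoothed = add_pseudocounts(column, uniform)
```

A larger `tau` pulls sparse columns more strongly toward the background, so it is worth checking how sensitive the downstream predictor is to that choice.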
