I'm trying to decide on the most appropriate threshold criteria. In particular, this is for Pfam HMMs and a human metagenomics dataset, and I'm curious what people have done in the past. It looks like anvio uses `--cut_ga` as the default: https://github.com/merenlab/anvio/issues/498
What has been generally accepted as most appropriate for metagenomics?
--cut_ga
    Use Pfam GA (gathering threshold) score cutoffs. Equivalent to --globT <GA1> --domT <GA2>, but the GA1 and GA2 cutoffs are read from the HMM file. hmmbuild puts these cutoffs there if the alignment file was annotated in a Pfam-friendly alignment format (extended SELEX or Stockholm format) and the optional GA annotation line was present. If these cutoffs are not set in the HMM file, --cut_ga doesn't work.
--cut_tc
    Use Pfam TC (trusted cutoff) score cutoffs. Equivalent to --globT <TC1> --domT <TC2>, but the TC1 and TC2 cutoffs are read from the HMM file. hmmbuild puts these cutoffs there if the alignment file was annotated in a Pfam-friendly alignment format (extended SELEX or Stockholm format) and the optional TC annotation line was present. If these cutoffs are not set in the HMM file, --cut_tc doesn't work.
--cut_nc
    Use Pfam NC (noise cutoff) score cutoffs. Equivalent to --globT <NC1> --domT <NC2>, but the NC1 and NC2 cutoffs are read from the HMM file. hmmbuild puts these cutoffs there if the alignment file was annotated in a Pfam-friendly alignment format (extended SELEX or Stockholm format) and the optional NC annotation line was present. If these cutoffs are not set in the HMM file, --cut_nc doesn't work.
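For reference, a minimal sketch of how a gathering-threshold run is typically invoked. The filenames `Pfam-A.hmm` and `proteins.faa` are placeholders, not from this thread; the script only executes `hmmsearch` if HMMER happens to be installed:

```python
import shutil
import subprocess

# Hypothetical hmmsearch invocation using Pfam gathering thresholds instead of
# an E-value cutoff; Pfam-A.hmm and proteins.faa are placeholder filenames.
cmd = [
    "hmmsearch",
    "--cut_ga",              # apply the GA cutoffs stored in each HMM
    "--tblout", "hits.tbl",  # per-sequence tabular output
    "Pfam-A.hmm",
    "proteins.faa",
]
print(" ".join(cmd))

# Only attempt the run if HMMER is actually available on this machine.
if shutil.which("hmmsearch"):
    subprocess.run(cmd, check=True)
```

With `--cut_ga` in effect, the per-model GA thresholds replace any `-E`/`-T` reporting cutoffs, so the same command works unchanged across all Pfam models.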
I noticed a lot of wrappers use the `-Z` flag. How exactly would this influence the results?

To answer your question, we need to recall that the `E-value` is the product of the number of tests and the `P-value`. For biological sequence comparisons, the number of tests (`N`) is equivalent to the number of database sequences.

Let's say you search an HMM against a database with 1 million sequences and find a match to `SSDZ_HUMAN` with `E=1e-20`. If you search the same HMM against a database with 10 million sequences that contains `SSDZ_HUMAN` -- or you perform the search 3 years later and the database has grown 10x -- this sequence will now have `E=1e-19`. The `-Z` switch specifies the `N` that will be used for the `E-value` calculation, regardless of the actual number of database sequences.

A consequence of constant sequence growth is that `E-values` for the same match get less significant. That obviously matters little if our match had `E=1e-20` to begin with. However, if our match has `E=0.01` and the database grows 5x, next time we search its `E-value` will be 0.05 and statistically insignificant at common thresholds.

Thank you, this is extremely helpful. When you say database size, do you mean the number of HMMs in an HMM database or the number of sequences used to create the database?
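The E-value arithmetic described in the answer above can be sketched directly; `p` here stands for the per-comparison P-value, which HMMER derives from the bit score:

```python
import math

# E-value = N * P-value: number of tests (database sequences) times the
# per-comparison P-value.
def evalue(p_value: float, n_sequences: float) -> float:
    return p_value * n_sequences

# The E=1e-20 match from the example: a 10x larger database costs exactly
# one order of magnitude of significance.
p = 1e-26  # P-value implied by E=1e-20 against 1e6 sequences
assert math.isclose(evalue(p, 1e6), 1e-20)
assert math.isclose(evalue(p, 1e7), 1e-19)

# A marginal hit degrades the same way: E=0.01 becomes ~0.05 after 5x growth.
p_marginal = 0.01 / 1e6
assert math.isclose(evalue(p_marginal, 5e6), 0.05)
```

This is why pinning `N` with `-Z` keeps E-values comparable across database releases: the P-value of a given match is fixed, so only the multiplier changes.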
Should I use the `-Z` value from each Pfam entry individually for each hmmsearch run? For example, for `PF10417.9` there is this field in the combined Pfam database: `SM hmmsearch -Z 45638612 -E 1000 --cpu 4 HMM pfamseq`. Are these the recommended settings when running against Pfam? If so, do I need to adjust this manually for each run?

`N` is always the number of sequences/HMMs in the database that is being searched. If you are using `hmmsearch`, that will be the number of individual proteins in your FASTA database, unless you specify differently via the `-Z` switch. If you are using `hmmpfam` or `hmmscan`, `N` will be the number of HMMs in your target database unless overridden via `-Z`.

There is no right or wrong value for `-Z` -- it is a matter of maintaining the consistency of scores/E-values over a period of time. I don't know exactly why the Pfam folks chose `-Z 45638612`, but it is a safe bet that at one point that was the size of their FASTA database and they decided to stick with it. I don't know what `-Z` you should pick, but `45638612` is as good a choice as any. For a protein sequence database, any `-Z` number in the tens of millions should be a good approximation of the current database size. When searching against a database of HMMs/profiles, any `-Z` in the tens of thousands should be fine. Again, it is a matter of picking a number that will ensure internal consistency of scoring rather than trying to find an optimal `-Z` value.

Ok interesting, it might be a good idea if I start using `-Z 45638612` to keep analysis pipelines fairly consistent. My last question... is the `score` affected by `-Z`, or only the `E-value` calculation?

Bit scores are not affected by `-Z`, only `E-values`.
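A toy model of that last point: the bit score is computed from the alignment alone, and `-Z` enters only when the score's P-value is converted to an E-value. The `2**-score` tail below is a simplification (HMMER fits tail parameters per model), but the linear dependence on `Z` it demonstrates is exact:

```python
import math

# Simplified tail model: P(score >= s) ~ 2^-s; E-value = Z * P.
# Real HMMER uses per-model fitted Gumbel/exponential tails, but the
# E = Z * P relationship is the same.
def evalue_from_bits(bit_score: float, z: int) -> float:
    return z * 2.0 ** (-bit_score)

score = 50.0  # the bit score itself never depends on Z
e_small = evalue_from_bits(score, z=1_000_000)
e_large = evalue_from_bits(score, z=45_638_612)

# Same score, different Z: the E-values differ by exactly the ratio of the
# two Z values, while the score is untouched.
assert math.isclose(e_large / e_small, 45_638_612 / 1_000_000)
```

So pipelines that pin `-Z` get stable E-values, and pipelines that filter on bit score (as the GA/TC/NC cutoffs do) are insensitive to `-Z` entirely.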