Dear all,
context
I am doing a domain analysis on a set of protein sequences retrieved from an HMM search using the profile of a specific TF family. To keep only the sequences that actually contain the (entire) domain, I am now using CD-Search from NCBI (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). According to the help page (https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSearch_help_contents), it searches a collection of domain models from several databases (e.g. CDD, SMART, Pfam, etc.), which sounds pretty cool to me because I can handle a single output instead of querying each database separately and then integrating the results. What I find odd (and this is the core of my question) is the type of output this search returns.
It states that it returns three hit types: specific hits (a high-confidence association between a protein query sequence and a conserved domain), non-specific hits (if a specific hit IS NOT found on a query protein sequence, but the protein has an otherwise statistically significant hit (E-value cutoff of 0.01) to any domain model in CDD, the domain model is regarded as a non-specific hit) and superfamily hits.
questions
1) What I really don't understand is why non-specific hits are included in the output (by the way, they are only present in the "full" output and NOT in the "concise" output). What can we learn from a non-specific hit?
2) Which hits would you retain from the concise output file (considering the specific and superfamily hit types)?
I really hope you have experience in this.
Thanks in advance for any help.
You can find more information here: https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_type_non_specific_hit
I copy-pasted the relevant part:
Here you can see details of concise, standard and full output: https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#ConciseDisplayIllustration
Thanks, I will go for the full output then and compare it with the concise output, to be on the safe side in terms of how well the TF family I am analyzing is represented. DONE: the outputs are, as expected, very different. The "full" output is huge.
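For anyone who wants to repeat this comparison, here is a minimal Python sketch of how it could be done. It assumes both results were downloaded as tab-delimited tables with "Query", "Hit type" and "Short name" columns (an assumption about the export format, so check the header of your own files). The sequence IDs and the domain name TF_dom are made up for illustration.

```python
import csv
import io

def read_hits(handle):
    """Parse a tab-delimited hit table, skipping blank lines and '#' comments."""
    lines = [ln for ln in handle if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))

def queries_with_domain(hits, short_name):
    """Return the set of query IDs that have at least one hit to short_name."""
    return {h["Query"] for h in hits if h["Short name"] == short_name}

# Hypothetical toy data standing in for the two downloaded files.
CONCISE = (
    "Query\tHit type\tShort name\n"
    "seq1\tspecific\tTF_dom\n"
    "seq2\tspecific\tOTHER_dom\n"
)
FULL = (
    "Query\tHit type\tShort name\n"
    "seq1\tspecific\tTF_dom\n"
    "seq2\tspecific\tOTHER_dom\n"
    "seq2\tnon-specific\tTF_dom\n"
)

concise_hits = read_hits(io.StringIO(CONCISE))
full_hits = read_hits(io.StringIO(FULL))

# Queries where the domain of interest shows up only in the full output.
only_in_full = (queries_with_domain(full_hits, "TF_dom")
                - queries_with_domain(concise_hits, "TF_dom"))
print(sorted(only_in_full))
```

With real files, replace the `io.StringIO(...)` objects with `open("your_file.txt")`; the sequences in `only_in_full` are the ones worth inspecting by hand.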
Decision: I am keeping the concise results, filtered for complete domains of interest.
I am also including the superfamily non-specific hits to avoid false negatives: e.g. if another TF family domain is the best match in a given region, I still want to know which genes have my TF family domain of interest as the second-best match for that superfamily.
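In case it helps someone, the filtering step could look roughly like the Python sketch below. It assumes a tab-delimited hit table with an "Incomplete" column in which "-" marks a complete domain and letters such as "N" or "C" mark a truncated one; that is my assumption about the export format, so verify it against your own file. The domain names and sequence IDs are hypothetical, and the accepted hit types are a parameter so that non-specific hits can be added if wanted.

```python
import csv
import io

# Hypothetical toy table; real column names and values may differ.
TABLE = (
    "Query\tHit type\tShort name\tIncomplete\n"
    "seq1\tspecific\tTF_dom\t-\n"
    "seq2\tsuperfamily\tTF_dom superfamily\t-\n"
    "seq3\tspecific\tTF_dom\tN\n"
    "seq4\tnon-specific\tTF_dom\t-\n"
    "seq5\tspecific\tOTHER_dom\t-\n"
)

def complete_hits(handle, names, hit_types):
    """Return query IDs with a complete hit to one of `names`.

    names     -- accepted domain short names
    hit_types -- accepted hit types (lower case)
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Short name"] not in names:
            continue  # not a domain of interest
        if row["Hit type"].lower() not in hit_types:
            continue  # hit type not requested
        if row["Incomplete"].strip() != "-":
            continue  # domain is truncated at one or both ends
        kept.append(row["Query"])
    return kept

kept = complete_hits(
    io.StringIO(TABLE),
    names={"TF_dom", "TF_dom superfamily"},
    hit_types={"specific", "superfamily"},
)
print(kept)
```

Adding "non-specific" to `hit_types` would also keep seq4 in this toy example.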