Question

Conserved Domain analysis (CD-search)

5.2 years ago

lessismore ★ 1.4k

Dear all,

context
I am doing a domain analysis on a set of protein sequences retrieved from an HMM search using the profile of a specific TF family. To keep only the sequences that actually contain the (entire) domain, I am now using CD-Search from NCBI (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). According to the README (https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSearch_help_contents), it searches a collection of domain models from several databases (e.g. CDD, SMART, Pfam, etc.), which is convenient because I get a single output instead of querying each database separately and then integrating the results. What I find odd (and this is the core of my question) is the type of output this search returns.

It states that it returns Specific hits (a high-confidence association between a protein query sequence and a conserved domain), Non-specific hits (if a specific hit is NOT found on a query protein sequence, but the protein has an otherwise statistically significant hit (E-value cutoff of 0.01) to any domain model in CDD, the domain model is regarded as a non-specific hit) and Superfamily hits.

questions
1) What I really don't understand is why non-specific hits are included in the output (by the way, they are only present in the "full" output and NOT in the "concise" output). What can we learn from a non-specific hit?
2) Which results would you retain from the concise output file (considering the specific and superfamily hit types)?

I really hope you have experience in this.

Thanks in advance for any help.

conserved domain CDD NCBI • 4.4k views

You can find more information here: https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_type_non_specific_hit

I copy pasted the relevant part:

Types of RPS-BLAST hits:

CD-Search results can include hit types that represent various confidence levels (specific hits, non-specific hits) and domain model scope (superfamilies, multi-domains). They can be seen in both the Concise display and Full display, except for non-specific hits, which are shown only in the Full Display.

Specific hit is the top-ranking RPS-BLAST hit (compared to other hits in overlapping intervals) that meets or exceeds a domain-specific E-value threshold (details and illustration). It represents a very high confidence that the query sequence belongs to the same protein family as the sequences used to create the domain model, and therefore a high confidence level for the inferred function of the protein query sequence.

Non-specific hits meet or exceed the RPS-BLAST threshold for statistical significance (default E-value cutoff of 0.01, or an E-value selected by the user with advanced search options). (NOTE: Non-specific hits are shown only in the full display (illustration) of search results. In contrast, the concise display (illustration) shows only the superfamily to which the top-scoring non-specific hit for a given sequence region belongs.)

Superfamily is the domain cluster to which the specific and/or non-specific hits belong. This is a set of conserved domain models that generate overlapping annotation on the same protein sequences and are assumed to represent evolutionarily related domains. (See additional details, including information about clustering methodology, under "What is a superfamily?") In the Concise Display, if a region of the query sequence has only non-specific hits to domain models from a given superfamily, only the superfamily footprint will be displayed -- not the individual superfamily members to which the query sequence had non-specific hits. To see the latter, view the Full Display of search results. In that display, the width of the box that encloses superfamily members is determined by the alignment span of the highest scoring superfamily member.

Multi-domains are domain models that were computationally detected and are likely to contain multiple single domains. They are typically shown as grey-colored bars. (Examples are shown in the concise display and full display illustrations.)

Here you can see details of concise, standard and full output: https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#ConciseDisplayIllustration

5.2 years ago by Fatima ▴ 1000

Thanks, I will go for the full output then and compare it with the concise output, to be on the safe side in terms of representativeness for the TF family I am analyzing. DONE: the outputs are, as expected, very different. The "full" output is huge.
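A comparison like this can be scripted. The following is only a minimal sketch, assuming both results were downloaded as tab-delimited hit tables with a header row containing the columns "Query", "Hit type" and "Short name" (check these names against your own files). The gene and domain names in the toy tables are made up for illustration.

```python
import csv
from collections import Counter


def read_hits(text):
    """Parse a tab-delimited hit table, skipping blank lines and '#' comments."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def hit_type_counts(rows):
    """Count rows per value of the (assumed) 'Hit type' column."""
    return Counter(row["Hit type"] for row in rows)


def only_in_full(concise_rows, full_rows):
    """Return rows of the full table that have no counterpart in the concise table."""
    def key(row):
        return (row["Query"], row["Hit type"], row["Short name"])

    seen = {key(row) for row in concise_rows}
    return [row for row in full_rows if key(row) not in seen]


# Toy tables with hypothetical genes and domains, standing in for real downloads.
HEADER = "Query\tHit type\tShort name\tE-Value"
CONCISE = "\n".join([
    HEADER,
    "geneA\tspecific\tdomX\t1e-30",
    "geneB\tsuperfamily\tdomX_sf\t1e-05",
])
FULL = "\n".join([
    HEADER,
    "geneA\tspecific\tdomX\t1e-30",
    "geneA\tnon-specific\tdomY\t1e-04",
    "geneB\tnon-specific\tdomX\t1e-05",
    "geneB\tsuperfamily\tdomX_sf\t1e-05",
])

if __name__ == "__main__":
    concise_rows = read_hits(CONCISE)
    full_rows = read_hits(FULL)
    print("concise:", dict(hit_type_counts(concise_rows)))
    print("full:   ", dict(hit_type_counts(full_rows)))
    for row in only_in_full(concise_rows, full_rows):
        print("only in full:", row["Query"], row["Hit type"], row["Short name"])
```

In this toy case the extra rows in the full table are exactly the non-specific hits, which matches the description of the two displays quoted above.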

Decision: I am keeping the concise results, filtered for complete domains of interest.
I am also including the superfamily (non-specific) hits to avoid false negatives: e.g. even if another TF family domain is the best match in a given region, I want to know which genes have a second-best match, within that superfamily, to my TF family domain of interest.
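A filter along these lines could look like the sketch below. It rests on assumptions you should verify against your own download: that the concise table is tab-delimited with columns named "Query", "Hit type", "Short name" and "Incomplete", and that a "-" in "Incomplete" marks a hit covering the whole domain model. The names "TF_dom" and "TF_dom superfamily" are placeholders, not real CDD entries.

```python
import csv

# Hit types to retain from the concise output (specific and superfamily hits).
KEEP_TYPES = {"specific", "superfamily"}


def read_hits(text):
    """Parse a tab-delimited hit table, skipping blank lines and '#' comments."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def complete_hits(rows, names_of_interest):
    """Keep specific/superfamily hits to the domains of interest that are complete.

    Assumption: '-' in the 'Incomplete' column means the hit is not truncated.
    """
    kept = []
    for row in rows:
        if row["Hit type"] not in KEEP_TYPES:
            continue
        if row["Incomplete"].strip() != "-":
            continue
        if row["Short name"] not in names_of_interest:
            continue
        kept.append(row)
    return kept


# Toy concise table with hypothetical genes and domain names.
TABLE = "\n".join([
    "Query\tHit type\tFrom\tTo\tShort name\tIncomplete",
    "geneA\tspecific\t10\t120\tTF_dom\t-",
    "geneB\tspecific\t5\t60\tTF_dom\tC",
    "geneC\tsuperfamily\t8\t118\tTF_dom superfamily\t-",
    "geneD\tspecific\t1\t90\tother_dom\t-",
])

if __name__ == "__main__":
    wanted = {"TF_dom", "TF_dom superfamily"}
    for row in complete_hits(read_hits(TABLE), wanted):
        print(row["Query"], row["Hit type"], row["Short name"])
```

Matching on the short name keeps the example simple; matching on the accession instead would be more robust if domain names are not unique.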

5.2 years ago by lessismore ★ 1.4k