Question

How To Identify Proteins Present Only In Pathogens But Not In Non-Pathogens (Virulence Factors)?

2

12.7 years ago

nicole ▴ 20

I used PSI-CD-HIT-2D to compare the proteome of pathogen A with that of pathogen B, from the same genus, at 30% identity. The matched protein sequences (homologs above 30% identity) were then compared to pathogen C from the same genus to identify proteins that are present in all three pathogens. The results were then compared to non-pathogens (D, E, F) from the same genus at 30% identity to identify proteins present in all 3 pathogens but absent from the non-pathogens (potential virulence factors). Proteins consistently present in pathogens but not in non-pathogens are likely to play an important role in the pathogenic lifestyle.

I then searched the proteins obtained above (the potential virulence factors) against the nr protein database using BLAST. However, I found hits to the same proteins from the non-pathogens (D, E, F) that were included in my analysis, with E-values below 1e-05 and identity above 30%. So the results from CD-HIT and BLAST conflict. I have no idea how this can happen and I urgently need a solution. Can anyone help me? Thanks in advance.

blast • 3.6k views

ADD COMMENT • link updated 7.9 years ago by priyankashrivastava4 ▴ 10 • written 12.7 years ago by nicole ▴ 20

score 3 · Answer 1 · 2012-05-07

You have encountered a common problem that occurs when moving from a consensus-based search strategy (CD-HIT) to a pairwise search strategy (BLASTP). In general, consensus-based strategies are designed to capture deep evolutionary relationships with a single model. But sometimes two sequences are closely related to each other (> 50% identity, E() < 1e-40), yet only one of them is detected by the consensus model (perhaps while sitting far from its "center"), and the other is not. (Think of two leaves on nearby branches of a tree: one is close enough to the root to be found with CD-HIT, while the other is just beyond detection.) The same problem occurs with Pfam.

One solution would be to use pairwise searches, rather than CD-HIT. Use BLASTP to find the proteins that are shared by the pathogenic organisms but not by non-pathogens (or use ggsearch, which I think will be better suited to this problem).

And forget about 30% identity. Many proteins with E()-values < 1e-10 are clearly homologous yet less than 30% identical. E-values are much more reliable indicators of homology than percent identity.
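A minimal sketch of the pairwise filtering step, assuming pathogen A has been searched with BLASTP against each of the other genomes and the hits have been reduced to (query, genome, E-value) triples. The function name, the genome labels, the toy hits, and the 1e-10 cutoff are all illustrative assumptions, not a tested pipeline.

```python
from collections import defaultdict

def pathogen_specific(hits, pathogens, non_pathogens, max_evalue=1e-10):
    """Return query proteins with a significant hit in every pathogen
    genome and no significant hit in any non-pathogen genome.

    hits: iterable of (query_id, genome_label, evalue) tuples, e.g. parsed
    from BLAST tabular output with the genome label added by the caller.
    """
    genomes_hit = defaultdict(set)
    for query_id, genome, evalue in hits:
        # Keep only hits that pass the E-value cutoff; percent identity is ignored.
        if evalue <= max_evalue:
            genomes_hit[query_id].add(genome)

    pathogens = set(pathogens)
    non_pathogens = set(non_pathogens)
    return sorted(
        query for query, genomes in genomes_hit.items()
        if pathogens <= genomes and not (genomes & non_pathogens)
    )

# Toy hits, made up for illustration only.
hits = [
    ("A_001", "B", 1e-50), ("A_001", "C", 1e-30),
    ("A_002", "B", 1e-40), ("A_002", "C", 1e-25), ("A_002", "D", 1e-15),
    ("A_003", "B", 1e-60),
    ("A_004", "B", 1e-20), ("A_004", "C", 1e-12), ("A_004", "E", 0.5),
]
result = pathogen_specific(hits, pathogens=["B", "C"], non_pathogens=["D", "E", "F"])
print(result)
```

Here A_002 is dropped because it has a significant hit in non-pathogen D, and A_003 because it is missing from pathogen C. A_004 is kept, since its only non-pathogen hit is far above the cutoff.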

score 0 · Answer 2 · 2012-05-07

0

12.7 years ago

Michael Kuhn 5.0k

CD-HIT uses heuristics to find clusters of proteins with high similarity. (The name stands for "Cluster Database at High Identity with Tolerance".) So a threshold of 30% is well outside the intended parameter range. At such a low identity threshold the heuristic will miss many pairs that have >30% identity.

Thus, you'll have to rely on the BLAST results. Note that E-values depend on the size of the database, so instead of an E-value cutoff you may want to use a bit score cutoff. Bit scores have a meaning independent of the number of sequences in the database.
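To illustrate the dependence on database size, here is a small sketch based on the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The lengths and the score are made-up numbers, and real BLAST uses effective (edge-corrected) lengths, so this is only an approximation.

```python
def evalue_from_bitscore(bit_score, query_len, db_len):
    """Approximate expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Same alignment (bit score 50, query of 300 residues), two database sizes.
small = evalue_from_bitscore(50, 300, 1_000_000)      # small custom database
large = evalue_from_bitscore(50, 300, 1_000_000_000)  # nr-sized database

print(f"small db: E = {small:.2e}")
print(f"large db: E = {large:.2e}")
print(f"ratio: {large / small:.0f}")
```

The bit score is identical in both cases, but the E-value grows in proportion to the database size. A hit that passes a fixed E-value cutoff against a small custom database may therefore fail it against nr.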

ADD COMMENT • link 12.7 years ago by Michael Kuhn 5.0k

0

Thanks Michael. I've tried standalone BLAST. I merged the pathogens (B,C,D) and non-pathogens (E,F,G,H) into a single file and built a database from it (I named it mydb) using the makeblastdb command. Then I compared pathogen A against mydb at an E-value of 1e-20, so that the hits would form something like clusters of protein families under the specified E-value. Can this method work? How can I pick out the families that contain only proteins from pathogens (A,B,C,D)? Thanks for your help.
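One possible way to keep track of which genome each hit in a merged database comes from is to prefix every FASTA header with a genome label before concatenating the files and running makeblastdb. The sketch below assumes a `__` separator and hypothetical helper names. The labels themselves must not contain `__`.

```python
def tag_fasta_headers(lines, genome_label, sep="__"):
    """Prefix the ID on each FASTA header line with the genome label."""
    tagged = []
    for line in lines:
        if line.startswith(">"):
            tagged.append(f">{genome_label}{sep}{line[1:]}")
        else:
            tagged.append(line)  # sequence lines are left unchanged
    return tagged

def genome_of(subject_id, sep="__"):
    """Recover the genome label from a tagged subject ID."""
    return subject_id.split(sep, 1)[0]

# Toy FASTA records, made up for illustration only.
example = [">prot1 hypothetical protein", "MKTAYIAK", ">prot2", "MSLLTE"]
tagged = tag_fasta_headers(example, "E")
print(tagged)
print(genome_of("E__prot1"))
```

With tagged IDs, the subject column of the BLAST output tells you directly whether a hit belongs to a pathogen or a non-pathogen. Queries can then be filtered by the set of genomes they hit.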

ADD REPLY • link 12.7 years ago by nicole ▴ 20

score 0 · Answer 3 · 2017-02-01

0

7.9 years ago

priyankashrivastava4 ▴ 10

Hello Nicole,

I would like to know the names of databases, or a list, of bacteria that are non-pathogenic to humans, because all the available databases seem to list only bacteria that are pathogenic to humans.

ADD COMMENT • link 7.9 years ago by priyankashrivastava4 ▴ 10