How To Identify Proteins Present Only In Pathogens But Not In Non-Pathogens (Virulence Factors)?
12.6 years ago
nicole ▴ 20

I used PSI-CD-HIT-2D to compare the proteome of pathogen A with that of pathogen B, from the same genus, at 30% identity. The matched protein sequences (homologs above 30% identity) were then compared with pathogen C, also from the same genus, to identify proteins that are present in all three pathogens. These results were then compared with non-pathogens (D, E, F) from the same genus at 30% identity to identify proteins present in all three pathogens but absent from the non-pathogens (candidate virulence factors). Proteins that are consistently present in pathogens but not in non-pathogens are likely to play an important role in the pathogenic lifestyle.

I then searched the proteins obtained above (potential virulence factors) against the nr protein database using BLAST. However, I found hits to the same proteins in the non-pathogens (D, E, F) that I included in the analysis, with E-values below 1e-05 and identity above 30%. The CD-HIT and BLAST results conflict. I have no idea how this can happen and I urgently need a solution. Can anyone help? Thanks in advance.

blast • 3.6k views
12.6 years ago
Bill Pearson ★ 1.0k

You have encountered a common problem that occurs when moving from a consensus-based search strategy (CD-HIT) to a pairwise search strategy (BLASTP). In general, consensus-based strategies are designed to capture deep evolutionary relationships with a single model. But sometimes two sequences are closely related to each other (> 50% identity, E() < 1e-40), yet one of them can be detected by the consensus model (though perhaps distant from its "center") while the other cannot. (Think of two leaves on nearby branches of a tree, one of which is close enough to the root to be found with CD-HIT, while the other is just beyond detection.) The same problem occurs with PFAM.

One solution would be to use pairwise searches, rather than CD-HIT. Use BLASTP to find the proteins that are shared by the pathogenic organisms but not by non-pathogens (or use ggsearch, which I think will be better suited to this problem).

And forget about 30% identity. There will be many proteins with E()-values < 1e-10 that are clearly homologous but less than 30% identical. E-values are much more reliable indicators of homology than percent identity.
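The pairwise approach above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: it assumes BLASTP was run with pathogen A as the query and tabular output (-outfmt 6), and that subject IDs carry a hypothetical "ORGANISM|protein_id" prefix so each hit can be assigned to an organism. The function name, the ID convention, the 1e-10 cutoff and the example rows are all made up for the sketch.

```python
import csv
import io
from collections import defaultdict


def pathogen_specific(blast_tab, pathogens, non_pathogens, max_evalue=1e-10):
    """Return query IDs that hit every pathogen and no non-pathogen.

    blast_tab: file-like object holding BLAST tabular output (-outfmt 6).
    Assumption (hypothetical): subject IDs look like "ORGANISM|protein_id".
    """
    organisms_hit = defaultdict(set)
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        # Filter on E-value only; percent identity (column 3) is ignored.
        if evalue <= max_evalue:
            organisms_hit[query].add(subject.split("|", 1)[0])
    required, forbidden = set(pathogens), set(non_pathogens)
    return sorted(
        q for q, orgs in organisms_hit.items()
        if required <= orgs and not (forbidden & orgs)
    )


# Made-up hits for four proteins of pathogen A (illustration only).
example = "\n".join([
    "A_p1\tB|b1\t45.0\t200\t110\t0\t1\t200\t1\t200\t1e-50\t180",
    "A_p1\tC|c1\t40.0\t200\t120\t0\t1\t200\t1\t200\t1e-40\t150",
    "A_p2\tB|b2\t60.0\t300\t120\t0\t1\t300\t1\t300\t1e-60\t220",
    "A_p2\tC|c2\t58.0\t300\t126\t0\t1\t300\t1\t300\t1e-55\t210",
    "A_p2\tD|d2\t35.0\t300\t195\t0\t1\t300\t1\t300\t1e-30\t120",
    "A_p3\tB|b3\t50.0\t250\t125\t0\t1\t250\t1\t250\t1e-45\t170",
    "A_p3\tC|c3\t28.0\t80\t58\t0\t1\t80\t1\t80\t1e-3\t35",
    "A_p4\tB|b4\t55.0\t400\t180\t0\t1\t400\t1\t400\t1e-70\t250",
    "A_p4\tC|c4\t52.0\t400\t192\t0\t1\t400\t1\t400\t1e-65\t240",
    "A_p4\tE|e4\t25.0\t400\t300\t0\t1\t400\t1\t400\t1e-15\t80",
])

result = pathogen_specific(io.StringIO(example), ["B", "C"], ["D", "E", "F"])
print(result)  # ['A_p1']
```

In the made-up data, A_p4 is excluded because of a hit in non-pathogen E that is only 25% identical but has an E-value of 1e-15, which is the kind of homolog a 30% identity threshold would miss.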


Thanks @Bill Pearson. I've tried standalone BLAST. I merged the pathogens (B, C, D) and non-pathogens (E, F, G, H) into a single file and built a database from it (which I named mydb) with makeblastdb. I then compared pathogen A against mydb at an E-value of 1e-20, hoping to obtain something like clusters of protein families that fall under the specified E-value. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.

12.6 years ago

CD-HIT uses heuristics to find clusters of proteins with high similarity. (The name stands for "Cluster Database at High Identity with Tolerance".) So a threshold of 30% is well outside the intended parameter range. At such a low identity threshold the heuristic will miss many pairs that have >30% identity.

Thus, you'll have to rely on the BLAST results. Note that the e-values are dependent on the size of the database, so perhaps instead of an e-value cutoff you want to use a bitscore cutoff. Bitscores have a meaning independent of the number of genes that are in the database.
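To make the database-size dependence concrete, here is a small sketch based on the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. It ignores the effective-length corrections that BLAST applies, and the function name and the numbers are made up for illustration.

```python
import math


def evalue_from_bitscore(bitscore, query_len, db_len):
    """Approximate E-value for a bit score: E = m * n * 2^(-S').

    Simplification: uses raw lengths, not BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bitscore)


# Same alignment (same bit score) searched against two database sizes.
small_db = evalue_from_bitscore(50, query_len=300, db_len=1e6)
large_db = evalue_from_bitscore(50, query_len=300, db_len=1e7)

# The bit score is unchanged, but the E-value grows with the database.
print(small_db, large_db, large_db / small_db)
```

A hit that passes an E-value cutoff against a small custom database may therefore fail the same cutoff against nr, while its bit score stays the same in both searches.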


Thanks Michael. I've tried standalone BLAST. I merged the pathogens (B, C, D) and non-pathogens (E, F, G, H) into a single file and built a database from it (which I named mydb) with makeblastdb. I then compared pathogen A against mydb at an E-value of 1e-20, hoping to obtain something like clusters of protein families that fall under the specified E-value. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.

7.8 years ago

Hello Nicole,

I would like to know of any databases or lists of bacteria that are non-pathogenic to humans, because all the available databases I have found list only bacteria that are pathogenic to humans.
