I used PSI-CD-HIT-2D to compare the proteome of pathogen A against that of pathogen B, from the same genus, at a 30% identity threshold. The matched protein sequences (homologs above 30% identity) were then compared against pathogen C, also from the same genus, to identify proteins present in all three pathogens. Those results were in turn compared against non-pathogens (D, E, F) from the same genus at 30% identity, to identify proteins present in all three pathogens but absent from the non-pathogens (candidate virulence factors). Proteins that are consistently present in pathogens but absent from non-pathogens are likely to play an important role in the pathogenic lifestyle.
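To make the selection logic explicit, here is a minimal Python sketch of the filtering described above. It assumes the homology results have already been reduced to one set of pathogen A protein IDs per compared genome; the function name, the dictionary layout, and the toy IDs are all hypothetical and are not produced by PSI-CD-HIT-2D itself.

```python
def pathogen_specific(a_proteins, hits_by_genome, pathogens, non_pathogens):
    """Return the A proteins matched in every pathogen and in no non-pathogen.

    hits_by_genome maps a genome label to the set of pathogen A protein IDs
    that have a homolog in that genome (hypothetical layout).
    """
    candidates = set(a_proteins)
    # Keep only proteins with a homolog in every other pathogen.
    for genome in pathogens:
        candidates &= hits_by_genome.get(genome, set())
    # Drop any protein that also has a homolog in a non-pathogen.
    for genome in non_pathogens:
        candidates -= hits_by_genome.get(genome, set())
    return candidates


# Toy data: A1 is shared by B and C only, A2 also matches non-pathogen D.
hits_by_genome = {
    "B": {"A1", "A2", "A3"},
    "C": {"A1", "A2"},
    "D": {"A2"},
    "E": set(),
    "F": set(),
}
result = pathogen_specific(
    {"A1", "A2", "A3", "A4"}, hits_by_genome, ["B", "C"], ["D", "E", "F"]
)
print(sorted(result))  # ['A1']
```

Written this way, every pathogen A protein is checked against each genome independently, so the outcome does not depend on the order in which the genomes are compared.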
I then searched the proteins obtained above (the potential virulence factors) against the nr protein database using BLAST. However, I found that the hits include proteins from the same non-pathogens (D, E, F) that I had included in the analysis, with E-values below 1e-05 and identity above 30%. So the results from CD-HIT and BLAST conflict. I have no idea how this can happen and I urgently need a solution. Can anyone help? Thanks in advance.
Thanks @Bill Pearson. I have tried standalone BLAST. I merged the pathogens (B, C, D) and the non-pathogens (E, F, G, H) into a single file and built a database from it (I named it mydb) using the makeblastdb command. I then compared pathogen A against mydb at an E-value of 1e-20, so that the hits below that threshold could be grouped into something like protein family clusters. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.
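One possible way to do the sorting is sketched below in Python. It rests on several assumptions that are not stated in the post: the search was run with tabular output (for example `blastp -query A.faa -db mydb -evalue 1e-20 -outfmt 6`, where `A.faa` is a hypothetical file name), and every sequence ID in mydb was prefixed with its genome label before running makeblastdb, as in `B|prot1`. The prefix convention and the function names are illustrative only.

```python
import csv
import io
from collections import defaultdict

PATHOGENS = {"B", "C", "D"}           # genomes that must all be hit
NON_PATHOGENS = {"E", "F", "G", "H"}  # genomes that must not be hit


def genomes_hit(handle, max_evalue=1e-20):
    """Map each query ID to the set of genome labels it hits.

    Expects default BLAST -outfmt 6 columns: qseqid is column 1,
    sseqid is column 2 and evalue is column 11.
    """
    hits = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        qseqid, sseqid, evalue = row[0], row[1], float(row[10])
        if evalue <= max_evalue:
            # Assumed ID convention: "<genome label>|<protein id>".
            hits[qseqid].add(sseqid.split("|", 1)[0])
    return hits


def pathogen_only(hits):
    """Queries with a hit in every pathogen and in no non-pathogen."""
    return sorted(
        q for q, genomes in hits.items()
        if PATHOGENS <= genomes and not (genomes & NON_PATHOGENS)
    )


def make_row(query, subject, evalue):
    # Build one 12-column outfmt 6 line with placeholder alignment values.
    return "\t".join([query, subject, "45.0", "200", "110", "0",
                      "1", "200", "1", "200", evalue, "150"])


# Toy BLAST output standing in for the real results file.
example = "\n".join([
    make_row("A_001", "B|p1", "1e-50"),
    make_row("A_001", "C|p7", "1e-45"),
    make_row("A_001", "D|p3", "1e-60"),
    make_row("A_002", "B|p2", "1e-50"),
    make_row("A_002", "C|p8", "1e-40"),
    make_row("A_002", "D|p4", "1e-30"),
    make_row("A_002", "E|p9", "1e-25"),   # non-pathogen hit, so excluded
    make_row("A_003", "B|p5", "1e-05"),   # too weak for the 1e-20 cutoff
])
selected = pathogen_only(genomes_hit(io.StringIO(example)))
print(selected)  # ['A_001']
```

With real data, `io.StringIO(example)` would be replaced by an open file handle on the BLAST output. The 1e-20 cutoff is taken from the post; a query that hits a non-pathogen only at a weaker E-value will still pass this filter, so it may be worth repeating the check with a more permissive `max_evalue` for the non-pathogen genomes.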