Hi guys,

I’m trying to set up a COG database using the updated 2014 data (from ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/). As I understand it, one needs to run the COG proteins against a database of the same proteins using PSI-BLAST. That yields one .smp file per query, and those files can then be passed to makeprofiledb to create an RPS-BLAST database.

So far I’ve done the following:

- Downloaded the protein file prot2003-2014.fa.gz
- Created a BLAST database from the extracted file (makeblastdb)
- Split the prot2003-2014.fa.gz multi-FASTA file (after extraction) into single-sequence FASTA files, so I can use each one as a PSI-BLAST query and get an individual .smp file

The resulting .smp files have hits against one or more sequences in the DB. I could limit the hits by lowering the e-value threshold so that each query's .smp file has only one hit.

My question is: does it really matter whether an .smp file has hits against only one protein or against several? Put another way, what’s an acceptable e-value for PSI-BLAST? I know there’s no one-size-fits-all e-value, but customarily I would use 10E-5 for finding orthologs. In this case, if I only want to keep one hit per .smp, I need to tighten the e-value all the way to 10E-100.
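The splitting and per-query PSI-BLAST steps described above can be sketched roughly as follows. This is only a minimal sketch, not the exact script used: the function names, the `query_NNNNNNN.fa` naming scheme and the default of 3 iterations are assumptions made for illustration, and the psiblast command is only built, not run.

```python
from pathlib import Path


def split_fasta(multi_fasta, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    with open(multi_fasta) as src:
        for line in src:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # Number the files instead of reusing the header, which may
                # contain characters (such as "|") that are awkward in file names.
                path = out_dir / f"query_{len(paths) + 1:07d}.fa"
                paths.append(path)
                handle = open(path, "w")
            # Lines before the first header are skipped.
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return paths


def psiblast_command(query, db, evalue=1e-5, iterations=3):
    """Build (but do not run) a psiblast call that saves a checkpoint PSSM.

    The evalue and iterations defaults are illustrative assumptions,
    not recommended settings.
    """
    query = Path(query)
    return [
        "psiblast",
        "-query", str(query),
        "-db", str(db),
        "-evalue", str(evalue),
        "-num_iterations", str(iterations),
        "-out_pssm", str(query.with_suffix(".smp")),
    ]
```

Each returned command list could then be handed to `subprocess.run`, which would leave one .smp file next to each query file.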
Regards
You just need to read the instructions for COGsoft (COGnitor): https://sourceforge.net/projects/cogtriangles/

PS: Do not add a question to an old post. Open a new post instead.