I would like to know the best strategy for getting the largest number of GO terms for the bacterial proteome I'm working with. Since it's a non-model organism, I will build the GO database from scratch.
I got GO annotations for 60% of the proteome by BLASTing against the bacterial nr protein database (retaining the first 20 hits), but some of the terms are very general. I obtained the same results with BLAST2GO InterPro mapping.
I've been thinking of BLASTing against both the UniProt and nr databases and merging the results. I would also like to know how many hits I should retain from the BLAST searches.
In case somebody is searching for this: the newest version of InterProScan works very nicely and adds GO terms as well as Reactome pathways to the annotation, based on the domains it finds. Of course, this approach has a bunch of limitations, but I still think it is the easiest way to do it. It took me about 8 hours on 64 cores for 38,000 proteins.
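If it helps, here is a minimal sketch of pulling the GO terms out of InterProScan's tab-separated output. It assumes the run was made with GO term lookup switched on and that the GO annotations sit in the 14th column, separated by `|`, as in the TSV layout of InterProScan 5. The function name and the file name in the usage note are my own, not part of InterProScan.

```python
import re
from collections import defaultdict

# GO identifiers are "GO:" followed by seven digits; matching them directly
# also copes with suffixes some versions append after the ID.
GO_ID = re.compile(r"GO:\d{7}")


def go_terms_from_interproscan_tsv(lines):
    """Map protein ID -> set of GO IDs from InterProScan TSV rows.

    Assumes column 1 is the protein accession and column 14 holds the
    GO annotations. Rows without a GO column are skipped.
    """
    terms = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 14:
            continue  # no GO column on this row
        found = GO_ID.findall(fields[13])
        if found:
            terms[fields[0]].update(found)
    return dict(terms)
```

To use it on a real result file (the name `proteome.tsv` is just a placeholder), something like `with open("proteome.tsv") as fh: annotations = go_terms_from_interproscan_tsv(fh)` should do.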
You should only be retaining one hit per subject species, and these hits should be verified through reciprocal BLAST. Multiple hits in a single species for a given gene of interest don't make sense. It's equivalent to saying your gene of interest does all of the functions of the n genes in the subject species.
You can merge results, but I imagine that you'll probably get a great many duplicates. I'm not very familiar with BLAST2GO, but RefSeq doesn't naturally have GO annotation, so B2G must be pulling those from somewhere else.
Having many generic results after GO annotation is common, at least in my experience. If a given gene's orthologs are poorly annotated, you can't do any better.
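A rough sketch of the reciprocal check, assuming both searches were saved in BLAST's default 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12). The function names are mine, and picking the "best" hit by bit score is an assumption; you may prefer e-value or add coverage filters.

```python
def best_hits(rows):
    """Return {query: subject}, keeping the highest bit score per query."""
    best = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Keep query -> subject pairs whose subject's best hit is the query.

    forward_rows: your proteome searched against the subject species.
    reverse_rows: the subject species searched against your proteome.
    """
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return {q: s for q, s in forward.items() if reverse.get(s) == q}
```

This gives at most one verified hit per query for one subject species; repeat it per species if you search several.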
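On the duplicates: if each source is reduced to a mapping from protein ID to a set of GO IDs, merging is just a set union and the duplicates disappear on their own. A small sketch, with a function name and input shape that are my own assumptions rather than anything BLAST2GO produces directly:

```python
def merge_annotations(*sources):
    """Union several {protein: set_of_GO_IDs} mappings into one.

    Using sets means a term reported by more than one source is
    stored only once per protein.
    """
    merged = {}
    for source in sources:
        for protein, terms in source.items():
            merged.setdefault(protein, set()).update(terms)
    return merged
```

For example, `merge_annotations(from_nr, from_uniprot)` would combine two such mappings, where both variable names are placeholders for whatever you parsed from each search.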
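One thing that can at least tidy up the output: when a protein carries both a specific term and its more general ancestors, the ancestors can be dropped. The sketch below assumes you have a child-to-parents mapping, which in practice would come from parsing the GO ontology file; the `parents` dictionary and function names here are hypothetical. It does not make a poorly annotated gene any more specific, it only removes redundancy.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term by walking the child -> parents map."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(terms, parents):
    """Drop any term that is an ancestor of another term in the same set."""
    terms = set(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents)
    return terms - redundant
```

A protein annotated only with a general term keeps that term, which matches the point above: if the orthologs are poorly annotated, there is nothing more specific to fall back on.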
updated 4.9 years ago by Ram (44k) • written 10.6 years ago by pld (5.1k)