Hello all,
I am working on a large-scale project that involves annotating a large set of proteins, many of them of unknown function. I was wondering if there's a good way to assign Pfam families, COGs, and Enzyme Commission (EC) numbers to the proteins where possible, so I won't have to rely on other people's assignments or on outdated databases.
So, what is the way to do it in 2018? Of course, more than one result per protein is perfectly OK - I will find a way to integrate them later.
Thank you for any input.
Thank you, good tip!
I don't have much experience with nr, but wouldn't it just give me a list of matches to nr proteins? Or are all nr proteins assigned an EC number or a COG?
I've seen COG/OrthoMCL and a few other older protein clustering packages, but I was curious whether there is a streamlined way to do it.
It seems like COGs exist as PSSMs, so they would require PSI-BLAST by default, and Pfam families are profile HMMs, so HMMER would be used to assign them. Am I right?
Aligning to nr would identify (a lot of) proteins, giving you an idea of what yours could be similar to (or what parts of them could be similar to, in terms of domains and such). You can also search against Pfam to get some help with functional assignments, and for that you would use HMMER.
How was this collection put together (by prediction)? Have you removed redundancies?
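For the Pfam step, here is a rough sketch of how the HMMER results could be collected per protein. It assumes you ran hmmscan against a Pfam-A HMM file with the --domtblout option, so that each non-comment line has the Pfam family name in the first column, its accession in the second, your protein ID in the fourth, and the independent domain E-value in the thirteenth. The function name parse_domtblout and the 1e-5 cutoff are just placeholders for illustration, not a recommended threshold.

```python
from collections import defaultdict


def parse_domtblout(lines, max_ievalue=1e-5):
    """Group Pfam domain hits by query protein.

    `lines` is any iterable of text lines in hmmscan --domtblout format
    (for example an open file handle). The cutoff is an arbitrary example.
    """
    hits = defaultdict(list)
    for line in lines:
        # Skip blank lines and the '#' header/footer comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # A domtblout row has 22 fixed columns plus a free-text description.
        if len(fields) < 22:
            continue
        family, accession, protein = fields[0], fields[1], fields[3]
        ievalue = float(fields[12])  # independent E-value of this domain
        if ievalue <= max_ievalue:
            hits[protein].append((family, accession, ievalue))
    return dict(hits)


# Hypothetical usage; "pfam_hits.domtblout" is a placeholder file name:
# with open("pfam_hits.domtblout") as handle:
#     domains_by_protein = parse_domtblout(handle)
```

The result maps each protein ID to a list of (family, accession, E-value) tuples, so proteins with several domains keep all of them and you can decide later how to merge this with COG or EC assignments.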
This is basically a bacterial pan-genome - all the predicted proteins, with only 100% identical duplicates removed and no further clustering - I'm experimenting with species-specific paralog discrimination.
It's quite massive and would have lots and lots of hits with nr, I am sure.