Best way to assign pfam/EC/COGs to a set of proteins
6.4 years ago
predeus ★ 2.1k

Hello all,

I am working on a large-scale project that involves annotating a large set of proteins, many of them of unknown function. I was wondering if there's a good way to assign Pfam families, COGs, and Enzyme Commission (EC) numbers to the proteins where possible, so I won't have to rely on other people's assignments or on outdated databases.

So, what is the way to do it in 2018? Of course, more than one result per protein is perfectly OK - I will find a way to integrate them later.

Thank you for any input.

pfam COG enzyme protein • 3.8k views
6.4 years ago
GenoMax 148k

Editing your original post (without any changes) bumps it up to main page (for future reference).

Have you tried the usual DIAMOND search against NCBI nr? There is also the older COG software from NCBI. OrthoMCL might also be of interest.


Thank you, good tip!

I don't have much experience with nr, but wouldn't it just give me a list of matches to nr proteins? Or are all nr proteins assigned EC or COG?

I've seen COG/OrthoMCL and a few other older protein clustering packages, but I was curious whether there is a streamlined way to do it.

It seems like COGs are available as PSSMs, so they require PSI-BLAST by default, and Pfam is distributed as HMM profiles, so HMMER would be used to assign them. Am I right?


Aligning to nr would identify (a lot of) similar proteins, giving you an idea of what your proteins, or parts of them such as individual domains, could be related to. You can also search against Pfam to get some help with functional assignments, and for that you would use HMMER.
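To make the Pfam/HMMER route a bit more concrete, here is a minimal sketch of how per-domain hits could be collected from the table written by something like `hmmscan --domtblout hits.domtblout Pfam-A.hmm proteins.faa` (file names are placeholders). The function name `parse_domtblout` and the `1e-5` i-Evalue cutoff are assumptions for illustration only; Pfam's own gathering thresholds (`--cut_ga`) may be a better choice for real assignments.

```python
from collections import defaultdict

def parse_domtblout(lines, max_ievalue=1e-5):
    """Collect Pfam domain hits per protein from hmmscan --domtblout text.

    `max_ievalue` is an illustrative cutoff, not a recommended value.
    """
    hits = defaultdict(list)
    for line in lines:
        # Skip blank lines and the '#' header/footer lines HMMER writes.
        if not line.strip() or line.startswith("#"):
            continue
        # 22 whitespace-separated columns, then a free-text description.
        fields = line.split(None, 22)
        if len(fields) < 22:
            continue
        # With hmmscan, the target is the Pfam family and the query is the protein.
        family, accession, protein = fields[0], fields[1], fields[3]
        i_evalue = float(fields[12])            # independent (per-domain) E-value
        ali_from, ali_to = int(fields[17]), int(fields[18])
        if i_evalue <= max_ievalue:
            hits[protein].append((ali_from, ali_to, family, accession, i_evalue))
    # Order each protein's domains by their position along the sequence.
    return {protein: sorted(domains) for protein, domains in hits.items()}
```

Calling it as `parse_domtblout(open("hits.domtblout"))` would return a dictionary mapping each protein ID to its list of domain hits.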

How was this collection put together (by prediction)? Have you removed redundancies?


This is basically a bacterial pan-genome: all the predicted proteins, with 100% identical duplicates removed but no further redundancy reduction - I'm experimenting with species-specific paralog discrimination.

It's quite massive and would have lots and lots of hits with nr, I am sure.

6.4 years ago

Have you tried the Bacterial Pan Genome Analysis Pipeline? It gives you all the proteins found in the core, accessory, and unique genomes of your analysis. It has an option to find KEGG and COG orthologs: it searches the KEGG and COG databases using the USEARCH algorithm and gives you the results in tabular format (like BLAST outfmt 6), along with a list of the KO and COG IDs in each type of genome.

Give it a try.

Or in any case, you can simply BLAST your query proteins against whichever database you want and use this paper as a reference to use cutoffs on BLAST score, in order to find significant orthologs or hits.
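As a rough sketch of the cutoff step, the snippet below keeps the best-scoring hit per query from BLAST/USEARCH/DIAMOND tabular output (the standard 12-column outfmt 6 layout). The function name `best_hits` and the default thresholds (`1e-5` E-value, 30% identity) are placeholders chosen for illustration; they are not the cutoffs from the paper and should be replaced with whatever you decide is appropriate.

```python
def best_hits(lines, max_evalue=1e-5, min_identity=30.0):
    """Return {query: best subject} from outfmt 6 lines.

    Columns: qseqid sseqid pident length mismatch gapopen
             qstart qend sstart send evalue bitscore
    The default thresholds are illustrative assumptions only.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        # Discard hits that fail either cutoff.
        if evalue > max_evalue or identity < min_identity:
            continue
        # Keep only the highest-bitscore hit seen so far for this query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}
```

If the subject IDs are COG or KO members, the resulting dictionary can then be joined against the database's own ID-to-group table to get the final assignment.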

6.3 years ago
Carambakaracho ★ 3.3k

This is a bit late, but in case you're interested in annotation rather than clustering: the EggNOG database is a good resource, and eggnog-mapper is a pretty good tool for annotation. You can choose from different data sets and between HMMER and DIAMOND as the search method.
