I aligned a metagenome sample using the eggnog-mapper software and processed the output to get the number of reads mapped to each COG (Cluster of Orthologous Groups).
Is there some way to normalize the number of reads mapped to each COG, taking this information into account?
For COGs representing common protein functions, there will be multiple hits within a genome. If you look at COG0500, right next to the one you pointed out, it says SAM-dependent methyltransferase (60407 genes in 4956 species). Taken at face value, this means that on average there are more than 10 hits per genome to this COG (60407 / 4956 is about 12). It doesn't necessarily mean that there are 10+ methyltransferases (MTRs) that are actually members of COG0500, but it does mean that there are 10+ MTRs of some kind that match COG0500 to some degree. Anyway, without knowing the exact number of potential matches to each COG in each individual genome, I don't think one can use this information for accurate normalization. It may work as a rough estimate, though.
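To make the rough estimate concrete, here is a minimal Python sketch. It assumes the species count shown on the COG page can stand in for the number of genomes, which is exactly the "taken at face value" caveat above. The read count and the function names are hypothetical and only there for illustration.

```python
def mean_copies_per_genome(n_genes: int, n_species: int) -> float:
    """Average number of genes assigned to a COG per species (used as a proxy for genomes)."""
    if n_species <= 0:
        raise ValueError("n_species must be positive")
    return n_genes / n_species


def rough_normalize(read_counts: dict, cog_stats: dict) -> dict:
    """Divide each COG's read count by its average copy number per genome.

    cog_stats maps a COG id to (n_genes, n_species) as listed in the COG database.
    This is only a rough correction: it ignores how copy number varies
    between the genomes actually present in the sample.
    """
    normalized = {}
    for cog, reads in read_counts.items():
        n_genes, n_species = cog_stats[cog]
        normalized[cog] = reads / mean_copies_per_genome(n_genes, n_species)
    return normalized


# Gene and species numbers quoted above for COG0500
cog_stats = {"COG0500": (60407, 4956)}
# Hypothetical read count, not taken from any real sample
read_counts = {"COG0500": 1200}

copies = mean_copies_per_genome(*cog_stats["COG0500"])
normalized = rough_normalize(read_counts, cog_stats)
print(f"COG0500: {copies:.2f} copies per genome on average")
print(f"COG0500: {normalized['COG0500']:.1f} reads per average copy")
```

The division only rescales each COG by a database-wide average, so it cannot correct for a sample dominated by genomes with unusually many or few copies.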
I agree. As a side note, in my experience the mapping is much more accurate when you have an assembly, predict proteins, assign them to COGs, and then count the number of reads mapped to each protein, rather than assigning the reads to COGs directly.
There are a lot of ways to normalize the data; the only question is, why do you want to normalize it?
Hi Asaf,
The question is: "Given a metagenomic sample, which COGs are enriched in this sample?"
I intend to test for enrichment of the COGs found in a sample using the number of hits to each COG, or some normalization applied to this information, as well as the STRING COG-COG interaction data.
You can only compare COGs between different sets of samples. In that case, just run DESeq2 or edgeR on your count table and they will do the normalization. I would recommend using a set of single-copy proteins for normalization, as in MUSiCC.
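The idea of normalizing by single-copy proteins can be sketched in a few lines of Python. This is not MUSiCC's actual implementation, only an illustration of the principle: the median count of a set of marker genes assumed to be present once per genome estimates the number of genomes sampled, and every COG count is divided by it. The marker identifiers and all counts below are hypothetical placeholders.

```python
from statistics import median


def normalize_by_single_copy(counts: dict, marker_cogs: list) -> dict:
    """Scale COG counts so that values approximate average copies per genome.

    counts      -- mapping of COG id to read count for one sample
    marker_cogs -- ids of COGs assumed to occur once per genome
    """
    marker_counts = [counts[c] for c in marker_cogs if c in counts]
    if not marker_counts:
        raise ValueError("none of the marker COGs were found in the count table")
    scale = median(marker_counts)
    if scale == 0:
        raise ValueError("median marker count is zero; cannot normalize")
    return {cog: n / scale for cog, n in counts.items()}


# Hypothetical counts for one sample; MARKER_* stand in for real single-copy COGs
sample_counts = {
    "COG0500": 1200,
    "MARKER_A": 90,
    "MARKER_B": 100,
    "MARKER_C": 110,
}
markers = ["MARKER_A", "MARKER_B", "MARKER_C"]

abundance = normalize_by_single_copy(sample_counts, markers)
print(abundance["COG0500"])  # 1200 / 100 = 12.0 copies per genome
```

The sketch ignores gene length, so counts would need to be length-corrected first if the COGs differ much in size. It also gives a within-sample abundance only; testing for differences between sets of samples is still a job for DESeq2 or edgeR.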