Reads per COG normalization
5.8 years ago

Hello everyone,

I aligned a metagenome sample using the eggnog-mapper software and processed the output to get the number of reads mapped to each COG (Clusters of Orthologous Groups).

However, some COGs have a very high number of reads mapped. This happens for COGs that include many genes and species, e.g. COG0001 contains 4869 genes in 3358 species (https://version-11-0.string-db.org/cgi/network.pl?networkId=t67BvacQZjAN).

Is there some way to normalize the number of reads mapped to each COG, taking this information into account?

alignment COG metagenomics • 1.7k views

There are many ways to normalize the data; the real question is why you want to normalize it.


Hi Asaf,

The question is: "Given a metagenomic sample, which COGs are enriched in this sample?"

I intend to test for enrichment of the COGs found in a sample using the number of hits to each COG, or some normalization of those counts, together with the STRING COG-COG interaction data.


You can only compare COGs between different sets of samples. In that case, just run DESeq2 or edgeR on your count table and they will do the normalization. I would recommend using a set of single-copy proteins for normalization, as in MUSiCC.
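
To make the single-copy idea concrete, here is a minimal Python sketch. It is not MUSiCC's actual implementation, just the underlying intuition: divide every COG count in a sample by the median count of a few single-copy marker COGs, so values read roughly as "copies per genome equivalent". The function name, the marker list, and all counts are made up for illustration, and gene length is ignored.

```python
from statistics import median


def normalize_by_marker_cogs(cog_counts, marker_cogs):
    """Scale COG read counts by the median count of single-copy marker COGs."""
    marker_values = [cog_counts[c] for c in marker_cogs if c in cog_counts]
    if not marker_values:
        raise ValueError("none of the marker COGs are present in the count table")
    scale = median(marker_values)
    if scale == 0:
        raise ValueError("median marker count is zero; cannot normalize")
    return {cog: count / scale for cog, count in cog_counts.items()}


# Made-up read counts for one sample (illustration only).
sample_counts = {
    "COG0001": 1200,
    "COG0500": 9000,
    "COG0012": 300,
    "COG0016": 280,
    "COG0048": 320,
}
# Placeholder marker set; substitute a curated single-copy list you trust.
markers = ["COG0012", "COG0016", "COG0048"]

normalized = normalize_by_marker_cogs(sample_counts, markers)
for cog, value in sorted(normalized.items()):
    print(f"{cog}\t{value:.2f}")
```

For an actual comparison between groups of samples, the DESeq2 or edgeR normalization mentioned above is still the safer route; this only shows what a marker-based scale factor does to a single sample.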

5.5 years ago
Mensur Dlakic ★ 28k

Is there some way to normalize the number of reads mapped to each COG, taking this information into account?

For COGs representing common protein functions, there will be multiple hits within a genome. If you look at COG0500, right next to the one you pointed out, it says SAM-dependent methyltransferase (60407 genes in 4956 species). Taken at face value, this means that on average there are more than 10 hits per genome to this COG. It doesn't necessarily mean that there are 10+ methyltransferases (MTRs) that are actual members of COG0500, but it does mean that there are 10+ MTRs of some kind that match COG0500 to some degree. Anyway, without knowing the exact number of potential matches to each COG in each individual genome, I don't think one can use this information for accurate normalization. It may work as a rough estimate, though.
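
As a rough illustration of that estimate, the sketch below computes the average number of genes per species for the two COGs mentioned in this thread and divides hypothetical read counts by it. The gene and species numbers are the ones quoted above; the read counts and function names are invented, and the result should be treated as a crude approximation at best.

```python
def genes_per_species(n_genes, n_species):
    """Average number of member genes per species for one COG."""
    return n_genes / n_species


# (genes, species) as quoted in this thread.
cog_sizes = {
    "COG0001": (4869, 3358),
    "COG0500": (60407, 4956),
}
avg_copies = {cog: genes_per_species(g, s) for cog, (g, s) in cog_sizes.items()}


def rough_normalize(read_counts, avg_copies):
    """Divide each COG's read count by its average copies per species."""
    return {
        cog: read_counts[cog] / avg_copies[cog]
        for cog in read_counts
        if cog in avg_copies
    }


# Hypothetical read counts, for illustration only.
read_counts = {"COG0001": 1200, "COG0500": 9000}
adjusted = rough_normalize(read_counts, avg_copies)

for cog in sorted(avg_copies):
    print(f"{cog}: {avg_copies[cog]:.2f} genes/species, adjusted count {adjusted[cog]:.1f}")
```

With these numbers COG0500 averages about 12 genes per species versus about 1.5 for COG0001, which is why raw counts for the two are not directly comparable.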


I agree. As a side note, in my experience the mapping is much more accurate when you build an assembly, predict proteins, assign them to COGs, and then count the number of reads mapped to each protein, rather than assigning the reads directly.
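
The last step of that workflow (rolling per-protein read counts up to COGs) could look something like the sketch below. It assumes you already have two tables from your own pipeline: reads mapped per predicted protein, and a protein-to-COG assignment. The protein IDs, counts, and function name are hypothetical.

```python
from collections import defaultdict


def reads_per_cog(protein_read_counts, protein_to_cog):
    """Sum read counts over all predicted proteins assigned to the same COG."""
    totals = defaultdict(int)
    for protein, n_reads in protein_read_counts.items():
        cog = protein_to_cog.get(protein)
        if cog is None:
            continue  # protein without a COG assignment is skipped
        totals[cog] += n_reads
    return dict(totals)


# Hypothetical inputs, for illustration only.
protein_read_counts = {"prot_1": 150, "prot_2": 40, "prot_3": 75, "prot_4": 10}
protein_to_cog = {"prot_1": "COG0500", "prot_2": "COG0001", "prot_3": "COG0500"}

cog_totals = reads_per_cog(protein_read_counts, protein_to_cog)
print(cog_totals)
```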
