I have 8 metagenomic samples (bacterial DNA) for which I have generated count matrices (I used Prokka for annotation). This means I now have the abundance of every gene in each of my 8 samples. I then normalized the samples with TPM (Transcripts Per Million), since my samples differ in size, so that I can compare them more reliably. I want to group my genes into COGs (Clusters of Orthologous Groups). My goal is to look at the relative gene abundance in the different samples and to compare specific COGs to other COGs across samples. But now I have run into a problem.
When I group my genes into COGs, the samples with the most functionally annotated proteins (as opposed to "hypothetical proteins") will naturally have more total reads mapped to COGs. Say sample "A" has 60% hypothetical proteins and sample "B" has 40% hypothetical proteins; then sample "B" has more functionally annotated proteins (60% versus 40%), and thus more proteins (and consequently more reads) will group into each COG. So when I compare the same COG across samples, in most cases the COG from sample "B" will have more reads mapped to it.
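To make the effect concrete, here is a toy sketch in Python/pandas. The gene table, the COG assignments, and the TPM values are all invented for illustration (they are not from my data); the hypothetical proteins are the rows with no COG. Both samples have exactly the same composition among their annotated genes, yet every COG looks more abundant in "B" simply because a larger share of its TPM is annotated.

```python
import pandas as pd

# Invented toy data: per-gene TPM for two samples.
# cog is None for genes annotated only as "hypothetical protein".
genes = pd.DataFrame(
    {
        "cog": ["COG0001", "COG0001", "COG0002", None, None],
        "A": [100_000.0, 100_000.0, 200_000.0, 300_000.0, 300_000.0],
        "B": [150_000.0, 150_000.0, 300_000.0, 200_000.0, 200_000.0],
    }
)

samples = ["A", "B"]

# Sum TPM per COG, ignoring genes without a COG assignment.
cog_tpm = genes.dropna(subset=["cog"]).groupby("cog")[samples].sum()

# Share of each sample's total TPM that falls into any COG.
annotated_fraction = cog_tpm.sum() / genes[samples].sum()

# Composition among annotated genes only (each column sums to 1).
within_annotated = cog_tpm / cog_tpm.sum()

print(cog_tpm)
print(annotated_fraction)
print(within_annotated)
```

In this toy case the per-COG TPM differs between samples only because of the annotated fraction (0.4 versus 0.6), while the within-annotated proportions are identical. Whether that second view is the right one for real data is exactly what I am unsure about.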
How do I solve this problem? If I recalculate my TPM only for the COGs (disregarding hypothetical proteins), would that give me a false picture of the relative gene abundance?
Thanks!!
Cool, thanks! After examining the reference, and if I understand it correctly: I provide MUSiCC with an abundance file of the genes and their mapped reads, and then I can confidently assign them to COGs? Also, do I have to TPM-normalize afterwards?
I actually never used MUSiCC, I just borrowed their list of single-copy proteins and used it for normalization in DESeq2.
No other TPM normalization is needed. I know the list uses KOs rather than COGs, and I'm not sure the two can be mapped 1:1.
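For anyone wondering what "using the single-copy proteins for normalization" could look like, below is a minimal sketch of median-of-ratios size factors computed only from a set of control genes. It is written in Python/NumPy for illustration and is not the DESeq2 code itself; in DESeq2 the analogous route would be the `controlGenes` argument of `estimateSizeFactors`. The function name, the count matrix, and the choice of control rows are all made up for the example.

```python
import numpy as np

def size_factors_from_controls(counts, control_rows):
    """Median-of-ratios size factors using only the given control genes.

    counts: 2-D array, rows are genes and columns are samples (raw counts).
    control_rows: indices of rows treated as single-copy control genes.
    """
    ctrl = np.asarray(counts, dtype=float)[control_rows]
    # Keep only control genes observed in every sample (log of 0 is undefined).
    ctrl = ctrl[(ctrl > 0).all(axis=1)]
    if ctrl.shape[0] == 0:
        raise ValueError("no control gene has nonzero counts in all samples")
    log_ctrl = np.log(ctrl)
    # Per-gene log geometric mean across samples serves as the reference.
    log_geo_mean = log_ctrl.mean(axis=1, keepdims=True)
    # Size factor per sample: median ratio to the reference over control genes.
    return np.exp(np.median(log_ctrl - log_geo_mean, axis=0))

# Invented counts: rows 0-2 play the role of single-copy genes, row 3 is another gene.
counts = np.array([[10, 20], [100, 200], [5, 10], [7, 70]])
sf = size_factors_from_controls(counts, control_rows=[0, 1, 2])
normalized = counts / sf  # divide each sample (column) by its size factor
print(sf)
print(normalized)
```

Here the second sample has twice the single-copy gene counts of the first, so its size factor is twice as large, and the last gene still comes out five times higher after normalization rather than ten times.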
But do I have to associate my genes with KOs? I'm not familiar with KO. You linked a list of KO IDs; are those the same as the single-copy proteins?
The list of KOs consists of single-copy proteins. I believe it's the same set used in other tools like CheckM, introduced by Peer Bork's group. This paper lists their COGs: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0022099
I see. I just don't get how a single-copy marker gene is identified if MUSiCC is only given a KO/COG ID and an abundance number.
The input is an abundance table of COGs. The single-copy genes (SCGs) are identified by their IDs in that table, and their abundances are used for normalization.
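As a rough illustration of the idea (not MUSiCC's actual code), the sketch below takes a COG-by-sample abundance table, looks up the rows whose IDs are in a single-copy gene list, and divides every COG by the per-sample median of those rows. The function name, the IDs, and the numbers are placeholders I made up; a real run would use the published single-copy list.

```python
import pandas as pd

def normalize_by_scg(cog_abundance: pd.DataFrame, scg_ids: list) -> pd.DataFrame:
    """Divide each sample's COG abundances by its median single-copy gene abundance."""
    # The SCGs are recognized purely by ID: keep those present in the table.
    present = cog_abundance.index.intersection(scg_ids)
    if present.empty:
        raise ValueError("none of the single-copy gene IDs are in the table")
    # One median per sample (column), computed over the SCG rows only.
    scg_median = cog_abundance.loc[present].median(axis=0)
    # Column-wise division: values become abundance relative to a single-copy gene.
    return cog_abundance.div(scg_median, axis=1)

# Placeholder table: rows are COG IDs, columns are samples.
cog_abundance = pd.DataFrame(
    {"A": [10.0, 12.0, 8.0, 30.0], "B": [20.0, 18.0, 22.0, 30.0]},
    index=["SCG_a", "SCG_b", "SCG_c", "COG_x"],
)
scg_ids = ["SCG_a", "SCG_b", "SCG_c"]  # placeholder single-copy IDs

normalized = normalize_by_scg(cog_abundance, scg_ids)
print(normalized)
```

With these made-up numbers, COG_x has the same raw abundance in both samples but comes out at 3.0 in "A" and 1.5 in "B", i.e. roughly three versus one and a half copies per genome, because "B" contains twice as much single-copy gene signal.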