Question

comparative gene quantification, problem with COGs and hypothetical proteins.

0

Entering edit mode

5.5 years ago

Hansen_869 ▴ 80

I have 8 metagenomic samples (bacterial DNA) that I have generated count-matrixes for (I used Prokka for annotation). This means, that I now have the abundance of all genes, in all of my 8 samples. I have then normalized my samples with TPM (Transcript per Million reads), as my sample sizes are of different size, so i can better compare the samples. I want to group my genes into COGs (Cluster of Orthologous Groups). My goal is to look at the relative gene abundance of the different samples, and be able to compare specific COGs to other COGs across different samples. But now I approached a problem.

When I group my genes into COGs, the samples that have the most functionally annotated proteins (as opposed to "Hypothetical proteins"), will naturally have a higher abundance of total reads mapped to the COGs. Say a sample "A" have 60% Hypothetical proteins and sample "B" have 40% Hypothetical proteins, then sample "B" will have more functionally annotated proteins (60%), and thus more proteins (and consequently more reads) will group into each COG. Thus, when i compare 2 similar COGs across samples, in most cases COGs from sample "B" will have more reads mapped to them.

How do I solve this problem? If i re-calculate my TPM only for the COGs (disregarding Hypothetical proteins), would that give me a false picture of the relative gene abundance?

Thanks!!

gene abundance TPM COG • 2.2k views

ADD COMMENT • link 5.5 years ago by Hansen_869 ▴ 80

score 0 · Answer 1 · 2020-02-12

0

Entering edit mode

5.5 years ago

Asaf 10k

Someone else faced this problem a while ago: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0610-8 (MUSiCC)

Basically their solution is to normalize each sample using a list of single-copy proteins, this way you normalize to the estimated number of genomes in your sample.

ADD COMMENT • link 5.5 years ago by Asaf 10k

0

Entering edit mode

Cool thanks! After examing the reference and if I understand it correctly: I provide an abundance file of the genes and their mapped reads to Musicc, and then I can confidently assign them to COGs? Also, do I have to TPM normalize afterwards?

ADD REPLY • link 5.5 years ago by Hansen_869 ▴ 80

0

Entering edit mode

I actually never used MUSiCC, I just borrowed their list of single-copy proteins and used it for normalization in DESeq2.

# A universal list of KOs from musicc https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4391136/ 
universals <- c("K00133","K00789","K00927","K00939","K01689","K01803","K01866","K01867","K01868","K01869","K01870","K01872","K01873","K01874","K01875","K01876","K01881","K01883","K01887","K01889","K01890","K01892","K01937","K02357","K02519","K02528","K02600","K02601","K02835","K02838","K02863","K02864","K02867","K02871","K02874","K02876","K02878","K02879","K02881","K02884","K02886","K02887","K02890","K02892","K02895","K02904","K02906","K02926","K02931","K02933","K02946","K02948","K02950","K02952","K02956","K02961","K02965","K02967","K02982","K02986","K02988","K02992","K02994","K02996","K03040","K03076","K03106","K03110","K03438","K03470","K03664","K03687","K03702","K06942","K09903","K10773")
dds <- estimateSizeFactors(dds, locfunc=shorth, controlGenes=rownames(counts) %in% universals)

No other TPM normalization needed. I know it's KO rather than COG, I'm not sure they can be mapped 1:1.

ADD REPLY • link 5.5 years ago by Asaf 10k

0

Entering edit mode

But do I have to associate my genes with KO? I'm not familar with KO. You wrote a list of KO-IDs, are those the same as the single-copy proteins?

ADD REPLY • link 5.5 years ago by Hansen_869 ▴ 80

0

Entering edit mode

The list of KOs are of single-copy proteins. I believe it's the same one used in other tools like checkm, introduced by Peer Bork's group. This paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0022099 lists their COGs

ADD REPLY • link 5.5 years ago by Asaf 10k

0

Entering edit mode

I see. I just don't get, how a single copy marker gene is identified, if Musicc is just given a KO/COG ID and an abundance number.