Hello,
Current Scenario:
I am currently engrossed in the analysis of gram-positive pathogenic bacterial genomes, having meticulously selected approximately 300 genomes for my research. To annotate these genomes, I employed Prokka, which yielded intriguing results, indicating a gene count ranging from 6,000 to 7,000 genes per genome. Subsequently, I harnessed the .GFF files generated by Prokka as inputs for the Pangenome analysis using Roary.
Anticipated Outcomes:
Given the range of 6,000 to 7,000 genes per genome, the total gene count would theoretically amount to:
Total Genes = 6,000 (assuming the minimum value) * 300 (genomes) = 1,800,000 genes
It's noteworthy that existing literature suggests an open pan-genome structure within the bacterial group under study, wherein core genes typically comprise a modest 0.3% to 0.5% of the total gene pool.
Findings So Far:
In an endeavor to unveil the Pangenome's secrets, I conducted Roary analyses employing various BlastP identity thresholds (i=70/80/85/90). However, the total gene count derived from Roary analysis exhibited a perplexing range, fluctuating from 250,000 to 470,000 genes. This range signifies that core genes only account for a mere 0.1% to 0.2% of the entire gene pool.
Naturally, this disparity has left me pondering the root cause of the discrepancy. Upon scrutinizing the results obtained from Prokka, I couldn't discern any apparent issues.
Queries Arise:
- Should I consider employing an alternative program for Pangenome analysis to potentially rectify these unexpected outcomes?
- Given that Roary is widely recommended in the literature for bacterial Pangenome analysis, are there specific adjustments or configurations within the Roary program that I could explore to align with my expectations?
- Could it be possible that my extensive dataset of genomes is contributing to these unexpected results?
I am immensely appreciative of any insights or guidance that anyone can offer to help me address these questions and navigate this intriguing challenge. Thank you in advance for your assistance.
Hi @liorglic,
Thank you for your prompt response.
I totally agree with you that different blastp cuttoffs provide different results and I do observe an open pan-genome structure, however I am concerned about the total genes which ranges from 250,000 to 470,000 genes. Based on the annotation provided by prokka, the total genes in Roary should be close to 1,800,000 genes. And the total genes I derived from Roary is not even to close to that. Isn't that concerning while I try to publish my data in the journal? And also the number of core genes are just 0.1% to 0.2% which is far away from the published literarture (i.e. 0.3% to 0.5%).
Oh, so you are concerned that you have too few genes... In this case, I think you are misunderstanding the concept of genes in a pan genome. Let's say you choose a blastp cutoff of 90%. You align gene1 from genome A to gene2 from genome B, and you find 95% identity. In such cases, gene1 and gene2 will be defined as the same (pan-)gene in the pan-genome, so this gene will be counted only once. So it is always expected that the pan-genome will contain a number of (pan-)genes lower than the total number of genes in all genomes. It would be very weird if you end up with 6000*300=1.8M genes.
Regarding the % of core genes - this depends on your definition for core genes. Are only genes present in 100% of the genomes considered core? If so, than it is not surprising that you get lower percentage compared to studies that used less genomes. You'll get a higher percent if you define core genes e.g. as those present in 95% of the genomes.
I hope this clarifies things a bit.
Hi @liorglic,
Thank you so much for your response. This provides a clarification to most of my doubts and questions regarding pangenome analysis and preferably via Roary. With regards to the core genes, will set it 95%.
So just to confirm one last time, When I use the BlastP identity thresholds (i=70/80/85/90) and get the total gene count derived from Roary ranging from 250,000 to 470,000 genes, that means that I am at the right track and I do not need to be concerned about those 1.8M genes?
Yes, you are correct. If anything, I would be worried that 250k genes is too high. If you have around 6k genes per sample, and you end up with 250k, it means that the overlap between genomes is very low. This could be a biological phenomenon, but you need to validate it. You'll need to take a deeper dive intro the results and see if things make sense.
And here I was wondering that 250K genes is too low. Bacterial Pangenomics is indeed complicated topic to deal with. Sure, will take a look into the results and will do the analysis accordingly.
Thank you so much for all your suggestions and feedback.