Should assemblies be removed if core gene alignment is identical?
2.2 years ago
c_u ▴ 520

I have a set of ~100 bacterial genomes. I annotated them with Prokka and performed pangenome analysis with Roary. Roary outputs a core gene alignment file, which I then used to build a phylogenetic tree with RAxML. While running, RAxML printed this warning:

IMPORTANT WARNING - Found 13 sequences that are exactly identical to other sequences in the alignment. Normally they should be excluded from the analysis.

My question is: should I remove these 13 sequences from my subsequent pangenomic, phylogenomic, and other analyses based on this information? At first I thought it would be obvious to remove these redundant/clonal sequences so that they don't mess up the statistics for gene enrichment etc. But a counter argument is that these 13 sequences are being called as exactly identical to other sequences in my database based on the core gene alignment only. What about any differences these 13 assemblies may have (from the sequences these are supposedly identical to) in the non-core genome?

In other words, what if these assemblies are actually unique, but their uniqueness lies in accessory genes that are present in only a subset of the assemblies, rather than in the core genes?

phylogenetics genome alignment pangenome • 1.1k views

Did you just delete your previous version of this question to (re-)post this exact same one?

(to get back on top of the list?)


Hi lieven. Yes, that's what I did (with a modification in the title). Should I not have done that? I posted my question 6 days ago and didn't get a response, so I thought I'd give it another try.

2.2 years ago
Mensur Dlakic ★ 28k

Normally they should be excluded from the analysis.

It will not mess up statistics or anything like that, but there will be no separation between those samples. Because their distances are zero (they are identical), they will sit together on a branch with no horizontal separation between the leaves. That is not informative, and their vertical order is arbitrary, so the suggestion is to keep only one representative of each identical group to avoid cluttering the tree. You can say in the figure legend or somewhere in the text that 13 samples were identical to others and that only one representative of each group is shown.
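If you want to see which samples are identical to which, something along these lines might help. It is only a minimal sketch, assuming the core gene alignment is a multi-FASTA file; the helper names `read_fasta` and `group_identical` are made up for illustration, and the comparison is a plain case-insensitive string match, which may not be exactly how RAxML decides that two sequences are identical.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sample name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_identical(records):
    """Return lists of sample names whose aligned sequences match exactly."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy alignment (hypothetical data): A, B and D are identical, C differs.
demo = """>A
ACGT-ACGT
>B
ACGT-ACGT
>C
ACGTTACGT
>D
acgt-acgt
""".splitlines()

groups = group_identical(read_fasta(demo))
for names in groups:
    # Keep the first sample as the representative, report the rest.
    print(f"keep {names[0]}, identical to it: {', '.join(names[1:])}")
```

On real data you would pass an open file handle instead of `demo`, for example the alignment written by Roary (assumed here to be named `core_gene_alignment.aln`). The group listing is also what you would cite in the figure legend.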

But a counter argument is that these 13 sequences are being called as exactly identical to other sequences in my database based on the core gene alignment only. What about any differences these 13 assemblies may have (from the sequences these are supposedly identical to) in the non-core genome?

We don't know about differences outside of the core, and this tree can't answer that question. You could separately run Roary on this subset to find out, but it is a safe bet that samples whose core genes are all identical are near-identical overall. It is up to you to decide whether to dig that deep and find out that some of them are maybe only 99.999% identical instead of 100%. For most applications this is a distinction without a difference, but it may matter for your purpose.
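As a cheaper first look than a separate Roary run, you could compare gene content between two core-identical samples directly from the presence/absence table. This is a rough sketch, assuming a tab-delimited table with a gene column followed by one 0/1 column per sample (the layout I would expect from Roary's `gene_presence_absence.Rtab`); the function names and the toy table are hypothetical.

```python
import csv
import io


def load_presence_absence(handle):
    """Map each sample name to the set of genes marked present ("1")."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column holds the gene name
    genes = {sample: set() for sample in samples}
    for row in reader:
        if not row:
            continue
        gene = row[0]
        for sample, value in zip(samples, row[1:]):
            if value.strip() == "1":
                genes[sample].add(gene)
    return genes


def gene_content_difference(genes, a, b):
    """Return (genes only in a, genes only in b), each sorted by name."""
    return sorted(genes[a] - genes[b]), sorted(genes[b] - genes[a])


# Toy table (hypothetical data): S1 and S3 share all genes, S2 differs.
demo_table = io.StringIO(
    "Gene\tS1\tS2\tS3\n"
    "gyrA\t1\t1\t1\n"
    "blaTEM\t1\t0\t1\n"
    "tetA\t0\t1\t0\n"
)

genes = load_presence_absence(demo_table)
only_s1, only_s2 = gene_content_difference(genes, "S1", "S2")
print("only in S1:", only_s1)
print("only in S2:", only_s2)
```

If both lists come back empty for every pair within an identical group, the samples are indistinguishable at the level of gene content as well. Keep in mind that presence/absence calls from draft assemblies can differ because of assembly or annotation artifacts, so a handful of differences is not necessarily biological.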


That was very helpful. Thank you so much!
