Question

Why does phylogenetic tree topology changes, among nucleotide and amino acid sequences basis?

0

Entering edit mode

6.2 years ago

Kumar ▴ 120

I have generated a phylogenetic tree based on the amino acid sequence (28 genes been extracted from 30 bacterial strains and concatenated), which is obtained by AMPHORA tool, the tree topology is proper as I expected. Then, I have extracted the nucleotide sequences of the same genes based on the coordinates mentioned in AMPHORA output amino acid sequences. Most of the coordinates were reverse in order but, I did not make reverse complement of the same. It is because, I strongly rely on alignment tool such as mafft. mafft has --adjustdirection option, which can reverse complement the sequences if it needed. I have opted this option and aligned the nucleotide sequences and generated the tree, but it topology completely vary from the amino acid based tree. I need to have nucleotide sequences based phylogenetic tree for my further analysis. But due to this issue I could not move further in my analysis. Hence, kindly help me to fix this issue.

Thank you in advance.

phylogeny alignment • 2.1k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 6.2 years ago by Kumar ▴ 120

0

Entering edit mode

Help you fix what issue?

If you believe your data to be correct, then the tree is correct.

DNA sequences and Amino Acid sequences evolve at different rates so there are considerations in evolutionary model etc to be made, but there is no way to 'fix' your data, so I'm not really sure what the question is?

Can you elaborate on how exactly your DNA sequences were retrieved? It sounds to me like there's the possibility of some confusion in the reverse complementing etc that you mentioned. If they are concatenated alignments, I would expect that the signal is such that the correct topology would be returned, and be the same as the amino acids, so my first port of call would be to scrutinse your sequences closer.

ADD REPLY • link 6.2 years ago by Joe 22k

0

Entering edit mode

Dear healey, I have extracted the nucleotide sequences from the bacterial genome in reverse order, for example

>gene1 (contig1:520-350)
>gene2 (contig6:420-560)

likewise the concatenated sequences (extracted from the above coordinates) were mixed with proper sequences and reverse order sequences as well. I did not make the reverse complement of those sequences been extracted in reverse order >gene1 (contig1:520-350). However, I have aligned the same with mafft --adjustdirection. As per my opinion, mafft failed to make reverse complement of those sequences mentioned above. I guess the tree topology changes because of this issue. That is why I seek any alternative to reverse complement of the same. I mean is there any tool available better than mafft to do the same.

ADD REPLY • link 6.2 years ago by Kumar ▴ 120

0

Entering edit mode

So you have 'normal' (5' -> 3') amino acid sequences which you aligned, and they look correct?

And you have DNA sequences but some of them are backwards? Why?

I would advise that you simply extract the DNA sequences properly in the correct orientation in the first place. If you're trying to align a set of sequences which are mixed up between reverse and 'normal', of course the tree will change.

ADD REPLY • link 6.2 years ago by Joe 22k

0

Entering edit mode

I agree with you healey, But, Amphora output fasta header is shown below,

>ACPX01000193.1_510 [58245 - 57682] (REVERSE SENSE) Xanthomonas citri pv. aurantifolii str. ICPB 11122 xaub_ctg651_0, whole genome shotgun sequence

I have made the above showed output as ACPX01000193.1:57682-58245 for xarg samtools extraction. If it, less number of sequences, I could do these reverse complementing process effectively. But, I have more than 200 sequences to be reverse complemented. This would be a nightmare. That is why I am in need of any tool, which can sense the reverse order sequences and do the necessary changes. If you know any of the tool which can perform the same, please suggest me.

ADD REPLY • link 6.2 years ago by Kumar ▴ 120

0

Entering edit mode

Sorry, I'm not familiar with AMPHORA, so I'm not sure what it does.

Are the sequences which are reversed on their own in a single file, or mixed in with the sequences which are 'forward'?

Please post a small representative example of the fasta file with the headers and reversed sequences.

ADD REPLY • link 6.2 years ago by Joe 22k

0

Entering edit mode

Amphora output are shown below,

>ACPX01000193.1_510 [58245 - 57682] (REVERSE SENSE) Xanthomonas citri pv. aurantifolii str. ICPB 11122 xaub_ctg651_0, whole genome shotgun sequence
SGCAGSIPLGDCNISTPDNKQNRKNQEIRVPRVRVIGSDGEMVGVLSRDEALAKAEEEGL
DLVEIQPQADPPVCKIMDFGKFKFEQQKKANEAKKKTKQVEIKELKFRPVTDEGDYQIKL

>ACPX01000186.1_237 [28391 - 29581] Xanthomonas citri pv. aurantifolii str. ICPB 11122 xaub_ctg644_0, whole genome shotgun sequence
TERAEPMSIVRMTDLDLSGKRVLIRQDLNVPIDNGQITSEQRITASVPTIKLALEKGAAV
MVTSHLGRPKEGSWTEEDSLAPVATRLAALLGVDVPLVRDWVDGVDVAPGQVVLLENCRM
NVGEGKDDETLARKYAALCDVFVMDAFGTAHRAQASTHGVIRFAPVAAGGPLLMAELDAL

So, I formated the fasta header for samtools xarg extraction as shown below,

ACPX01000193.1:57682-58245
ACPX01000186.1:28391-29581

In this two header, the first one was in reverse sense, So I simply reversed the coordinates. Whereas the second one is proper one, no need to reverse it. likewise AMPHORA gives mixed output.

ADD REPLY • link 6.2 years ago by Kumar ▴ 120

0

Entering edit mode

So, like this I have more than 200 sequences to be reverse complemented. This is where I am facing the problem. So, could you please suggest me any way to do the same.

ADD REPLY • link 6.2 years ago by Kumar ▴ 120

score 1 · Answer 1 · 2019-06-10

The best solution, I feel, would simply be to run this over all the 'jumbled' data, and ensure you have everything in the correct orientation before proceeding. Once you have done so, you can concatenate them or whatever, and proceed to make the tree.

If the sequences are in the correct orientation, it should leave them unmodified, thus only changing the necessary ones. It does not modify the headers however, so they will still say the coordinates in reverse order.

from Bio import SeqIO
import re

regex = re.compile("\[\d*\ -\ \d*\]")

for rec in SeqIO.parse("reverse.fa", "fasta"):
    indexes = re.search(regex, rec.description).group(0).lstrip("[").rstrip("]")
    L, R = indexes.replace(" ", "").split("-")
    if int(L) > int(R):
        print(">{}\n{}".format(rec.description, rec.seq.reverse_complement()))
    else:
        print(">{}\n{}".format(rec.description, rec.seq))