Question

Building a phylogenetic tree from selected bacterial strain sequences downloaded from NCBI

4.6 years ago

ali_karimnezhad ▴ 20

How can I build a phylogenetic tree from selected bacterial strains whose sequences I downloaded from NCBI? For example, I picked strains 'J1776', 'ScottA', and 'R2-502' of Listeria monocytogenes and downloaded three FASTA (.fna) files using the following links:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/438/585/GCF_000438585.1_ASM43858v1/GCF_000438585.1_ASM43858v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/866/905/GCF_009866905.1_ASM986690v1/GCF_009866905.1_ASM986690v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/438/705/GCF_000438705.2_ASM43870v2/GCF_000438705.2_ASM43870v2_genomic.fna.gz

Then I concatenated the *.fna files into one .fasta file and tried MEGA X and BEAUti, but both give me errors indicating that the sequence lengths are not equal. Am I doing something wrong?

alignment sequence • 2.7k views

updated 4.6 years ago by Mensur Dlakic ★ 29k • written 4.6 years ago by ali_karimnezhad ▴ 20

Yes, you are doing something wrong, but without more details there could be many causes. My guess is that you are not aligning the sequences before trying to build the tree. You need to generate a multiple sequence alignment first; MEGA X can do that too. Also, you posted this with the wrong tags: it is not related to assembly.
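As a quick way to see why tree-building tools complain about unequal lengths, here is a minimal sketch that reports the length of each record in a FASTA file and checks whether they all match, which is what an alignment requires. The function names and the toy sequences are made up for illustration; it uses only the Python standard library.

```python
def fasta_lengths(text):
    """Return {record name: sequence length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def looks_aligned(lengths):
    """An alignment needs every row to have the same length."""
    return len(set(lengths.values())) <= 1


# Toy data (hypothetical), standing in for a concatenated .fasta file.
example = """>strainA
ATGCATGCAT
GC
>strainB
ATGCATG
"""

lengths = fasta_lengths(example)
print(lengths)
print("Looks aligned:", looks_aligned(lengths))
```

If the lengths differ, the file is a set of unaligned sequences rather than an alignment, and it has to go through an aligner before tree building.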

4.6 years ago by juanjo75es ▴ 130

Thanks for the reply. I tried aligning the three sequences with MUSCLE in MEGA X, but I got "Error-Alignment Failed: MUSCLE Log file did not end properly, suggesting an unhandled exception." Then I tried Mauve. Although I did not get any errors with Mauve, the resulting FASTA has more than three rows of sequences, and I do not understand why. I was expecting to see three aligned sequences. Do you have any comments on this?

4.6 years ago by ali_karimnezhad ▴ 20

Yep, these sequences are too large to be aligned with common multiple-alignment software; I just focused on the error you were getting from MEGA X. You can do what Mensur says, or you can just select a shorter random sequence and align it, something between 5,000 and 20,000 bp. You'll definitely get a phylogenetic tree. How informative will it be? I don't know; I guess it depends on what you need it for. You can also make trees from different chunks of data and later build an average tree. I guess that will be easier, but I'm not sure whether it will be much less informative.
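Picking a random chunk of a given size could look roughly like the sketch below. The function name and the toy genome are hypothetical. Note that this only cuts a window out of one sequence; the corresponding region in the other genomes would still have to be located (for example with a similarity search) before aligning, and that step is not shown here.

```python
import random


def random_window(sequence, size, seed=None):
    """Return (start, subsequence) for a random window of `size` bases."""
    if size > len(sequence):
        raise ValueError("window is longer than the sequence")
    rng = random.Random(seed)  # seed only makes the choice reproducible
    start = rng.randint(0, len(sequence) - size)
    return start, sequence[start:start + size]


# Toy 40,000 bp "genome" (hypothetical data, not a real Listeria sequence).
toy_genome = "ACGT" * 10000

start, window = random_window(toy_genome, size=5000, seed=1)
print("window start:", start, "window length:", len(window))
```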

4.6 years ago by juanjo75es ▴ 130

Thanks for your suggestion.

4.6 years ago by ali_karimnezhad ▴ 20

score 2 · Answer 1 · 2021-01-10

For genomes of this size it makes little sense to align them whole. Not only is that impractical (it is time-consuming and may require a very large amount of memory), but it is also not very informative biologically because of codon degeneracy. It is possible to have two genomes that are 70% identical at the nucleotide level but 80% identical at the protein level. Aligning nucleotides is more appropriate when genomes are extremely similar or when you are comparing non-coding regions.
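To illustrate the codon degeneracy point, here is a small sketch with two made-up coding sequences that differ only at third codon positions. The codon table is a deliberately partial subset of the standard genetic code, just enough for this toy example, and the helper names are hypothetical.

```python
# Partial standard genetic code: only the codons used in this toy example.
CODON_TABLE = {
    "ATG": "M",
    "CTT": "L", "CTC": "L",
    "TCT": "S", "TCC": "S",
    "CGT": "R", "CGC": "R",
    "GGT": "G", "GGC": "G",
}


def translate(dna):
    """Translate a coding sequence codon by codon (no stop handling)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical genes that differ only by synonymous substitutions.
gene_1 = "ATGCTTTCTCGTGGT"
gene_2 = "ATGCTCTCCCGCGGC"

nt_identity = identity(gene_1, gene_2)
aa_identity = identity(translate(gene_1), translate(gene_2))
print(f"nucleotide identity: {nt_identity:.2f}")
print(f"protein identity:    {aa_identity:.2f}")
```

The nucleotide identity comes out lower than the protein identity because every difference here is a synonymous change, which is the effect described above in miniature.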

A common way of comparing genomes is to find a set of single-copy protein markers they all share, align those proteins individually, and then concatenate all the alignments into a super-matrix. There is a program called ezTree that will do all those steps and build a tree at the end from the concatenated alignment. The problem you may have is that this may end up being a relatively large alignment, because with only 3 genomes it is likely that they share thousands of protein markers. Nevertheless, it is still more feasible than what you are trying to do, and it is likely to be more informative as well.
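The concatenation step on its own can be sketched as follows. This is not how ezTree is implemented, just a minimal illustration of building a super-matrix from per-marker alignments; the function name and the toy protein alignments are invented, and only the strain names come from the question.

```python
def concatenate_alignments(alignments, taxa):
    """Join per-marker alignments (dicts of taxon -> aligned row) end to end.

    Every marker must contain every taxon, and all rows within a marker
    must have the same length.
    """
    parts = {taxon: [] for taxon in taxa}
    for index, alignment in enumerate(alignments):
        missing = [t for t in taxa if t not in alignment]
        if missing:
            raise ValueError(f"marker {index} is missing taxa: {missing}")
        if len({len(alignment[t]) for t in taxa}) != 1:
            raise ValueError(f"marker {index} has rows of unequal length")
        for taxon in taxa:
            parts[taxon].append(alignment[taxon])
    return {taxon: "".join(rows) for taxon, rows in parts.items()}


taxa = ["J1776", "ScottA", "R2-502"]

# Toy aligned protein markers (hypothetical sequences).
marker_1 = {"J1776": "MKT-LV", "ScottA": "MKTALV", "R2-502": "MKT-LI"}
marker_2 = {"J1776": "GHW", "ScottA": "GHW", "R2-502": "GQW"}

supermatrix = concatenate_alignments([marker_1, marker_2], taxa)
for taxon in taxa:
    print(taxon, supermatrix[taxon])
```

Each row of the result has the same length, so the super-matrix can be passed to a tree-building program as a single alignment.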