Multiple alignment of genome assemblies to generate a tree
14 months ago
pramach1 ▴ 40

I have 2000 genome assemblies of a particular Salmonella serovar, each in its own FASTA file (each file may contain 5-6 contigs). I need to align them to each other so that I can generate a tree. What would be the best tool to align these sequences? The aligned output needs to be a single FASTA file. Thank you.

genome-assemblies multiple-sequence-alignment • 892 views

Why do you need a tree? A tree with 2,000 tips isn't really good for anything, and it would take pretty much forever to build.

I suggest all-vs-all Mash (choose the k-mer size k and the sketch size s wisely), then clustering the resulting distance matrix with affinity propagation. Even on a laptop, this takes only a few minutes. If you use the R AP implementation, you can also output a heatmap with a dendrogram.
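To make the distance matrix step concrete, here is a minimal Python sketch that turns the tab-separated output of an all-vs-all `mash dist` run into a symmetric matrix. It assumes the usual five columns (reference, query, distance, p-value, shared hashes); the function name `read_mash_dist` and the file names in the example are hypothetical, and the distances are made up for illustration.

```python
import csv
import io


def read_mash_dist(handle):
    """Build a symmetric distance matrix from `mash dist` tabular output.

    Assumed row layout: reference, query, distance, p-value, shared hashes.
    Returns (names, matrix), where matrix[i][j] is the distance between
    names[i] and names[j]. Pairs that never appear stay at 0.0.
    """
    names = []
    index = {}
    pairs = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        ref, query, dist = row[0], row[1], float(row[2])
        for name in (ref, query):
            if name not in index:
                index[name] = len(names)
                names.append(name)
        pairs[(ref, query)] = dist

    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for (ref, query), dist in pairs.items():
        i, j = index[ref], index[query]
        matrix[i][j] = dist
        matrix[j][i] = dist
    return names, matrix


# Made-up example rows standing in for real `mash dist` output.
example = (
    "a.fasta\ta.fasta\t0\t0\t1000/1000\n"
    "a.fasta\tb.fasta\t0.0012\t0\t950/1000\n"
    "b.fasta\ta.fasta\t0.0012\t0\t950/1000\n"
    "b.fasta\tb.fasta\t0\t0\t1000/1000\n"
)
names, matrix = read_mash_dist(io.StringIO(example))
print(names)
print(matrix)
```

The resulting matrix can then be handed to whichever affinity propagation implementation you prefer. Most of them expect similarities rather than distances, so negating the distances first is one common choice.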


Thank you. This answer led me to discover mashtree. I have access to a cluster and it was really fast. Thank you.

14 months ago
Mensur Dlakic ★ 28k

Aligning 2000 full bacterial genomes is pretty much impossible, and the result would be intractable to inspect visually. If you really want to stick with whole genomes, this program will make a cladogram based on average nucleotide identity:

https://github.com/MrOlm/drep

Beyond that, I suggest you remove all the genomes that are less than 20-30% complete, select a set of ~100 single-copy protein markers, and then make a tree based on concatenated protein alignments. That is going to be more tractable than actually making genome alignments.
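The concatenation step can be sketched in a few lines of Python. This is only an illustration under some assumptions: each marker has already been aligned separately with a tool of your choice, each alignment is a FASTA file whose record IDs are the genome names, and a genome missing a marker is padded with gaps. The function names `read_fasta` and `concatenate_alignments` are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the genome ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def concatenate_alignments(alignments):
    """Join per-marker alignments into one sequence per genome.

    `alignments` is a list of {genome_id: aligned_sequence} dicts, one per
    marker. Genomes absent from a marker get a run of '-' of that marker's
    alignment width, so all concatenated sequences end up the same length.
    """
    genomes = sorted({g for aln in alignments for g in aln})
    parts = {g: [] for g in genomes}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in a marker alignment differ in length")
        width = lengths.pop()
        for g in genomes:
            parts[g].append(aln.get(g, "-" * width))
    return {g: "".join(chunks) for g, chunks in parts.items()}


# Toy input: two markers, the second one missing from genome g2.
marker1 = read_fasta(">g1\nMK-L\n>g2\nMKAL\n")
marker2 = read_fasta(">g1\nTT\n")
for genome, seq in concatenate_alignments([marker1, marker2]).items():
    print(f">{genome}\n{seq}")
```

The printed output is a single FASTA-style alignment that a tree-building program can take as input.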
