Phylogenetic Tree from Massive Multifasta Alignment?
3.2 years ago
jdru ▴ 10

Hi all,

I have a very large multi-FASTA alignment (~30,000 sequences, each ~17,000 bases), and I am wondering whether it is too large for constructing a phylogenetic tree. If not, which program would be most appropriate for this use case?

Thank you!

tree alignment fasta phylogeny • 2.1k views

How was the multifasta generated? Generally I would be very skeptical of the quality of any MSA of that size. Most tools break down long before that.


It was generated with MAFFT. I agree; constructing the tree is actually part of the post-processing/quality checking.
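
Since the tree is meant as a quality check, a cheaper first pass is to confirm that the file is a well-formed alignment at all. Below is a minimal sketch, assuming a plain FASTA file with `-` as the gap character. The function names and the tiny in-memory example are made up for illustration and are not something MAFFT provides.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def alignment_summary(lines):
    """Return the row count, a histogram of row lengths, and the gap fraction."""
    n_seqs = 0
    lengths = Counter()
    gaps = total = 0
    for _, seq in read_fasta(lines):
        n_seqs += 1
        lengths[len(seq)] += 1
        gaps += seq.count("-")
        total += len(seq)
    return {
        "n_seqs": n_seqs,
        "lengths": dict(lengths),
        "gap_fraction": gaps / total if total else 0.0,
    }


# Toy alignment standing in for the real file (hypothetical data).
example = [">a", "ACG-", ">b", "AC--", ">c", "ACGT"]
summary = alignment_summary(example)
print(summary)
```

For a real file, pass the open file handle instead of the list. More than one key in `lengths` means the rows are not all the same length, so the file is not a valid alignment, and a very high `gap_fraction` is a hint that the MSA deserves a closer look before any tree is built.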


I would suggest using RAxML-NG or IQ-TREE. I believe IQ-TREE is faster than RAxML, though.


Unless OP has thousands of cores, I think they would be better off with something like FastTree.


IIRC, IQ-TREE has a fast mode that performs comparably to FastTree.
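
For the FastTree suggestion above, a run on a nucleotide alignment could be scripted roughly as follows. This is a sketch under assumptions: the binary is assumed to be on the PATH under the name `FastTree` (some installs call it `fasttree` or `FastTreeMP`), the file names are placeholders, and GTR is just one reasonable model choice. `-nt`, `-gtr` and `-fastest` are standard FastTree options, and FastTree writes the tree to standard output.

```python
import shutil
import subprocess


def fasttree_command(binary="FastTree", fastest=False):
    """Build the FastTree argument list for a nucleotide alignment under GTR."""
    cmd = [binary, "-nt", "-gtr"]
    if fastest:
        # Speeds up the neighbor-joining phase for very large alignments.
        cmd.append("-fastest")
    return cmd


def run_fasttree(alignment_path, tree_path, binary="FastTree", fastest=False):
    """Run FastTree and write the Newick tree it prints to tree_path."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    cmd = fasttree_command(binary, fastest) + [alignment_path]
    with open(tree_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Show the command that would be run (nothing is executed here).
print(" ".join(fasttree_command(fastest=True) + ["alignment.fasta"]))
```

Calling `run_fasttree("alignment.fasta", "tree.nwk", fastest=True)` would then run it for real, with both paths being hypothetical.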


Just curious: any reason you have and use two accounts?


Oh sorry, I forgot I had already made an account this summer to ask a question (before getting my DTU email). I will go delete the old one.

3.2 years ago
Mensur Dlakic ★ 28k

Unless you are starting a new classification (new tree of life?) or building some sort of public database, 30K sequences is completely unnecessary. For just about any other purpose I can think of, that many sequences is overkill. For publications or grants, it is not practical to inspect trees that have more than a few hundred branches, and even those would have to be collapsed into groups.
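
One simple way to shrink such a dataset, offered here only as an illustration and not something proposed in the thread, is to drop rows that are exact duplicates of an earlier row, since identical sequences add no information to the tree. A minimal sketch, assuming the alignment has already been parsed into `(header, sequence)` pairs; the function name and toy records are hypothetical.

```python
def drop_duplicate_rows(records):
    """Keep the first record for each distinct aligned sequence.

    Comparison is case-insensitive; the original sequence text is preserved.
    """
    kept = []
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept


# Toy records standing in for parsed alignment rows (hypothetical data).
records = [("a", "ACG-"), ("b", "acg-"), ("c", "ACGT")]
unique = drop_duplicate_rows(records)
print(len(records), "->", len(unique))
```

Whether this removes many rows depends entirely on the data. Clustering at a similarity threshold would reduce the set further, but that needs a dedicated tool.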

Your purpose for doing this aside, it will be difficult to get this tree to converge. With IQ-TREE in the fast bootstrap mode (a minimum of 1000 bootstraps, which may not be enough for you) and 20-40 CPUs, it takes half a day for a protein alignment of ~150 sequences of ~15,000 residues each. This may give you some idea of the time needed when you scale up to what you have, and I don't think it scales linearly.
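
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes, optimistically, that runtime grows only linearly with the number of sequences, and it ignores the differences between the reference run (protein, ~15,000 residues) and the question (nucleotide, ~17,000 bases). The second function uses the standard count of unrooted binary trees on n taxa, (2n-5)!!, to show how quickly the search space grows. The function names are made up for illustration.

```python
import math


def linear_runtime_days(ref_days, ref_seqs, target_seqs):
    """Scale a reference runtime linearly with sequence count (optimistic)."""
    return ref_days * target_seqs / ref_seqs


def log10_unrooted_topologies(n_taxa):
    """log10 of (2n-5)!!, the number of unrooted binary trees on n taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return sum(math.log10(2 * k - 5) for k in range(3, n_taxa + 1))


# Half a day for ~150 sequences, scaled to ~30,000 sequences.
days = linear_runtime_days(0.5, 150, 30_000)
print(f"linear estimate: {days:.0f} days")

# Number of decimal digits in the topology count at each size.
for n in (150, 30_000):
    print(n, "taxa: about 10^%.0f topologies" % log10_unrooted_topologies(n))
```

Even the linear figure comes to about 100 days on the same hardware, and since the scaling is worse than linear, the real number would be higher.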

If you still want to do it, you may want to give this a look:

https://cme.h-its.org/exelixis/web/software/examl/index.html
