Running RAxML and MrBayes over concatenated multiple sequence alignments
5.2 years ago
Moses ▴ 150

Hi All,

I have 400 species and 57 marker genes, and my genomes have missing data. On average each genome has only 75% of the marker genes (the missing data is distributed entirely at random), i.e. each marker gene is found in 300 species on average.

I want to infer the phylogeny of these 400 species using all of these marker genes (i.e. with missing data) using two tools, RAxML and MrBayes. I built an individual multiple sequence alignment for each of the 57 markers and then concatenated these 57 alignments side by side. For each genome missing a gene, I simply filled that marker's columns with dashes for that genome and continued.

My question is: what RAxML command should I use to infer a phylogeny from this concatenated alignment? From the documentation and some threads online, I gather that I need to use the “Partitioned models” option in RAxML to tell the program that these sequences are concatenated. I did not understand how to add this parameter. My understanding is that I also need to supply a file that identifies my partitions, and that I need to tell RAxML to use a different evolutionary model for each partition? This is the bit that is confusing me. Right now I have just issued the following command to infer the phylogeny:

raxmlHPC -m PROTGAMMAAUTO -s pfamNCBIGeneSeqs_400_MSA_withMissingData.fasta -n pfamNCBIGeneSeqs_400_MSA_withMissingData_raxml.tree -T 50 -p 12345

Is there another parameter that I should add for this scenario? Also what is the equivalent command using MrBayes to infer phylogeny over the same data?

Thank you for your time!

RAxML MrBayes phylogeny maximum likelihood

A tree of this size will take days with RAxML if you have hundreds of CPUs available, weeks otherwise. As for MrBayes, we are talking months. I can't even estimate the memory requirements, but they will be substantial as well. There is a good chance that your MrBayes run will be interrupted, so I would make sure you understand how its checkpointing works so you can resume a run without losing data.

In light of this, I suggest you consider three possibilities: 1) Work with fewer species; 2) If each marker gene is present in 75% of species on average, I'd venture a guess that some of them are present in fewer than half, and I would exclude those from the analysis; 3) Find your parameters beforehand using ProtTest or IQ-TREE.
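
Suggestion 2 could be sketched roughly as below. This is only an illustration, not part of any of the tools mentioned: the function name `filter_by_occupancy`, the dictionary layout (`{gene: {species: aligned_sequence}}`) and the 0.5 cutoff are all assumptions, and a genome whose entry for a gene is all dashes is counted as missing that gene.

```python
def filter_by_occupancy(alignments, n_species, min_fraction=0.5):
    """Keep only genes found in at least `min_fraction` of the genomes.

    alignments: {gene_name: {species_name: aligned_sequence}} (assumed layout).
    n_species:  total number of genomes in the study.
    """
    kept = {}
    for gene, aln in alignments.items():
        # A genome "has" the gene if its row contains any non-gap character.
        present = sum(1 for seq in aln.values() if set(seq) - {"-"})
        if present / n_species >= min_fraction:
            kept[gene] = aln
    return kept
```

With 400 genomes and the default cutoff, this would drop any marker found in fewer than 200 of them; the right cutoff is a judgment call.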

Wow, we're talking about weeks? This is only my simulation data: I extracted a subset of 400 reference species and deliberately removed 25% of the marker genes, so I know each genome has 75% of the genes (a different subset for each genome). My actual dataset has 3,300 species with missing data. The goal here is to infer phylogenies over genomes with missing data, so I guess RAxML and MrBayes are a no-go for this task?

Both IQ-TREE and FastTree are faster than RAxML, by 10-50x. RAxML has been superseded by RAxML-NG, which is supposedly faster, but I haven't seen benchmarks.

I am not up-to-date regarding Bayesian phylogeny software, but I think BayesAss is faster than MrBayes.

My suggestion to you is to find someone who has done this on a scale comparable to yours, and get some ideas about what it takes. Maybe this paper will give you some ideas, though they used only 16 (relatively small) proteins. In my limited experience with large datasets, FastTree may be a faster choice for ML analysis, and ExaBayes is definitely faster than MrBayes.

The best I can do is explain how my data relates to yours and how long tree reconstructions took. I have done what I describe here on ~55 concatenated proteins (~15000 residues in the concatenated alignment) and 100-120 different species. That is probably similar to your alignment length, but with far fewer species. I don't have exact RAxML numbers handy, but on this alignment and with 30-40 CPUs it takes 2-3 days for 100 fast bootstraps. Scaling to your data, that would mean many days, or possibly weeks if you don't have access to high-performance computing resources. Depending on the exact parameters, a MrBayes run on the same alignment takes 2-4 weeks for about 2 million sampling generations, which may or may not be enough for your purposes. I would say you should forget about MrBayes when thinking about thousands of species, because its parallelization is limited to the number of runs multiplied by the number of chains, so it won't help you much even if you have hundreds of CPUs available. In my limited experience with GPUs and MrBayes, they will not help you enough either with thousands of species.

I see. Thank you all for your suggestions and for pointing out these limitations. This will save me a great deal of time.

The script BeforePhylo can concatenate the alignments and create the RAxML partition table for you.

And that table gets fed to RAxML as input? If so, which parameter specifies it? Is it -q?

Yes, the partitions file is assigned with -q.
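
For anyone who wants to see what such a concatenation step amounts to, here is a minimal Python sketch. It is not the code of BeforePhylo or any other tool. The function name `concatenate`, the input layout (`{gene: {species: aligned_sequence}}`) and the default model name `WAG` are assumptions for illustration. The partition lines follow the `MODEL, name = start-end` layout described in the RAxML manual, so please check the manual for the model names valid for your data.

```python
def concatenate(alignments, model="WAG"):
    """Build a gap-padded supermatrix plus RAxML-style partition lines.

    alignments: {gene_name: {species_name: aligned_sequence}} (assumed layout).
    model:      placeholder model name written on every partition line.
    Returns (supermatrix, partition_lines).
    """
    species = sorted({sp for aln in alignments.values() for sp in aln})
    pieces = {sp: [] for sp in species}
    partition_lines = []
    start = 1  # RAxML partition coordinates are 1-based and inclusive
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not all the same length")
        width = lengths.pop()
        for sp in species:
            # Genomes lacking this gene get a run of dashes of equal width.
            pieces[sp].append(aln.get(sp, "-" * width))
        end = start + width - 1
        partition_lines.append(f"{model}, {gene} = {start}-{end}")
        start = end + 1
    supermatrix = {sp: "".join(parts) for sp, parts in pieces.items()}
    return supermatrix, partition_lines
```

The partition lines would be written one per line to a text file (say partitions.txt, a hypothetical name) and passed with -q next to the concatenated alignment given with -s.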

This BeforePhylo script seems to be broken: the output files are messed up and not compatible with RAxML, even after requesting RAxML format for both the concatenation and the partition file. Instead I used the catsequences program, which seems to be working so far: https://github.com/ChrisCreevey/catsequences

5.2 years ago
Moses ▴ 150

After trying many scripts to concatenate the sequences and produce a partitions file for RAxML, catsequences (https://github.com/ChrisCreevey/catsequences) seems to be working.
