I am a CS major with little biology background. For some experiments, I need to simulate DNA alignments along a phylogenetic tree (generated under a Yule process using the r8s program). I was able to produce "some" sequences from that tree using Seq-Gen (see the Seq-Gen manual) with the following parameters:
seq-gen -o p -m GTR -i 0.01 -f 0.3 0.2 0.2 0.3 -s 0.5 -z <some_seed> -l 2000
where
- -o: output file format
- -m: substitution model
- -i: proportion of invariable sites
- -f: state (nucleotide) frequencies
- -s: branch length scale
- -z: random number seed
- -l: sequence length
The problem is that when I use RAxML to infer a maximum likelihood tree from these sequences, the resulting tree is very different from the one the sequence data was generated from: their Robinson-Foulds (RF) distance is close to the maximum possible. I suspect something is wrong with my Seq-Gen parameters. Perhaps they are not biologically sensible, or perhaps I need more parameters, such as the -r option to specify substitution rates? (In case it matters, I used the GTRGAMMA model in RAxML.)
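For reference, the RF comparison I mean can be sketched as follows. This is only a minimal illustration, not what RAxML does internally: trees are written as nested Python tuples of leaf names (an assumption that avoids a Newick parser), and the function names are hypothetical. For two fully resolved unrooted trees on n leaves, the maximum RF distance is 2(n - 3), which is what the normalisation divides by.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (leaves are plain strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree, treating it as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # canonical side of a split = the side without the anchor
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def normalised_rf(tree_a, tree_b):
    """RF distance divided by 2(n - 3), the maximum for binary unrooted trees."""
    n = len(leaf_set(tree_a))
    if n < 4:
        raise ValueError("need at least 4 leaves for a non-trivial split")
    return rf_distance(tree_a, tree_b) / (2 * (n - 3))
```

A normalised value near 1.0 is what I am seeing between the generating tree and the RAxML tree.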
Update:
I noticed that when I change the branch lengths of the tree, the problem is mitigated. So my question #0 is: how should I set the branch lengths of a (say, ultrametric) phylogeny so that they make biological sense? Is there a standard way of doing this? Is any software available for it?
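To make the branch-length change concrete, here is a rough sketch of what I mean by rescaling. It assumes a plain Newick string without quoted labels or bracketed comments, and the helper names are hypothetical. It measures the root-to-tip height of the tree and multiplies every branch length by one factor so that the height matches a chosen target in expected substitutions per site. The target itself is exactly the value I do not know how to pick; the factor plays the same role as the -s value in my command.

```python
import re

# Tokens: parentheses, commas, semicolons, ":length" fields, and labels.
TOKEN = re.compile(r"\s*(\(|\)|,|;|:[^,();]+|[^,():;\s]+)")
BRANCH = re.compile(r":([0-9.]+(?:[eE][-+]?[0-9]+)?)")


def tree_height(newick):
    """Largest root-to-tip distance (any branch length on the root is ignored)."""
    stack = []                  # one list of child depths per open clade
    height, length = 0.0, 0.0   # current node: height below it, branch above it
    for tok in TOKEN.findall(newick):
        if tok == "(":
            stack.append([])
            height, length = 0.0, 0.0
        elif tok in (",", ")"):
            stack[-1].append(height + length)   # finish the current child
            if tok == ")":
                height, length = max(stack.pop()), 0.0
            else:
                height, length = 0.0, 0.0
        elif tok.startswith(":"):
            length = float(tok[1:])
        # labels and ';' carry no length information
    return height


def scale_newick(newick, factor):
    """Multiply every branch length in the Newick string by `factor`."""
    return BRANCH.sub(lambda m: f":{float(m.group(1)) * factor:.6g}", newick)


def rescale_to_height(newick, target_height):
    """Rescale so the root-to-tip height equals `target_height`."""
    return scale_newick(newick, target_height / tree_height(newick))
```

For example, `rescale_to_height("((A:1,B:1):2,C:3);", 0.3)` shrinks a tree of height 3 by a factor of 0.1 while keeping it ultrametric.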
Just as a sanity check, could you generate a pairwise distance matrix for the sequences that were generated? Also, remember that Seq-Gen is probabilistic: a stochastic process may, by pure chance, generate sequences that conflict with the coalescent tree.
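A minimal sketch of such a check, assuming the alignment has already been read into a dict mapping taxon names to equal-length sequences (the function names here are hypothetical). It computes the uncorrected p-distance for every pair and, as an optional extra, the Jukes-Cantor correction. The correction is a simpler model than the GTR used for simulation, so treat it as a rough indicator only. Under Jukes-Cantor, p-distances approaching 0.75 mean the sequences are essentially saturated, while values near 0 mean there is almost no signal.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance; infinite once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Return (names, dist) where dist[(a, b)] is the p-distance between a and b."""
    names = list(seqs)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        dist[(a, b)] = dist[(b, a)] = p_distance(seqs[a], seqs[b])
    return names, dist


if __name__ == "__main__":
    # Toy alignment, not real Seq-Gen output.
    toy = {"t1": "ACGTACGT", "t2": "ACGTACGA", "t3": "TTTTACGT"}
    names, dist = distance_matrix(toy)
    for a in names:
        print(a, " ".join(f"{dist[(a, b)]:.3f}" for b in names))
```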