I have 9 very long sequences with an average length of 620513. I need to align them for phylogenetic analysis. How can I align such big sequences?
Thanks
I've tested:
With varying degrees of accuracy/quality.
I've also done up to about 30kb with CLUSTALO in the past which seemed to work reasonably well.
Kalign and LAST are specifically intended for long sequences though, so start there.
"Muscle" mentions the following:
"2.3 Large alignments: If you have a large number of sequences (a few thousand), or they are very long, then the default settings of MUSCLE may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to T-Coffee and speeds much faster than CLUSTALW. This is done by the option -maxiters 2, as in the following example.
muscle -in seqs.fa -out seqs.afa -maxiters 2
"
As fishgolden stated, the alignment should come before concatenation. I am not sure whether aligning before concatenation and aligning after it would produce the same super-matrix. I did this once and opted to align before concatenating; I think that gives a more reliable result than aligning very long concatenated sequences, which is prone to error.
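To align each ortholog group separately before concatenating, the MUSCLE command quoted above could be generated once per group. This is only a sketch: the function name `muscle_commands`, the directory names and the one-FASTA-file-per-group layout are assumptions, and the flags are simply the ones from the quoted example (other MUSCLE versions may use different option names).

```python
from pathlib import Path


def muscle_commands(fasta_paths, out_dir, maxiters=2):
    """Build one MUSCLE command per ortholog-group FASTA file.

    Hypothetical helper: it only builds the argument lists, mirroring
    'muscle -in seqs.fa -out seqs.afa -maxiters 2' from the quote above.
    """
    commands = []
    for path in fasta_paths:
        src = Path(path)
        # Write each alignment next to the others, e.g. aligned/og1.afa
        dst = Path(out_dir) / (src.stem + ".afa")
        commands.append(
            ["muscle", "-in", str(src), "-out", str(dst),
             "-maxiters", str(maxiters)]
        )
    return commands


if __name__ == "__main__":
    # Assumed layout: groups/og1.fa, groups/og2.fa, ...
    for cmd in muscle_commands(["groups/og1.fa", "groups/og2.fa"], "aligned"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once MUSCLE is installed; with 356 short alignments instead of one 600 kb alignment, each run should be far cheaper.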
Are you certain that those sequences are phylogenetically related, so that an MSA can be logically constructed? If you are not sure about that, trying to align the sequences may produce a meaningless alignment.
You could use a program like Mauve to check whether the sequences are related (i.e. that there are no rearrangements etc.) before trying the MSA.
Yes, I am certain, because I obtained these through orthology detection tools. I obtained 356 genes for each species, so there are 356 orthologous groups in which each group has no more than one gene per species. In other words, each group has a single copy of an ortholog across all the species. After that I merged all the single-copy orthologs of a species to create a single super-sequence for each species. That is why these sequences are so big. I did this to generate a species tree, not a gene tree.
If you merge different genes and then align them, regions at the boundaries between genes can accidentally be aligned to the next or previous, unrelated gene. Those regions are just "noise" and disturb the subsequent analysis. (If the genes are sorted according to their positions on the chromosomes, the story may be different.)
Thus, in my opinion, the procedure should not be
merge -> align
but
align -> merge
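A minimal sketch of the "align -> merge" step, assuming each gene has already been aligned on its own and that every FASTA header in those alignments is just the species name. The function names `parse_fasta` and `concatenate_alignments` are made up for illustration; species missing from a gene are padded with gaps.

```python
def parse_fasta(text):
    """Parse FASTA text into a {header: sequence} dict (first word of header)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def concatenate_alignments(alignments, species):
    """Join per-gene alignments end to end into one super-matrix.

    alignments: list of {species: aligned_sequence} dicts, one per gene.
    species: names expected in the output; a species absent from a gene
    gets a run of '-' as wide as that gene's alignment.
    """
    supermatrix = {sp: [] for sp in species}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one alignment differ in length")
        width = widths.pop()
        for sp in species:
            supermatrix[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in supermatrix.items()}


if __name__ == "__main__":
    gene1 = parse_fasta(">spA\nAC-G\n>spB\nACTG\n")
    gene2 = parse_fasta(">spA\nTT\n>spB\nT-\n")
    merged = concatenate_alignments([gene1, gene2], ["spA", "spB"])
    for sp, seq in merged.items():
        print(f">{sp}\n{seq}")
```

Because every gene keeps its own column block, no residue from one gene can end up aligned against a neighbouring, unrelated gene.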
Very good point. Merge the individual gene alignments!
Consider using ASTRAL-II to infer a species tree from gene trees of all the ortholog groups instead. It may well be faster than trying to align extremely long sequences, not to mention accuracy often suffers for very long alignments.
By doing what you had asked in the other thread (How to merge multifasta sequence into a single sequence having only one header?)? I am not sure how you can do a meaningful phylogenetic analysis by concatenating the sequences of multiple genes into a single artificial sequence for each species.
I think I've heard of people doing concatenated multifasta alignments before now. I wouldn't like to vouch for how good an idea it is, but I think it's somewhat accepted (presumably the sequences are reasonably similar anyway, as they were probably identified as orthologs with something like 70% nt identity).
That is the critical piece. Hopefully OP has done the due diligence.
One could still make a phylogeny by incorporating the species/gene_names in the headers and keeping the sequences separate. It would be an interesting way to see if the orthologs identified follow logical pattern or if there are mistakes.
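Keeping the sequences separate while encoding species and gene in the header could look roughly like this. The `species|gene` header format and the function name `tag_headers` are assumptions chosen for illustration, and the sketch expects one sequence per input (a single-copy ortholog).

```python
def tag_headers(fasta_text, species, gene):
    """Replace each FASTA header with '>species|gene' (hypothetical format).

    Intended for a single-sequence FASTA record, i.e. one single-copy
    ortholog from one species; sequence lines are passed through unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(f">{species}|{gene}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(tag_headers(">x1 some description\nACGT\n", "spA", "og1"), end="")
```

With headers like these, a tree built per ortholog group shows directly which species each tip belongs to, which makes misassigned orthologs easier to spot.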
You could try using MAFFT.
BWA is a good option
It is a read aligner; I need to perform an MSA (multiple sequence alignment) for phylogenetic analysis.
However, given the length of the sequences, I guess it would be too much for most existing tools unless they are run on a high-capacity computer cluster.