Sorry it took so long to reply.
I should point out that this is by no means the best combination of steps, but it is the protocol I generally follow. I borrowed some descriptions from the paper I linked to in my previous post. I am also going to assume this is for protein reconstruction, as this makes describing the steps simpler.
Step 1: Collecting sequences
First, you need to collect extant homologous sequences. Generally the idea is to select sequences that provide a diverse snapshot of the protein family of interest, i.e., sequences from a range of different evolutionary lineages/domains. The level of diversity in this initial selection directly influences the possible properties of your reconstructed ancestral sequences. However, if your protein family has low sequence similarity between certain sequence groups, e.g., Fungi sequences versus Mammalia sequences, try choosing the sequences most relevant to your goal, e.g., are you more interested in the properties of the Fungi sequences or the Mammalia ones? I say this because in the alignment step you are going to remove any sequences that align badly to everything else, so this will save you some time later.
Generally you should select between 50 and 200 sequences; however, this again depends on what you are trying to reconstruct and on the sequence redundancy in your chosen set.
Step 2: Alignment
In this step you create an alignment of your sequences. There are a number of alignment tools freely available (Clustal, T-Coffee, MUSCLE, etc.); my personal favourite is the MAFFT webserver, which is useful because it provides options to trim your alignment, create trees, etc. I generally use a scoring matrix like BLOSUM80, but the right choice depends on how closely related your sequences are.
Once you create your initial alignment, you need to review its quality:
- Remove any sequences that are significantly longer or shorter than the average sequence length for the protein family. A rough threshold may be ~15% shorter or longer, but this depends on your tolerance for variation.
- Remove any sequences with too many insertions and deletions; these sequences make aligning difficult.
- You should have a good idea of which residues/motifs are highly conserved in your family of interest, i.e., probably required for a functional sequence. Go through your alignment and remove sequences that do not include these conserved regions (or at least do not include physicochemically similar residues).
- If you see an insertion in only one sequence, you can probably just remove that column from the alignment. However, some insertions and deletions can be useful for providing functional diversity in your alignment, so review these carefully.
Keep repeating this process until the alignment looks satisfactory.
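A minimal sketch of two of the checks above, in plain Python: the length filter and the removal of columns where only one sequence has a residue. The function names, the dict-of-sequences input and the use of "-" as the gap character are assumptions for illustration; they are not part of MAFFT or any other tool mentioned here.

```python
from statistics import mean


def filter_by_length(seqs, tolerance=0.15):
    """Keep sequences whose ungapped length is within +/- tolerance of the mean.

    seqs maps sequence name -> sequence string (gaps, if any, written as '-').
    """
    lengths = {name: len(s.replace("-", "")) for name, s in seqs.items()}
    avg = mean(lengths.values())
    low, high = avg * (1 - tolerance), avg * (1 + tolerance)
    return {name: s for name, s in seqs.items() if low <= lengths[name] <= high}


def drop_singleton_insertions(aligned):
    """Remove alignment columns in which only one sequence has a residue.

    aligned maps sequence name -> aligned sequence; all rows must be equal length.
    """
    names = list(aligned)
    rows = [aligned[n] for n in names]
    keep = [
        i for i in range(len(rows[0]))
        if sum(row[i] != "-" for row in rows) > 1
    ]
    return {n: "".join(row[i] for i in keep) for n, row in zip(names, rows)}
```

Note that a few very short or very long sequences will pull the mean towards them, so it is worth looking at the length distribution before trusting the threshold. The column filter is deliberately blunt, so still check the dropped insertions by eye in case one of them is functionally interesting.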
Step 3 (Optional): Sequence redundancy
This step could be performed before step 2...
Once you have your alignment, you can use a program such as CD-HIT to cluster your sequences based on their sequence similarity. Generally you want to set a similarity threshold of 90%, but try experimenting with different thresholds for the best results. This will reduce sequence bias in your final alignment. Check which sequences remain in your alignment after this process to make sure you still have the sequences you want.
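To show the idea behind redundancy filtering, here is a simple greedy filter in plain Python. It is a stand-in for illustration only, not CD-HIT and not its algorithm in detail: it works on already-aligned sequences, measures identity over columns where both sequences have a residue, and keeps a sequence only if it is below the threshold against every sequence already kept. The function names and the dict input are assumptions.

```python
def percent_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def remove_redundant(aligned, threshold=0.90):
    """Greedy redundancy filter on an alignment (name -> aligned sequence).

    Sequences are visited longest first, so the longest member of each
    group of near-identical sequences is the one that is kept.
    """
    ordered = sorted(
        aligned.items(),
        key=lambda item: -len(item[1].replace("-", "")),
    )
    kept = {}
    for name, seq in ordered:
        if all(percent_identity(seq, rep) < threshold for rep in kept.values()):
            kept[name] = seq
    return kept
```

For example, `remove_redundant(alignment, threshold=0.90)` returns a reduced dict; comparing its keys with the original tells you which sequences were dropped as redundant.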
Step 4: Phylogenetic Tree
There are a number of tools out there for creating phylogenetic trees: MrBayes, BEAST, PAUP, RAxML, MEGA. I use MEGA for simplicity and formatting reasons. Use one of these programs and create your tree with your desired method (maximum likelihood, Bayesian inference, parsimony). I generally use maximum likelihood with 50-100 bootstrap replicates (though really you should do more, like 500-1000). Bootstrapping takes a while to explain, so I would go read up on it, but basically it is a way for us to evaluate how 'robust' our tree is compared to the many possible alternatives.
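To give a feel for what bootstrapping does, here is a minimal sketch of the resampling part only, in plain Python. Each replicate is a pseudo-alignment built by sampling the original alignment columns with replacement; the tree program then builds a tree from every replicate, and the support for a clade is the fraction of replicate trees that contain it. The function names, the dict input and the fixed seed are assumptions for illustration; MEGA, RAxML and the others do this internally.

```python
import random


def bootstrap_replicate(aligned, rng):
    """Build one pseudo-alignment by resampling columns with replacement."""
    length = len(next(iter(aligned.values())))
    # The same column indices are used for every sequence, so columns stay intact.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in aligned.items()}


def bootstrap_alignments(aligned, n_replicates=100, seed=0):
    """Return a list of n_replicates resampled alignments (name -> sequence)."""
    rng = random.Random(seed)
    return [bootstrap_replicate(aligned, rng) for _ in range(n_replicates)]
```

Tree building itself is left out here on purpose; the point is only that every replicate has the same sequences and the same length as the original alignment, but a different mix of columns.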
Once you have your tree, compare it against any published trees/phylogenetic information. If you see discrepancies between your tree and a published tree, review your tree carefully for any obvious errors. You can modify your sequence alignment to fix possible errors in your tree. You can also manually edit your tree to fix any obvious errors by moving species between clades.
Step 5: Ancestral Sequence Reconstruction (ASR) with FASTML
For the final step, you simply have to submit your alignment and the corresponding tree file to the FASTML webserver. I have had some issues with sequence names in the files, so make sure every sequence header in the FASTA file is also present in the Newick tree file and that the names match exactly. Also, I suspect the webserver has issues with the symbol |, so replace this symbol with _ in all your sequence names.
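A small sketch for checking the names before submitting, in plain Python. It replaces | with _ and reports names that appear in only one of the two files. The function names are hypothetical, and it assumes two things that may not hold for your files: the sequence name is the first whitespace-delimited word of each FASTA header, and the Newick leaf names are unquoted and contain no spaces.

```python
import re


def sanitize(name):
    """Replace the pipe symbol with an underscore."""
    return name.replace("|", "_")


def fasta_names(fasta_text):
    """Collect sanitized names from the '>' header lines of a FASTA file."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(sanitize(line[1:].split()[0]))
    return names


def newick_leaf_names(newick_text):
    """Collect sanitized leaf names from a Newick string.

    A leaf label directly follows '(' or ',' and runs until ':', ',', ')' or ';'.
    Labels after ')' (internal nodes, support values) are not picked up.
    """
    return {sanitize(m) for m in re.findall(r"[(,]\s*([^(),:;\s]+)", newick_text)}


def name_mismatches(fasta_text, newick_text):
    """Return (names only in the FASTA, names only in the tree), both sorted."""
    in_fasta = fasta_names(fasta_text)
    in_tree = newick_leaf_names(newick_text)
    return sorted(in_fasta - in_tree), sorted(in_tree - in_fasta)
```

If both lists come back empty, the names agree after the | to _ replacement. Remember that the replacement still has to be applied to the actual files you upload, not just inside this check.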
Feel free to ask for specific details, I am sure I missed some things.
Hello,
I have a question: how do I calculate the difference in log-likelihood between the most likely ancestral sequence at node N1 and the 100th most likely sequence?
thanks in advance
laura