Is FastML a good server for reconstructing ancestral protein sequences?
10.5 years ago
lia.elias ▴ 40

After building a multiple sequence alignment (MSA) with Clustal, can I upload the alignment in FASTA format to FastML and get a reliable ancestral protein sequence prediction?

sequence ancestor ancestral reconstruction msa • 7.7k views

Hello,

I have a question: how can I calculate the difference in log-likelihood between the most likely ancestral sequence at node N1 and the 100th most likely sequence?

thanks in advance

laura

10.5 years ago
JulianZ ▴ 70

Yes, you can use FastML to get reliable ancestral sequences. Note, however, that the steps leading up to it need to be performed very carefully, i.e., your alignment and tree need to be constructed correctly. The REAP protocol discusses the steps involved quite well (paywall).

FastML itself can be annoying to use. Occasionally the server will not work (for whatever reason) and at other times very minor formatting issues in your supplied alignment or tree will cause it to break (and the server won't always tell you what the issue is).

An alternative is CODEML in the PAML package, although it is incredibly difficult to use and the documentation is hard to interpret.

I can expand on the sequence of steps that I believe need to be taken for ASR (formatting tricks, programs to use, etc.) if required.


Thank you very much JulianZ...

I would like to get the sequence of steps if you can share them. I am a beginner in the field.

Thanks


Sorry it took so long to reply.

I should point out that this is by no means the best combination of steps, but it is the protocol I generally follow. I borrowed some descriptions from the paper I linked to in my previous post. I am also going to assume this is for protein reconstruction, as this makes describing the steps simpler.

Step 1: Collecting sequences

First, you need to collect extant homologous sequences. The idea is to select sequences that provide a diverse snapshot of the protein family of interest, i.e., sequences from a range of different evolutionary lineages/domains. The diversity of this initial selection directly influences the properties your reconstructed ancestral sequences can have. However, if your protein family has low sequence similarity between certain groups, e.g., Fungi sequences versus Mammalia sequences, choose the sequences most relevant to your goal, e.g., are you more interested in the properties of the Fungi sequences or the Mammalia ones? I say this because in the alignment step you will remove any sequences that align badly to everything else, so being selective now saves time later.

Generally you should select between 50 and 200 sequences, but again, this depends on what you are trying to reconstruct and on the sequence redundancy in your chosen set.

Step 2: Alignment

In this step you create an alignment of your sequences. There are a number of freely available alignment tools (Clustal, T-Coffee, MUSCLE, etc.), but my personal favourite is the MAFFT webserver, which also provides options to trim your alignment, build trees, etc. I generally use a scoring matrix like BLOSUM80, but the right choice depends on how closely related your sequences are.

Once you create your initial alignment, you need to review its quality:

  • Remove any sequences that are significantly longer or shorter than the average sequence length for the protein family. A rough threshold may be ~15% shorter or longer, but this depends on your tolerance for variation.
  • Remove any sequences with too many insertions and deletions, as these make aligning difficult.
  • You should have a good idea of what residues/motifs are highly conserved in your family of interest, i.e., probably required for a functional sequence. Go through your alignment and remove sequences which do not include these conserved regions (or at least do not include physicochemically similar residues).
  • If you see an insertion in only one sequence, you can probably just remove that column from the alignment. However, some insertions and deletions can be useful for providing functional diversity in your alignment, so review these carefully.

Keep repeating this process until the alignment looks satisfactory.
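The length check in the first bullet is easy to script. Here is a minimal Python sketch, assuming your sequences are available as FASTA text. The names `read_fasta` and `filter_by_length` and the `tolerance` parameter are made up for illustration, and 0.15 is just the rough ~15% threshold mentioned above, not a recommended value.

```python
from statistics import mean


def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict (insertion order kept)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def filter_by_length(records, tolerance=0.15):
    """Keep sequences whose ungapped length is within +/- tolerance of the mean length."""
    # Gap characters are ignored so the function works on aligned or unaligned input.
    lengths = {name: len(seq.replace("-", "")) for name, seq in records.items()}
    average = mean(lengths.values())
    low, high = average * (1 - tolerance), average * (1 + tolerance)
    return {name: seq for name, seq in records.items() if low <= lengths[name] <= high}
```

Note that a single very short or very long sequence shifts the mean, so you may prefer the median as the reference length if your set contains extreme outliers.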

Step 3 (Optional): Sequence redundancy

This step could be performed before step 2.

Once you have your alignment, you can use a program such as CD-HIT to cluster your sequences by sequence similarity. Generally you want a similarity threshold of 90%, but try experimenting with different thresholds for the best results. This reduces sequence bias in your final alignment. Afterwards, check which sequences remain in your alignment to make sure you still have the ones you want.
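To make the idea of a similarity threshold concrete, here is a small Python sketch of greedy redundancy filtering on an existing alignment. It is only an illustration of the principle, not CD-HIT's actual algorithm or identity definition; `identity` and `greedy_representatives` are hypothetical names, and the input is assumed to be a dict of equal-length aligned sequences with `-` as the gap character.

```python
def identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_representatives(aligned, threshold=0.9):
    """Return names of sequences kept after dropping near-duplicates.

    Sequences are visited longest first (by ungapped length). A sequence is kept
    only if its identity to every already-kept sequence is below the threshold.
    """
    order = sorted(aligned, key=lambda n: len(aligned[n].replace("-", "")), reverse=True)
    kept = []
    for name in order:
        if all(identity(aligned[name], aligned[rep]) < threshold for rep in kept):
            kept.append(name)
    return kept
```

For example, with `threshold=0.9` two sequences that are 90% identical or more end up represented by a single one of them, which is the effect you are after when removing sequence bias.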

Step 4: Phylogenetic Tree

There are a number of tools for creating phylogenetic trees: MrBayes, BEAST, PAUP, RAxML, MEGA. I use MEGA for simplicity and formatting reasons. Use one of these programs to create your tree with your desired method (maximum likelihood, Bayesian inference, parsimony). I generally use maximum likelihood with 50-100 bootstrap replicates (though really you should do more, like 500-1000). Bootstrapping takes a while to explain, so I would go read up on it, but basically it is a way to evaluate how 'robust' the tree is compared to the many possible alternatives.
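As a rough illustration of what one bootstrap replicate is: the alignment columns are resampled with replacement to make a pseudo-alignment of the same width, a tree is built from it, and the support for a clade is the fraction of replicate trees that contain it. The Python sketch below only shows the resampling part; `bootstrap_alignment` is a hypothetical helper, and tools such as MEGA or RAxML do all of this for you internally.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement.

    aligned: {name: aligned_sequence}, all sequences of equal length.
    rng: a random.Random instance, so replicates are reproducible.
    Every sequence receives the same resampled columns, so rows stay aligned.
    """
    names = list(aligned)
    n_cols = len(aligned[names[0]])
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(aligned[name][c] for c in columns) for name in names}
```

Calling `bootstrap_alignment(aligned, random.Random(seed))` once per replicate with a different seed gives the set of pseudo-alignments from which the replicate trees would be built.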

Once you have your tree, compare it against any published trees/phylogenetic information. If you see discrepancies between your tree and a published tree, review your tree carefully for any obvious errors. You can modify your sequence alignment to fix possible errors in your tree. You can also manually edit your tree to fix any obvious errors by moving species between clades.

Step 5: Ancestral Sequence Reconstruction (ASR) with FastML

For the final step, you simply submit your alignment and the corresponding tree file to the FastML webserver. I have had some issues with sequence names in the files, so make sure every sequence header in the FASTA file appears in the Newick tree file and matches it exactly. Also, I suspect the webserver has issues with the symbol |, so replace it with _ in all your sequence names.
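The renaming and the name check can be scripted before uploading. This is a minimal Python sketch under a few assumptions: the whole FASTA header line is used as the sequence name, tree labels are unquoted, and all function names (`sanitize_fasta`, `sanitize_newick`, `fasta_names`, `newick_leaf_names`, `compare_names`) are hypothetical.

```python
import re


def sanitize_fasta(fasta_text):
    """Replace | with _ in FASTA header lines only."""
    return "\n".join(
        line.replace("|", "_") if line.startswith(">") else line
        for line in fasta_text.split("\n")
    )


def sanitize_newick(newick):
    """Replace | with _ everywhere in a Newick string."""
    return newick.replace("|", "_")


def fasta_names(fasta_text):
    """Set of sequence names, taken as the full header line without the leading '>'."""
    return {line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")}


def newick_leaf_names(newick):
    """Set of leaf labels: unquoted labels that directly follow '(' or ','."""
    labels = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return {label.strip() for label in labels if label.strip()}


def compare_names(fasta_text, newick):
    """Return (names only in the FASTA file, names only in the tree)."""
    in_fasta = fasta_names(fasta_text)
    in_tree = newick_leaf_names(newick)
    return in_fasta - in_tree, in_tree - in_fasta
```

If both sets returned by `compare_names` are empty after sanitizing, the names in the two files agree.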

Feel free to ask for specific details, I am sure I missed some things.


Thank you very much JulianZ. I have a question about the output we receive from FastML. How do we know which node is N1, N2, etc.? That is, how are the nodes numbered? Could you please clarify?


N1 is the oldest ancestor. After that, all other internal nodes of the phylogenetic tree are labelled in order: N2, N3, etc. N2 and N3, for example, would be children of N1. Extant sequences should keep their original labels. I believe FastML provides a Newick tree with the ancestor labels included in one of the output files, so plotting that should give you a more 'visual' idea of how the nodes are labelled.
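If you would rather read the labelling off the tree file than plot it, a labelled Newick string can be turned into a child-to-parent table. The Python sketch below is a minimal parser assuming unquoted labels and a single tree per string. `parent_map` is a hypothetical helper, and the example tree is made up, not real FastML output.

```python
def parent_map(newick):
    """Return {node_label: parent_label} for a Newick string with labelled internal nodes.

    Handles unquoted labels and optional branch lengths (":0.1"); the root has no entry.
    """
    parents = {}
    stack = []        # one list of child labels per currently open "("
    pending = None    # children of the clade that was just closed, awaiting its label
    label = ""
    in_length = False
    for ch in newick:
        if ch == "(":
            stack.append([])
        elif ch in ",);":
            # A node ends here: record it as parent of any pending children.
            name = label.strip()
            if pending is not None:
                for child in pending:
                    parents[child] = name
            if stack:
                stack[-1].append(name)
            label, pending, in_length = "", None, False
            if ch == ")":
                pending = stack.pop()
        elif ch == ":":
            in_length = True
        elif not in_length:
            label += ch
    return parents


# Made-up example tree with ancestor labels in the style described above.
example = "((A:0.1,B:0.2)N2:0.3,(C:0.1,D:0.1)N3:0.2)N1;"
relations = parent_map(example)
```

In this example `relations` maps A and B to N2, C and D to N3, and both N2 and N3 to N1, so N1 is the root.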


Thank you very much JulianZ. Your inputs gave me an overall idea of ASR. I have tried it out for my sequences. I will get back if I have any more queries.

Thank you again


Hi

I have a few more questions. Which FastML output should we use: the joint reconstruction file, or the marginal reconstruction with/without indels?

One more thing: I used an outgroup while making the tree. Will this interfere with the accuracy of the ASR?

Please answer...


Hmm, good question. It depends on your ultimate goal, but typically you will just use the joint reconstruction (reading up on the theory behind joint versus marginal reconstruction will make this clearer). The marginal reconstruction is useful if you want to compare against the joint one and get a sense of which positions were difficult for FastML to infer (at least I think so?), e.g., a position in an ancestor could be 50% likely to be a 'G' and 50% likely to be a 'C', rather than 90% likely to be one of them.
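To show what "difficult positions" means in practice, here is a small Python sketch that flags positions whose best marginal probability falls below a cutoff. The input layout (one dict of character probabilities per position) is an assumption for illustration and not FastML's actual output file format, so you would need to parse the marginal probability file into this shape first. `ambiguous_positions` and the 0.8 cutoff are likewise made up.

```python
def ambiguous_positions(marginal, cutoff=0.8):
    """Return (position, best_character, probability) for poorly resolved positions.

    marginal: list with one {character: posterior_probability} dict per position.
    Positions are numbered from 1. On ties, the first character listed wins.
    """
    flagged = []
    for position, probs in enumerate(marginal, start=1):
        best, prob = max(probs.items(), key=lambda item: item[1])
        if prob < cutoff:
            flagged.append((position, best, prob))
    return flagged


# Position 1 mirrors the 50/50 case described above; position 2 is well resolved.
example = [{"G": 0.5, "C": 0.5}, {"A": 0.9, "T": 0.1}]
flagged = ambiguous_positions(example)
```

Here only position 1 is flagged, which is the kind of site where you might want to test both alternatives rather than trust a single reconstructed character.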

And yes, you are meant to use an outgroup (depending on what you are reconstructing). I forgot to mention that before.


Thank you very much.
