I have two protein sequences with around 50% identity between them.
I want to study the phylogenetic relationship between them.
I came up with a method myself:
Step 1: BLAST each sequence against a protein database separately (possibly with less stringent thresholds)
Step 2: extract the subject sequences that are hit by both BLAST searches
Step 3: multiple sequence alignment using the common subject sequences and the two query sequences
Step 4: build the phylogenetic tree
Could anyone comment on this method? If it is not ideal, what is the standard way of preparing homolog sequences for a phylogenetic analysis?
Thank you.
These are the typical steps. However, you'll need to work out the details yourself, e.g. which species to include, whether to manually adjust the multiple sequence alignment, and which tree-building algorithm to use.
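For what it's worth, Step 2 of the question (keeping only the subjects hit by both queries) can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: both searches were saved in BLAST tabular format (`-outfmt 6`, whose default columns put the subject ID in column 2 and the E-value in column 11), and the function names `subject_ids` and `common_subjects`, the cutoff values, and the example hit lines are all made up for illustration.

```python
def subject_ids(blast_lines, max_evalue=1e-5):
    """Collect subject IDs from BLAST tabular (-outfmt 6) lines.

    Assumes the default 12-column layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    ids = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        sseqid, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            ids.add(sseqid)
    return ids


def common_subjects(lines_a, lines_b, max_evalue=1e-5):
    """Return the subject IDs hit by both queries under the E-value cutoff."""
    return subject_ids(lines_a, max_evalue) & subject_ids(lines_b, max_evalue)


# Invented example hits, for illustration only (not real BLAST output).
hits_a = [
    "queryA\tsubj1\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-80\t250",
    "queryA\tsubj2\t48.0\t190\t99\t2\t5\t194\t3\t190\t3e-40\t150",
    "queryA\tsubj3\t30.1\t120\t84\t4\t10\t129\t8\t125\t0.5\t35.0",
]
hits_b = [
    "queryB\tsubj2\t51.0\t195\t95\t1\t2\t196\t1\t195\t2e-45\t160",
    "queryB\tsubj3\t33.0\t118\t79\t3\t12\t129\t9\t124\t1e-10\t60.0",
    "queryB\tsubj4\t70.0\t201\t60\t0\t1\t201\t1\t201\t1e-90\t280",
]

strict = common_subjects(hits_a, hits_b, max_evalue=1e-5)
relaxed = common_subjects(hits_a, hits_b, max_evalue=1.0)
print(sorted(strict), sorted(relaxed))
```

With real data you would pass two open file handles instead of the lists. Note how relaxing the cutoff changes which subjects count as "common", so the threshold is one of the details you would have to decide on.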
To add to Jean’s answer about which species to include: if you want a rooted tree, you may also want to include a less closely related species or sequence to serve as an outgroup.
For the outgroup, should it be 'out' in the sense of 1) the BLAST threshold, 2) the functional annotation of the protein, or 3) both?
It should be a more divergent sequence, which would lead to it being the outermost branch in your final tree. That is, it will form one of the two branches at the most basal bifurcation.
I see! What could we achieve if we play with the sequence alignment step?
That's too broad of a question really. You need to decide what features you’re looking for. If you wanted to examine preservation of an active site or domain, for instance, you’d want to use local alignments, but if you were interested in overall gene conservation, a global alignment would most likely be more informative.
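To make the local versus global distinction a bit more concrete, here is a rough sketch that scores the same pair of sequences both ways: global in the Needleman-Wunsch style and local in the Smith-Waterman style. The scoring values (match +1, mismatch -1, linear gap -2), the function name `align_score`, and the toy sequences are arbitrary assumptions for illustration; real protein alignments would normally use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style), both sequences end to end.
    local=True:  local (Smith-Waterman style), best-scoring pair of substrings.
    """
    cols = len(b) + 1
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


# Toy sequences: a shared "CATCAT" motif embedded in unrelated flanks.
seq1 = "WWWWCATCATWWWW"
seq2 = "KKCATCATKK"

global_score = align_score(seq1, seq2)
local_score = align_score(seq1, seq2, local=True)
print(global_score, local_score)
```

The local score only reflects the shared motif, while the global score is dragged down by the unrelated flanks. That is roughly why a conserved domain can be found locally even when the full-length sequences align poorly.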