OK, I have several hundred fragments of a protein of interest (699 sequences) that I would like to align and use to build a neighbor-joining tree. In many cases these fragments do not align well to one another (they cover different regions of the same or similar proteins).
However, whole protein sequences have been defined and submitted to NCBI and other databases, and there are trees for this protein in the literature. Is there a way to take the fragments from my metagenome and align them to the known sequences, so as to define where each fragment lies on the published tree? My only solution so far is to run each sequence (or cluster of sequences) against the predefined tree (using the original whole protein sequences from the publication) to see where each fragment would fall.
My sequences are unassembled reads (I can't assemble them; they are too diverse)
Average read length is 400 bp
General protein length is around 350 aa
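For scale, here is a rough back-of-envelope sketch of how much of the protein a single read can cover. It uses only the two figures above and assumes the best case, where the whole read is coding sequence and in frame; the function names are just illustrative.

```python
# Figures quoted above, treated as rough averages.
READ_LENGTH_BP = 400
PROTEIN_LENGTH_AA = 350


def max_codons(read_length_bp: int) -> int:
    """Largest number of complete codons a read of this length can contain."""
    return read_length_bp // 3


def max_coverage_fraction(read_length_bp: int, protein_length_aa: int) -> float:
    """Upper bound on the fraction of the protein that one read can span."""
    return min(1.0, max_codons(read_length_bp) / protein_length_aa)


codons = max_codons(READ_LENGTH_BP)
fraction = max_coverage_fraction(READ_LENGTH_BP, PROTEIN_LENGTH_AA)
print(f"{codons} aa at most, about {fraction:.0%} of the protein")
# prints: 133 aa at most, about 38% of the protein
```

So even in the best case a read spans a bit over a third of the protein, which would fit with many fragment pairs having little or no overlap.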
Is there an easier way to do this?
How accurate would diversity statistics be for this protein? (I would not be adding the known protein sequences for that analysis.)
Thanks for any advice/help in advance.
PAGAN could be helpful in the alignment part. Please see http://code.google.com/p/pagan-msa/wiki/PAGAN?tm=6 and contact the author if you have any questions. The program is actively developed and recent features (e.g. translated and ORF alignment) are still undocumented.
You could try (1) "pileup alignment" (one ref. sequence) and (2) "unguided placement" (ref. alignment and tree):